Deal with non printable characters and control characters
Non printable characters are a nightmare! This article shows how to find, show, and fix problems caused by some non printable characters within websites.
What are non printable characters?
Most webpages use the non printable characters for horizontal tab, line feed and carriage return. All of those tend to have a visual impact within the source code and are quite easy to detect and to fix if you have any problems.
The hard part starts when you have to deal with non printable characters and control characters that have no visual impact at all, such as null, backspace, escape and bell. In addition, there are non printable characters that only have a visual impact under specific circumstances. The soft hyphen / shy-character only has a visual impact when a word is broken across lines. The Left-To-Right-Mark and the Right-To-Left-Mark only have a visual impact if they are used within text that is written in the opposite direction.
What problems can non printable characters cause?
A customer had some articles that contained non printable characters and control characters, and these characters found their way into various elements of his website. Invisible as they are, they even made it past the function that generates human readable URLs and caused a lot of trouble within the link structure of the site. The results were obvious in his crawl report, but finding the source was difficult because nothing was visible.
Another customer used data from an external service and ended up with non printable characters within his data as a result of encoding problems. The non printable characters were everywhere, even within some words. Search engines handled some of them as word boundaries, and tests showed that those words could not be found in the search results.
Detect and show non printable characters and control characters
To detect these problems we developed the following hints:
- <html> contains non-printable characters
- <html> contains too many non-printable characters
- <html> contains unencoded soft hyphen (SHY)
- <html> contains unencoded Left-To-Right-Mark or Right-To-Left-Mark
The general hints are triggered whenever a page contains non printable characters other than horizontal tab, line feed, carriage return, the soft hyphen, the Left-To-Right-Mark or the Right-To-Left-Mark.
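The rule above can be sketched in a few lines of Python. This is only an illustration and not the code behind our hints: the function names, the use of the Unicode categories Cc (control) and Cf (format) as the definition of "non printable", and the threshold of 10 for "too many" are all assumptions.

```python
import unicodedata

# Characters that are common in HTML source and are not reported.
WHITESPACE_CONTROLS = {"\t", "\n", "\r"}
SOFT_HYPHEN = "\u00ad"
DIRECTION_MARKS = {"\u200e", "\u200f"}  # Left-To-Right-Mark, Right-To-Left-Mark


def is_non_printable(ch: str) -> bool:
    """Assumption: control (Cc) and format (Cf) characters count as non printable."""
    return unicodedata.category(ch) in ("Cc", "Cf")


def triggered_hints(html: str, too_many: int = 10) -> list[str]:
    """Return the hints a page would trigger. `too_many` is a made-up threshold."""
    general = 0
    has_shy = False
    has_mark = False
    for ch in html:
        if ch == SOFT_HYPHEN:
            has_shy = True
        elif ch in DIRECTION_MARKS:
            has_mark = True
        elif ch not in WHITESPACE_CONTROLS and is_non_printable(ch):
            general += 1

    hints = []
    if general > 0:
        hints.append("contains non-printable characters")
    if general > too_many:
        hints.append("contains too many non-printable characters")
    if has_shy:
        hints.append("contains unencoded soft hyphen (SHY)")
    if has_mark:
        hints.append("contains unencoded Left-To-Right-Mark or Right-To-Left-Mark")
    return hints
```

Tabs, line feeds and carriage returns never trigger a hint in this sketch, while the soft hyphen and the direction marks only trigger their own specific hints.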
When you access a page report you can perform a live analysis of the specific page. Within the live analysis we extract all sections of the page that contain non printable characters and show the character code of the non printable character or control character as HTML escaped hexadecimal digits to make the character visible.
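If you want to reproduce this kind of output on your own data, a minimal sketch could look like the following. The helper names `make_visible` and `find_sections` and the context width are hypothetical, and the definition of "non printable" (Unicode categories Cc and Cf, minus tab, line feed and carriage return) is an assumption rather than the exact rule used by the live analysis.

```python
import unicodedata

KEEP = {"\t", "\n", "\r"}  # ordinary whitespace controls stay untouched


def is_hidden(ch: str) -> bool:
    """Assumption: control (Cc) and format (Cf) characters are 'non printable'."""
    return ch not in KEEP and unicodedata.category(ch) in ("Cc", "Cf")


def make_visible(text: str) -> str:
    """Replace each hidden character with a readable hexadecimal escape."""
    return "".join(
        f"&#x{ord(ch):02X};" if is_hidden(ch) else ch
        for ch in text
    )


def find_sections(text: str, context: int = 15) -> list[str]:
    """Return a short snippet around every hidden character, made visible."""
    sections = []
    for i, ch in enumerate(text):
        if is_hidden(ch):
            start = max(0, i - context)
            sections.append(make_visible(text[start:i + context + 1]))
    return sections


print(find_sections("read\x00able-url and co\u00adoperate", context=4))
```

When the snippets are shown inside an HTML page, they need to be HTML escaped once more (for example with `html.escape`) so that the browser displays the escape sequence instead of interpreting it.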
Soft hyphen / Shy-character
In addition to the general hints we also have a specific hint for the soft hyphen (ISO 8859: 0xAD, Unicode U+00AD) because it is used quite often. Using the soft hyphen should not be a problem. However, you might want to encode it as &shy; or &#173;, or even remove the soft hyphen characters in specific sections.
Left-To-Right-Mark and Right-To-Left-Mark
We also have a hint for pages that contain a Left-To-Right-Mark (Unicode U+200E) or a Right-To-Left-Mark (Unicode U+200F). Using the Left-To-Right-Mark or the Right-To-Left-Mark should not be a problem. However, you might want to encode the Left-To-Right-Mark as &lrm; or &#8206; and the Right-To-Left-Mark as &rlm; or &#8207; to make the marks visible within your HTML code.
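As a rough sketch, the soft hyphen and both direction marks can be encoded in one pass with a translation table. The function name `encode_invisible_marks` is hypothetical, and the example assumes that you apply it to text content only, not to a complete HTML document that may contain scripts or styles.

```python
# Raw character -> named HTML entity
ENTITY_MAP = {
    "\u00ad": "&shy;",  # soft hyphen
    "\u200e": "&lrm;",  # Left-To-Right-Mark
    "\u200f": "&rlm;",  # Right-To-Left-Mark
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(ENTITY_MAP)


def encode_invisible_marks(text: str) -> str:
    """Replace raw soft hyphens and direction marks with their named entities."""
    return text.translate(_TABLE)


print(encode_invisible_marks("co\u00adoperate"))  # prints co&shy;operate
```

The browser renders the encoded version exactly like the raw characters, but the marks are now visible to anyone who reads or edits the source.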
Fix problems caused by non printable characters
Once you've identified the section of your page that contains the non printable characters, you should be able to encode or remove them. Afterwards you can perform a new crawl or a new live analysis to check the results.
If your software allows user input or imports data from external data sources, you might want to implement code that checks for non printable characters before storing the input.
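Such a check could look like the sketch below. The function name `clean_input` is hypothetical, and the rule it applies (drop every character in the Unicode categories Cc and Cf except tab, line feed and carriage return) is an assumption that you may need to adapt to your content.

```python
import unicodedata

KEEP = {"\t", "\n", "\r"}  # whitespace controls that are allowed to stay


def clean_input(value: str) -> str:
    """Drop control (Cc) and format (Cf) characters before the value is stored."""
    return "".join(
        ch for ch in value
        if ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )


raw = "He\x00llo\u200b wor\x08ld\n"
print(repr(clean_input(raw)))  # prints 'Hello world\n'
```

Be careful with this rule if you store text in languages or emoji sequences that rely on format characters such as the zero width joiner, because removing them changes how the text is rendered. In that case, keep those characters or encode them instead of stripping them.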