Audisto Encoding Checker

How to detect encoding issues

Wrong encoding can cause several kinds of issues with user experience as well as with the presence of a site in the search results.

With the encoding hints, you can identify encoding-related problems, such as problems with the charset or with special characters. The Audisto crawler checks for typical encoding issues that are worth checking on a regular basis.

Example: Audisto Encoding Check with the Encoding hint reports for the current crawl


You may also see our guides section for more information on non-printable characters and control characters.

Here is a list of all specific hints related to encoding on your website that can be identified with the help of the Audisto Crawler.


Hints

<html> contains too many uncommon non-printable characters

Description

The HTML document contains too many uncommon non-printable characters, so not all of them will be shown in the live analysis. With this report you can discover all URLs on the crawled website that contain more than 50 uncommon non-printable characters. See the corresponding hint "<html> contains uncommon non-printable characters" for further information about what counts as "uncommon".

Importance

Non-printable characters are used as control characters and may not be visible in the source code, but nonetheless impact the behaviour of the site. This might affect crawling and user experience when they are inside of an anchor's href or an image's src attribute, possibly resulting in issues with the site's structure and ranking.

Finding too many non-printable characters may hint at massive encoding issues in a document, or at documents that are not HTML documents.

Operating Instruction

Non-printable characters should generally be removed whenever possible and otherwise be encoded as HTML entities. If an application validates transferred data, the validation should check for non-printable characters and, in most cases, remove them.

<html> contains uncommon non-printable characters

Description

If uncommon non-printable characters are detected, the URL of the document containing the character will be flagged.

There are non-printable characters that appear in almost every document, e.g. line feed (\n), carriage return (\r) and horizontal tab (\t). In addition, there are commonly used non-printable characters, e.g. BOM, soft hyphen, Left-To-Right-Mark and Right-To-Left-Mark. These characters will not cause the URL to be flagged with this hint. This hint detects all remaining non-printable characters.
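The distinction between common and uncommon non-printable characters can be sketched in Python. This is a minimal illustration, not the crawler's actual implementation: the function name is hypothetical, and treating the Unicode general categories Cc (control) and Cf (format) as "non-printable" is an assumption.

```python
import unicodedata

# Characters the description above lists as common; they are not reported.
COMMON_NON_PRINTABLE = {
    "\n", "\r", "\t",    # line feed, carriage return, horizontal tab
    "\ufeff",            # byte order mark
    "\u00ad",            # soft hyphen
    "\u200e", "\u200f",  # Left-To-Right-Mark, Right-To-Left-Mark
}


def find_uncommon_non_printable(text):
    """Return (index, code point) pairs for uncommon non-printable characters.

    Assumption: a character is non-printable if its Unicode general
    category is Cc (control) or Cf (format).
    """
    hits = []
    for index, char in enumerate(text):
        if char in COMMON_NON_PRINTABLE:
            continue
        if unicodedata.category(char) in ("Cc", "Cf"):
            hits.append((index, f"U+{ord(char):04X}"))
    return hits
```

Calling `find_uncommon_non_printable` on a decoded document returns an empty list when only the common characters are present.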

Examples

Because these characters are not printable, the live analysis shows their character codes instead of the actual characters, enclosed in brackets.

[[&#xEFBBBF;]]

Importance

Non-printable characters may not be visible in the source code, but nonetheless impact

  • the behaviour of the site, e.g. when they are inside of an anchor's href or an image's src attribute
  • the ranking of the site, e.g. when they are an invisible part of a word.

This might affect crawling and user experience, possibly resulting in issues with accessibility and ranking.

Usually this hint is triggered by messed up encoding.

Operating Instruction

Non-printable characters should generally be removed whenever possible and otherwise be encoded as HTML entities. If an application validates transferred data, the validation should check for non-printable characters and, in most cases, remove them.
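A validation step of this kind might look like the following sketch. The function name is hypothetical, and the rule it applies (drop every character in Unicode category Cc or Cf except line feed, carriage return and tab) is an assumption. It is deliberately aggressive and best suited to values such as URLs or attribute contents; running text in scripts that rely on zero-width joiners would need a narrower rule.

```python
import unicodedata

# Whitespace controls that are expected in almost every document.
KEEP = {"\n", "\r", "\t"}


def strip_non_printable(value):
    """Remove control (Cc) and format (Cf) characters from a string.

    Assumption: anything in category Cc or Cf that is not ordinary
    whitespace is unwanted in the validated value.
    """
    return "".join(
        char
        for char in value
        if char in KEEP or unicodedata.category(char) not in ("Cc", "Cf")
    )
```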

<html> contains unencoded Left-To-Right-Mark or Right-To-Left-Mark

Description

An unescaped Left-To-Right-Mark or Right-To-Left-Mark was found. With this report you can discover all URLs that contain an unescaped Left-To-Right-Mark or an unescaped Right-To-Left-Mark.

Examples

Character Name     | Detected Character | HTML Entity (named) | HTML Entity (decimal) | HTML Entity (hex)
Left-To-Right-Mark | U+200E             | &lrm;               | &#8206;               | &#x200e;
Right-To-Left-Mark | U+200F             | &rlm;               | &#8207;               | &#x200f;

Importance

The Left-To-Right-Mark and the Right-To-Left-Mark are non-printable characters used for typesetting bidirectional text. Neither mark is visible, and if used without being properly escaped, they may lead to a range of unexpected problems that are hard to track down because the characters are invisible:

  • Issues with the appearance of the website
  • Issues with the characters ending up in a URL

Operating Instruction

If unencoded Left-To-Right- or Right-To-Left-Marks are discovered, then whenever possible:

  • escape them, or
  • remove them completely and, if the bidirectional functionality is still required, switch to a CSS solution.

If escaping the characters, you should prefer the named HTML entities (&lrm; and &rlm;) over decimal or hex HTML entities.
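Escaping the marks with their named entities can be sketched as follows. The function name is hypothetical, and the sketch assumes the input is HTML text content; it should not be applied to URLs or attribute values, where the marks are better removed.

```python
# Map each raw direction mark to its named HTML entity.
DIRECTION_MARK_ENTITIES = {
    "\u200e": "&lrm;",  # Left-To-Right-Mark
    "\u200f": "&rlm;",  # Right-To-Left-Mark
}

_TRANSLATION = str.maketrans(DIRECTION_MARK_ENTITIES)


def escape_direction_marks(html_text):
    """Replace raw direction marks with their named HTML entities."""
    return html_text.translate(_TRANSLATION)
```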

<html> contains unencoded soft hyphen (SHY)

Description

If an unescaped soft hyphen was found, the URL is flagged with this hint. With this report you can discover all URLs on the crawled website that contain an unencoded soft hyphen.

Examples

Character Name | Detected Character | HTML Entity (named) | HTML Entity (decimal) | HTML Entity (hex)
soft hyphen    | U+00AD             | &shy;               | &#173;                | &#xad;
Importance

The soft hyphen is a character that is used for the hyphenation of words. It is only visible if the word needs to be hyphenated at a line break. When the character is left unencoded, that characteristic can lead to

  • unexpected hyphenation
  • hard-to-track-down issues with the appearance of the website
  • the character ending up in a URL

Operating Instruction

If unencoded soft hyphens are discovered, escape them or remove them completely if they are not required.

If escaping the characters, you should prefer the named HTML entity (&shy;) over the decimal or hex HTML entity.
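Both options, escaping and removing, fit in one small helper. This is a sketch with a hypothetical function name and parameter; it assumes the caller knows whether the hyphenation opportunity is still wanted.

```python
SOFT_HYPHEN = "\u00ad"


def handle_soft_hyphens(text, keep_hyphenation=True):
    """Escape raw soft hyphens as &shy;, or drop them entirely.

    keep_hyphenation=True  -> replace with the named entity
    keep_hyphenation=False -> remove the character
    """
    replacement = "&shy;" if keep_hyphenation else ""
    return text.replace(SOFT_HYPHEN, replacement)
```

For values that end up in URLs, calling it with `keep_hyphenation=False` is the safer choice, since an entity has no meaning there.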

<html> starts with BOM

Description

There is a Unicode byte order mark (BOM) at the top of the HTML. With this report you can discover all URLs on the crawled website that contain a BOM.

We currently detect the BOM in the following encodings:

  • UTF-8
  • UTF-16 BE/LE
  • UTF-32 BE/LE
  • UTF-7
  • UTF-1
  • UTF-EBCDIC
  • SCSU
  • BOCU-1
  • GB-18030
Examples

Example UTF-8 BOM in HTML 5

EF BB BF<!DOCTYPE html>
<html lang="en">

How the BOM looks in different encodings and representations:

Encoding    | BOM hex        | BOM dec
UTF-8       | EF BB BF       | 239 187 191
UTF-16 (BE) | FE FF          | 254 255
UTF-16 (LE) | FF FE          | 255 254
UTF-32 (BE) | 00 00 FE FF    | 0 0 254 255
UTF-32 (LE) | FF FE 00 00    | 255 254 0 0
UTF-7       | 2B 2F 76 38    | 43 47 118 56
            | 2B 2F 76 39    | 43 47 118 57
            | 2B 2F 76 2B    | 43 47 118 43
            | 2B 2F 76 2F    | 43 47 118 47
            | 2B 2F 76 38 2D | 43 47 118 56 45
UTF-1       | F7 64 4C       | 247 100 76
UTF-EBCDIC  | DD 73 66 73    | 221 115 102 115
SCSU        | 0E FE FF       | 14 254 255
BOCU-1      | FB EE 28       | 251 238 40
GB-18030    | 84 31 95 33    | 132 49 149 51
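The byte sequences in the table above can be turned into a simple prefix check. The sketch below is an illustration rather than the crawler's detection logic, and the function name is hypothetical. Signatures are tried longest first, so that the UTF-32 (LE) signature FF FE 00 00 is not mistaken for the UTF-16 (LE) signature FF FE.

```python
# (encoding name, BOM bytes) pairs, taken from the table above.
BOM_SIGNATURES = [
    ("UTF-8", bytes.fromhex("EFBBBF")),
    ("UTF-16 (BE)", bytes.fromhex("FEFF")),
    ("UTF-16 (LE)", bytes.fromhex("FFFE")),
    ("UTF-32 (BE)", bytes.fromhex("0000FEFF")),
    ("UTF-32 (LE)", bytes.fromhex("FFFE0000")),
    ("UTF-7", bytes.fromhex("2B2F7638")),
    ("UTF-7", bytes.fromhex("2B2F7639")),
    ("UTF-7", bytes.fromhex("2B2F762B")),
    ("UTF-7", bytes.fromhex("2B2F762F")),
    ("UTF-7", bytes.fromhex("2B2F76382D")),
    ("UTF-1", bytes.fromhex("F7644C")),
    ("UTF-EBCDIC", bytes.fromhex("DD736673")),
    ("SCSU", bytes.fromhex("0EFEFF")),
    ("BOCU-1", bytes.fromhex("FBEE28")),
    ("GB-18030", bytes.fromhex("84319533")),
]

# Longest signatures first, so a short BOM never shadows a longer one.
_BY_LENGTH = sorted(BOM_SIGNATURES, key=lambda item: len(item[1]), reverse=True)


def detect_bom(data):
    """Return the encoding name whose BOM starts `data`, or None."""
    for name, signature in _BY_LENGTH:
        if data.startswith(signature):
            return name
    return None
```

The function expects the raw, undecoded response body, since the BOM is only reliably visible at the byte level.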
Importance

The Unicode byte order mark is the Unicode character U+FEFF. Some text editors add it to documents. The BOM is used to signal:

  • the byte order, or endianness
  • the fact that the text is Unicode
  • the specific Unicode encoding

Having a Unicode byte order mark at the top of the HTML is valid, but might result in problems with third-party software. As of HTML5, a BOM is supposed to override the charset definition from the HTTP header. If the BOM is used with charsets that are not Unicode, this might lead to encoding problems. Encoding problems may affect how the site appears in browsers and search engines and therefore harm the user experience.

Operating Instruction

You should consider removing the BOM and specifying the encoding in the HTTP header or as a meta tag in the HTML <head>.
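Removing a UTF-8 BOM from a document before it is served could look like this. The sketch only handles the UTF-8 case, and the function name is hypothetical; `codecs.BOM_UTF8` is the standard library constant for the bytes EF BB BF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte order mark."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When reading text rather than bytes, decoding with the `utf-8-sig` codec has the same effect, because that codec skips a leading BOM if one is present.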

Charset: Charset set in HTTP Content-Type header and in document differ.

Description

Both the document and the HTTP Content-Type header specify a charset, but these are not identical. With this report you can discover all occurrences of conflicting duplicate charset definitions on the crawled website.

Examples

HTTP header

HTTP/1.1 200 OK
Server: Apache
Date: Thu, 17 Dec 2015 15:34:23 GMT
Content-Type: text/html; charset=UTF-8
...

meta charset (HTML 5)

<meta charset="iso-8859-1">

meta http-equiv (HTML 4.01)

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

XML

<?xml version="1.0" encoding="iso-8859-1"?>

Importance

If charset definitions in the HTTP header and the document differ, the browser has to use a heuristic to guess the correct charset to display the document. This may lead to problems handling the encoding of the document and slow down the rendering time for the document.

Note: There are multiple ways to specify the charset in the document that may cause the conflict, e.g. <?xml>, <meta charset> and <meta http-equiv>.

Operating Instruction

We suggest setting a proper charset in the HTTP header and in the document, so that web clients can render the document quickly and as expected. Make sure the defined charsets are identical and do not conflict.
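A check for conflicting definitions can be sketched with the Python standard library. All function names here are hypothetical, the regular expression only covers the two <meta> forms shown in the examples (not <?xml>), and using Python's codec registry to treat aliases such as "utf8" and "UTF-8" as equal is an assumption rather than the crawler's rule.

```python
import codecs
import re
from email.message import Message

# Matches <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
META_CHARSET_RE = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def header_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


def document_charset(html_bytes):
    """Extract the charset from a <meta> tag near the start of the document."""
    match = META_CHARSET_RE.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def normalize(name):
    """Map charset aliases to one canonical name where Python knows them."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def charsets_conflict(content_type, html_bytes):
    """True if both header and document declare a charset and they differ."""
    from_header = header_charset(content_type)
    from_document = document_charset(html_bytes)
    if from_header is None or from_document is None:
        return False
    return normalize(from_header) != normalize(from_document)
```

With the header and <meta> values from the examples above (UTF-8 versus iso-8859-1), `charsets_conflict` returns True.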

Charset: Invalid charset in Content-Type HTTP header

Description

The Content-Type HTTP header specifies an invalid charset. With this report you can discover all occurrences of invalid charset definitions in Content-Type HTTP headers on the crawled website.

Examples

HTTP/1.1 200 OK
Server: Apache
Date: Thu, 17 Dec 2015 15:34:23 GMT
Content-Type: text/html; charset=foo-bar
...

Importance

If there is no valid charset defined in the HTTP header, the browser has to use the charset specified in the document or has to fall back to detect the charset to display the document. If the charset has to be guessed, this may lead to problems handling the encoding of the document. Additionally, this may slow down the rendering time for the document.

Operating Instruction

We suggest setting a proper charset in the HTTP header and in the document, so that web clients can render the document quickly and as expected. Make sure the defined charsets are identical and do not conflict.
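Whether a header carries a usable charset can be checked roughly as follows. The function name and its three return values are hypothetical, and Python's codec registry is used as a stand-in for the list of valid charset names; it is not identical to the list browsers or the crawler accept.

```python
import codecs
from email.message import Message


def classify_header_charset(content_type):
    """Return 'not set', 'invalid' or 'valid' for a Content-Type value."""
    message = Message()
    message["Content-Type"] = content_type
    charset = message.get_content_charset()
    if charset is None:
        return "not set"
    try:
        codecs.lookup(charset)  # raises LookupError for unknown names
    except LookupError:
        return "invalid"
    return "valid"
```

For the header from the example above, `classify_header_charset("text/html; charset=foo-bar")` returns "invalid".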

Charset: Not set

Description

No charset is set, neither in the Content-Type HTTP header nor in the document, e.g. through a <meta> tag.

Importance

If there is no charset defined in the HTTP header or in the document, the browser has to fall back to detecting the charset in order to display the document. If the charset has to be guessed, this may lead to problems handling the encoding of the document. Additionally, this may slow down the rendering time for the document.

Operating Instruction

We suggest setting a proper charset in the HTTP header and in the document, so that web clients can render the document quickly and as expected. Make sure the defined charsets are identical and do not conflict.

Charset: Not set in Content-Type HTTP header

Description

The Content-Type HTTP header does not specify a charset. With this report you can discover all URLs on the crawled website where the HTTP Content-Type header did not specify a charset. However, there may be a charset defined in the document.

Importance

If there is no charset defined in the HTTP header, the browser has to use the charset specified in the document or has to fall back to detect the charset to display the document. If the charset has to be guessed, this may lead to problems handling the encoding of the document. Additionally, this may slow down the rendering time for the document.

Operating Instruction

We suggest setting a proper charset in the HTTP header and in the document, so that web clients can render the document quickly and as expected. Make sure the defined charsets are identical and do not conflict.
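How the charset is added to the header depends on the server or framework in use. As one possible illustration, a minimal WSGI application in Python could declare the same charset in the header and in the document like this; the application and its page content are hypothetical.

```python
def application(environ, start_response):
    """Minimal WSGI app that declares UTF-8 in both header and document."""
    html = (
        "<!DOCTYPE html>\n"
        '<html lang="en"><head><meta charset="utf-8">'
        "<title>Example</title></head>"
        "<body>Gr\u00fc\u00dfe</body></html>"
    )
    # Encode the body with the same charset that is declared below.
    body = html.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The important point is that the bytes sent, the header and the <meta> tag all agree on one encoding.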

Charset: Not set in document

Description

There is no charset set in the document, e.g. through a <meta> tag. With this report you can discover all URLs on the crawled website that do not define a charset in the document. However, there may be a charset defined in the HTTP header.

Importance

If there is no charset defined in the document, the browser has to use the charset in the HTTP header or has to fall back to detect the charset to display the document. If the charset has to be guessed, this may lead to problems handling the encoding of the document. Additionally, this may slow down the rendering time for the document.

Operating Instruction

We suggest setting a proper charset in the HTTP header and in the document, so that web clients can render the document quickly and as expected. Make sure the defined charsets are identical and do not conflict.