A URL generally points to a web page, an image, a CSS file, or anything else on your site. All these different kinds of content are called resources.
From time to time, two URLs point to the same resource. For example, www.domain.com/ and www.domain.com/index.html often both identify a domain's main page, resulting in duplicate content.
While duplicate content is sometimes created accidentally, it often arises simply because the same URL can be expressed in more than one way. For example, http://www.example.com/path and http://www.example.com:80/path always refer to the same resource.
This is where URL normalization comes in. It defines a set of rules for transforming a URL into a canonical form, which makes URLs easy to compare. URL normalization is described in RFC 3986. Audisto implements these rules, and some of them can be turned on or off.
Project, crawl and monitor settings offer two URL normalization modes: None and Full.
If URL normalization is set to None, only the most basic normalization rules are applied:
Within a URL, /./ and /../ have special meanings, because they refer to the current or parent directory. Audisto correctly resolves these path elements.
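Resolving these path elements can be sketched as follows. This is a minimal illustration in the spirit of RFC 3986, section 5.2.4, not Audisto's actual implementation; the function name remove_dot_segments is hypothetical, and the sketch assumes an absolute path that starts with a slash.

```python
def remove_dot_segments(path: str) -> str:
    """Resolve '.' and '..' segments in an absolute URL path."""
    segments = path.split("/")
    resolved = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # The current directory adds nothing; keep a trailing slash if it was last.
            if is_last:
                resolved.append("")
        elif segment == "..":
            # Go up one level, but never above the root (the leading empty segment).
            if len(resolved) > 1:
                resolved.pop()
            if is_last:
                resolved.append("")
        else:
            resolved.append(segment)
    return "/".join(resolved)


print(remove_dot_segments("/a/b/../c/./d"))  # /a/c/d
print(remove_dot_segments("/../a"))          # /a
```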
Within a URL, only a small set of characters is allowed, such as letters, digits and some special characters like an underscore or a hyphen. Everything else has to be encoded using a percent sign followed by a two-digit hexadecimal code. This also holds for a space, which becomes %20 in the path. Audisto does this automatically.
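As a rough illustration of this encoding step, Python's standard urllib.parse.quote can be used. The wrapper name encode_path is hypothetical, and treating % as safe (so that existing escapes are not encoded twice) is an assumption of this sketch, not a documented Audisto behavior.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode characters that are not allowed in a URL path.

    '/' is kept as the segment separator and '%' is kept so that
    sequences that are already encoded are left untouched.
    """
    return quote(path, safe="/%")


print(encode_path("/my file.html"))    # /my%20file.html
print(encode_path("/über uns.html"))   # /%C3%BCber%20uns.html
```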
Both the HTTP and the HTTPS protocol have a default port assigned, which is 80 in the case of HTTP and 443 in the case of HTTPS. Audisto strips these ports if they are part of a URL.
Audisto always sets / as the path for the HTTP and HTTPS protocols if no path was given.
Audisto always converts the protocol and the host of a URL to lower case.
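The port, path and lower-casing rules above could be combined roughly like this. It is only a sketch built on Python's urllib.parse; the names normalize_basic and DEFAULT_PORTS are hypothetical, and user info and IPv6 hosts are ignored for brevity.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_basic(url: str) -> str:
    """Strip default ports, add a missing path, lower-case protocol and host."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit returns the host name in lower case
    netloc = host
    # Keep the port only if it is not the default port of the protocol.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path
    # An empty path becomes "/" for HTTP and HTTPS.
    if not path and scheme in DEFAULT_PORTS:
        path = "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize_basic("HTTP://WWW.Example.COM:80"))   # http://www.example.com/
print(normalize_basic("https://example.com:8443/a"))  # https://example.com:8443/a
```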
If URL normalization is set to Full, many more rules are used to normalize URLs. Some of these rules change the semantics of a URL, so that the normalized URL is - in a strict sense - not equivalent to its source. However, given a moderately sane backend implementation, in real life the normalized URLs can still be considered equal.
Characters that are allowed within a URL, such as letters and digits, should not be percent-encoded at all. Audisto decodes them where they appear in encoded form.
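RFC 3986 treats a percent-encoded letter, digit, hyphen, period, underscore or tilde as equivalent to the plain character. Assuming the rule above refers to this equivalence, a minimal sketch looks like this. The names UNRESERVED and decode_unreserved are hypothetical, and all other escape sequences are deliberately left untouched.

```python
import re
import string

# Characters that RFC 3986 calls "unreserved".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(text: str) -> str:
    """Decode percent-encoded characters that never needed encoding."""
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Decode only unreserved characters; keep every other escape as it is.
        return char if char in UNRESERVED else match.group(0)

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)


print(decode_unreserved("/%7Euser/%41%2Fb"))  # /~user/A%2Fb
```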
A double slash in a path is usually interpreted as a single slash, so Audisto reduces it to a single slash. This, however, is a semantic change.
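A minimal sketch of this rule, assuming it applies to the path only and not to the double slash after the protocol. The function name collapse_slashes is hypothetical.

```python
import re


def collapse_slashes(path: str) -> str:
    """Replace every run of two or more slashes in a path with a single slash."""
    return re.sub(r"/{2,}", "/", path)


print(collapse_slashes("/a//b///c"))  # /a/b/c
```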
Audisto sorts all query parameters by name and value. This is a semantic change.
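Sorting can be sketched on the raw query string, so that nothing is re-encoded along the way. The function name sort_query is hypothetical, and the assumption that parameters are separated by & and compared as plain strings is ours.

```python
def sort_query(query: str) -> str:
    """Sort query parameters by name first, then by value."""
    pairs = query.split("&") if query else []
    # partition("=") yields (name, "=", value); [::2] keeps (name, value).
    pairs.sort(key=lambda pair: pair.partition("=")[::2])
    return "&".join(pairs)


print(sort_query("b=2&a=2&a=1"))  # a=1&a=2&b=2
```

The change is semantic because a backend may depend on parameter order, for example when the same name occurs more than once.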
Within a query, a space can be encoded both as %20 - like in the path - and as +. The latter is used for HTML form submission, like in POST requests using the application/x-www-form-urlencoded content type. Audisto will always turn %20 and spaces into a + within queries.
For query parameters without a value, Audisto drops the equals sign. This is a semantic change.
For an empty query, Audisto drops the question mark. This is a semantic change.
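The query rules for spaces, empty values and empty queries could be sketched as follows. The names clean_query and attach_query are hypothetical, and the code only illustrates the described behavior; it is not Audisto's implementation.

```python
def clean_query(query: str) -> str:
    """Write spaces as '+' and drop the equals sign of parameters without a value."""
    cleaned = []
    for pair in query.split("&"):
        pair = pair.replace("%20", "+").replace(" ", "+")
        name, _, value = pair.partition("=")
        cleaned.append(name if value == "" else f"{name}={value}")
    return "&".join(cleaned)


def attach_query(base: str, query: str) -> str:
    """Append the cleaned query; an empty query adds no question mark."""
    cleaned = clean_query(query)
    return f"{base}?{cleaned}" if cleaned else base


print(clean_query("q=a%20b&flag=&x"))            # q=a+b&flag&x
print(attach_query("http://example.com/", ""))   # http://example.com/
```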
Audisto can detect similar URLs during crawling. This can be enabled in the settings. A URL is regarded as similar to another if both result in the same URL after full normalization.
Therefore, similar URL detection will not work if URL Normalization Mode is Full, because the crawled URLs are then already fully normalized and no distinct similar URLs remain.
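The idea behind similar URL detection can be illustrated by grouping URLs under their normalized form. This is a sketch under assumptions: group_similar and toy_normalize are hypothetical names, and toy_normalize only sorts query parameters as a stand-in for full normalization.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_similar(urls: Iterable[str], normalize: Callable[[str], str]) -> list[list[str]]:
    """Group URLs that share the same normalized form."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    # Only groups with more than one member contain similar URLs.
    return [group for group in groups.values() if len(group) > 1]


def toy_normalize(url: str) -> str:
    """Stand-in normalizer: sorts the query parameters and nothing else."""
    base, sep, query = url.partition("?")
    return base + sep + "&".join(sorted(query.split("&")))


urls = [
    "http://example.com/?b=2&a=1",
    "http://example.com/?a=1&b=2",
    "http://example.com/other",
]
print(group_similar(urls, toy_normalize))
# [['http://example.com/?b=2&a=1', 'http://example.com/?a=1&b=2']]
```

If every URL is already normalized before it is stored, each group has exactly one member, so nothing is reported as similar.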