URL Normalization

Single Resource - Single URL

About URL Normalization

A URL generally points to a web page, an image, a CSS file, or anything else on your site. All these different kinds of content are called resources.

From time to time, two URLs point to the same resource. For example, www.domain.com/ and www.domain.com/index.html often both identify a domain's main page, resulting in duplicate content.

While duplicate content is sometimes created accidentally, it often arises simply because the same URL can be expressed in more than one way. For example, http://www.example.com:80/path always refers to the same resource as http://www.example.com/path.

This is where URL normalization comes in. It defines a set of rules to transform one URL into another, making URLs easily comparable. URL normalization is defined in RFC 3986. Audisto implements these rules, but some of them can be turned on or off.

Audisto URL normalization modes

Project, crawl and monitor settings offer two URL normalization modes:

  • None: Only basic normalization rules are applied; semantics are not changed. Not all of RFC 3986 is implemented.
  • Full: RFC 3986 is applied completely, and some additional semantic changes are made, such as ordering query parameters.

URL-Normalization: None

If URL normalization is set to None, only the most basic normalization rules are applied:

Resolving Paths

Within a URL, /./ and /../ have special meanings: they refer to the current and the parent directory, respectively. Audisto resolves these path elements.

  • http://www.example.com/a/./b/../c becomes http://www.example.com/a/c
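
The path resolution described above can be sketched in a few lines of Python. This is a minimal illustration, not Audisto's actual implementation; the function names `remove_dot_segments` and `resolve_path` are hypothetical, and the sketch assumes an absolute URL.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments in an absolute URL path."""
    output = []
    for segment in path.split("/"):
        if segment == ".":
            # "." refers to the current directory: skip it.
            continue
        if segment == "..":
            # ".." refers to the parent directory: drop the last segment,
            # but never the leading empty string that stands for the root.
            if len(output) > 1:
                output.pop()
            continue
        output.append(segment)
    # A trailing "." or ".." names a directory, so keep the trailing slash.
    if path.endswith(("/.", "/..")):
        output.append("")
    return "/".join(output)


def resolve_path(url: str) -> str:
    """Return the URL with "." and ".." path segments resolved."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


print(resolve_path("http://www.example.com/a/./b/../c"))
```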

Encoding Special and Non-ASCII characters

Within a URL, only a small set of characters is allowed, such as letters, digits and some special characters like the underscore or the hyphen. Everything else has to be encoded as a percent sign followed by a two-digit hexadecimal code for each byte. This also holds for spaces. Audisto does this automatically.

  • http://www.example.com/ümlaut becomes http://www.example.com/%C3%BCmlaut
  • http://www.example.com/some space becomes http://www.example.com/some%20space
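
As a rough sketch of this step, Python's standard `urllib.parse.quote` can percent-encode the path as UTF-8. The function name `encode_path` and the exact set of characters left untouched are assumptions made for illustration; Audisto's own rules may differ.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left as-is in the path (assumption for this sketch).
# "%" is included so that existing escapes such as %20 are not encoded twice.
SAFE_PATH_CHARS = "/%-._~!$&'()*+,;=:@"


def encode_path(url: str) -> str:
    """Percent-encode spaces and non-ASCII characters in the path (UTF-8)."""
    parts = urlsplit(url)
    encoded = quote(parts.path, safe=SAFE_PATH_CHARS)
    return urlunsplit(parts._replace(path=encoded))


print(encode_path("http://www.example.com/ümlaut"))
print(encode_path("http://www.example.com/some space"))
```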

Removing Default Ports

Both the HTTP and the HTTPS protocol have a default port assigned: 80 for HTTP and 443 for HTTPS. Audisto strips these ports if they are part of a URL.

  • http://www.example.com:80/ becomes http://www.example.com/
  • https://www.example.com:443/ becomes https://www.example.com/
  • http://www.example.com:8080/ stays http://www.example.com:8080/
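
A minimal sketch of default-port removal, assuming only the two protocols named above; `strip_default_port` and `DEFAULT_PORTS` are hypothetical names, not part of Audisto.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports of the two protocols discussed above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Remove the port from the URL if it is the default for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # Cut off the trailing ":port"; user info and host stay untouched.
        netloc = parts.netloc.rsplit(":", 1)[0]
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)


print(strip_default_port("http://www.example.com:80/"))
print(strip_default_port("http://www.example.com:8080/"))
```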

Non-empty Path

Audisto always sets / as the path for HTTP and HTTPS URLs if no path is given.

  • http://www.example.com becomes http://www.example.com/

Lowercase Host and Protocol

Audisto always turns the host and the protocol of a URL to lower case.

  • http://Www.Example.Com becomes http://www.example.com/
  • HTTP://www.example.com becomes http://www.example.com/
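
The non-empty path and lowercase rules can be illustrated together. The sketch below is an assumption about how such a step might look, with the hypothetical name `lowercase_and_root_path`; it lowercases only the host, not any user info in front of it.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_and_root_path(url: str) -> str:
    """Lowercase scheme and host, and use "/" as path if none is given."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Lowercase only the host part; user info before "@" is case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    path = parts.path
    if not path and scheme in ("http", "https"):
        path = "/"
    return urlunsplit(parts._replace(scheme=scheme, netloc=netloc, path=path))


print(lowercase_and_root_path("HTTP://Www.Example.Com"))
```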

URL-Normalization: Full

If URL normalization is set to Full, many more rules are used to normalize URLs. Some of these rules change the semantics of a URL, so that the normalized URL is, in a strict sense, not equivalent to its source. However, given a moderately sane backend implementation, the normalized URLs can still be considered equal in real life.

Decoding Encoded Allowed Characters

Characters that are allowed within a URL, such as letters and digits, should not be percent-encoded at all. If they are encoded anyway, Audisto decodes them.

  • http://www.example.com/%41 becomes http://www.example.com/A
  • http://www.example.com/%7euser becomes http://www.example.com/~user
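
A sketch of this decoding step, assuming the "allowed" characters are the unreserved set from RFC 3986 (letters, digits, hyphen, period, underscore and tilde). The name `decode_unreserved` is hypothetical.

```python
import re
import string

# Unreserved characters according to RFC 3986.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PERCENT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(url: str) -> str:
    """Decode percent-escapes that stand for unreserved characters."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Leave every other escape (e.g. %2F or %C3) exactly as it was.
        return char if char in UNRESERVED else match.group(0)

    return PERCENT_ESCAPE.sub(replace, url)


print(decode_unreserved("http://www.example.com/%7euser"))
```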

Removing Double Slashes In Path

A double slash is usually interpreted as a single slash, so Audisto strips it. This, however, is a semantic change!

  • http://www.example.com/some//path becomes http://www.example.com/some/path
  • http://www.example.com/some/path// becomes http://www.example.com/some/path/
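
One possible way to collapse repeated slashes is shown below. It touches only the path, so the // after the protocol is not affected; `collapse_slashes` is a hypothetical name for this sketch.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Replace every run of slashes in the path with a single slash."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=path))


print(collapse_slashes("http://www.example.com/some//path"))
```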

Sorted Query Parameters

Audisto sorts all query parameters by name and value. This is a semantic change.

  • http://www.example.com/?b=1&a=2 becomes http://www.example.com/?a=2&b=1
  • http://www.example.com/?a=2&a=1 becomes http://www.example.com/?a=1&a=2
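
Sorting by name and value might look like the following sketch. It compares the raw, still-encoded parameter text; whether Audisto compares raw or decoded values is not stated here, so treat that choice, and the name `sort_query`, as assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def sort_query(url: str) -> str:
    """Sort query parameters by name first, then by value."""
    parts = urlsplit(url)
    if not parts.query:
        return url
    # Split each parameter into (name, "=", value) without decoding it.
    params = [param.partition("=") for param in parts.query.split("&")]
    params.sort(key=lambda p: (p[0], p[2]))
    query = "&".join(name + sep + value for name, sep, value in params)
    return urlunsplit(parts._replace(query=query))


print(sort_query("http://www.example.com/?b=1&a=2"))
```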

Space Becomes Plus Sign In Query

Within a query, a space can be encoded both as %20, as in the path, and as +. The latter is used for HTML form submissions, such as POST requests using the MIME type application/x-www-form-urlencoded. Audisto always turns %20 and spaces into + within queries.

  • http://www.example.com/?q=hello world becomes http://www.example.com/?q=hello+world
  • http://www.example.com/?q=hello%20world becomes http://www.example.com/?q=hello+world
  • http://www.example.com/hello%20world stays http://www.example.com/hello%20world
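
A minimal sketch of this rule, which rewrites the query and leaves the path alone; `plus_for_spaces` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def plus_for_spaces(url: str) -> str:
    """Turn %20 and literal spaces into "+" inside the query only."""
    parts = urlsplit(url)
    query = parts.query.replace("%20", "+").replace(" ", "+")
    return urlunsplit(parts._replace(query=query))


print(plus_for_spaces("http://www.example.com/?q=hello%20world"))
```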

Drop = In Query Parameter If Empty

For query parameters, Audisto drops the equal sign, if there is no value. This is a semantic change.

  • http://www.example.com/?q= becomes http://www.example.com/?q

Drop ? If Query Is Empty

For an empty query Audisto drops the question mark. This is a semantic change.

  • http://www.example.com/? becomes http://www.example.com/
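
The last two rules, dropping = for empty values and dropping ? for an empty query, can be sketched together. The name `drop_empty_query_parts` is hypothetical; the removal of a bare ? relies on `urlunsplit` omitting the question mark when the query string is empty.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_empty_query_parts(url: str) -> str:
    """Drop "=" after value-less parameters and "?" before an empty query."""
    parts = urlsplit(url)
    params = []
    for param in parts.query.split("&"):
        name, _, value = param.partition("=")
        # Keep only the name when there is no value after the equal sign.
        params.append(name if value == "" else param)
    # With an empty query string, urlunsplit emits no "?" at all.
    return urlunsplit(parts._replace(query="&".join(params)))


print(drop_empty_query_parts("http://www.example.com/?q="))
print(drop_empty_query_parts("http://www.example.com/?"))
```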

Similar URLs

Audisto can detect similar URLs during crawling. This can be enabled in the settings. A URL is regarded as similar to another if both result in the same URL after full normalization.

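Therefore, similar URL detection will not work if URL Normalization Mode is Full: in that mode every URL is already fully normalized, so similar URLs have been merged into one URL before they could be compared.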
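
Conceptually, detecting similar URLs amounts to grouping URLs by their fully normalized form. The sketch below illustrates that idea only; `group_similar` is a hypothetical name, and `simple_normalize` is a deliberately simplified stand-in that sorts query parameters and nothing else, not the full rule set described above.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def simple_normalize(url: str) -> str:
    """Simplified stand-in for full normalization: sort query parameters."""
    parts = urlsplit(url)
    query = "&".join(sorted(p for p in parts.query.split("&") if p))
    return urlunsplit(parts._replace(query=query))


def group_similar(urls, normalize=simple_normalize):
    """Group URLs that share the same normalized form."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    # Only groups with more than one member contain similar URLs.
    return {key: members for key, members in groups.items() if len(members) > 1}


crawled = [
    "http://www.example.com/?b=1&a=2",
    "http://www.example.com/?a=2&b=1",
    "http://www.example.com/other",
]
print(group_similar(crawled))
```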