URL Normalization

Single Resource - Single URL

About URL Normalization

A URL generally points to a web page, an image, a CSS file, or anything else on your site. All these different kinds of content are called resources.

While we crawl your websites we have to handle a lot of different URLs, and this is far from easy. For the geeks: It means handling everything according to RFC 3986. For the rest of you: It means understanding your URLs, especially being able to compare URLs and determine whether two URLs can be considered equivalent.

Let's start with some simple examples:

  • https://www.example.com/ and https://www.example.com/index.html
  • https://www.example.com/path?a and https://www.example.com/path?a=
  • https://www.example.com/path?a&b and https://www.example.com/path?b&a

When you crawl as many websites as we do, you'll notice that the first and last example are quite common. In all examples we have two different URLs. However, in most cases those URLs lead to exactly the same resource and show exactly the same content. The problem is "in most cases". It is quite easy to write server code that handles the two URLs differently. Because of this, you need to request both versions to be sure they identify the same resource.

So why is this a problem?

If you can't be sure that two URLs are actually showing the same resource, you'll have to crawl both. And that is exactly what search engines like Google do.

We set up tests for equivalent URLs and studied what major search engines crawled. The result: They crawled every single URL!

However: Once they have verified that the equivalent URLs show the same resource they handle them as a single resource.

But this of course consumes crawl rate, which means crawlers spend less time on your other, considerably more important URLs. So you might end up with only a portion of your available URLs being crawled and indexed.

So how bad is it?

The worst-case scenario is a combination of the last two examples. A "=" could be appended to every parameter, and all the parameters could additionally be reordered. Here is the number of URLs you can get for a single URL depending on the number of parameters:

1 => 2
2 => 8
3 => 42
4 => 264
5 => 1,920
6 => 15,840
7 => 146,160
8 => 1,491,840
9 => 16,692,480

For geeks: This is a permutation problem, and the number is calculated as n! · (Σ(1..n) + 1) = n! · (n(n+1)/2 + 1) for n parameters.
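
The numbers in the table can be reproduced with a few lines of Python. This is only a sketch that evaluates the formula exactly as stated above; the function name variant_count is chosen for illustration.

```python
from math import factorial


def variant_count(n: int) -> int:
    """URL variants for n parameters, using the formula n! * (sum(1..n) + 1)."""
    # sum(range(1, n + 1)) is 1 + 2 + ... + n; factorial(n) counts the orderings.
    return factorial(n) * (sum(range(1, n + 1)) + 1)


for n in range(1, 10):
    print(f"{n} => {variant_count(n):,}")
```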

Whenever you use URLs with parameters, you should make sure those URLs are only accessible in a single normalized version, e.g. with alphabetically ordered parameters. Whenever you detect a request for a non-normalized version, you should redirect to the normalized version.
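
As a rough sketch of that advice, a request handler could compare the requested URL with its normalized form and redirect when they differ. The helper name redirect_target is hypothetical, and the only rule applied here is alphabetical ordering of the raw parameters; a real site would plug in its own normalization rules.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def redirect_target(url: str) -> Optional[str]:
    """Return the normalized URL to redirect to, or None if url is already normalized."""
    parts = urlsplit(url)
    # Assumption for this sketch: "normalized" only means alphabetically sorted parameters.
    params = [p for p in parts.query.split("&") if p]
    normalized = urlunsplit(parts._replace(query="&".join(sorted(params))))
    return normalized if normalized != url else None
```

A handler would answer with a permanent redirect (HTTP 301) to the returned URL and serve the page normally when the function returns None.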

How Audisto deals with this

While duplicate content is sometimes created accidentally, often it arises simply because the same URL may be expressed in more than one way. For example, http://www.example.com:80/path will always refer to the same resource as http://www.example.com/path.

This is where URL normalization kicks in. It defines a set of rules to transform one URL into another, making them easily comparable. URL normalization is defined in RFC 3986. These rules are implemented by Audisto, but some of them can be turned on or off.

Audisto URL normalization modes

Project, crawl and monitor settings offer two URL normalization modes:

  • None: Only basic normalization rules are applied; semantics are not changed. Not all of RFC 3986 is implemented.
  • Full: RFC 3986 is applied completely, and additionally some semantic changes are made, like ordering query parameters.

URL-Normalization: None

If the URL normalization is set to None, only the most basic normalization rules are applied:

Resolving Paths

Within a URL, /./ and /../ have special meanings, because they refer to the current or parent directory. Audisto correctly resolves these path elements.

  • http://www.example.com/a/./b/../c becomes http://www.example.com/a/c
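
The resolution of these path elements can be sketched as follows. This is a simplified illustration that assumes an absolute path (starting with /); it is not Audisto's actual implementation, and the function name is chosen for this example.

```python
def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments in an absolute URL path."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue  # "." refers to the current directory: skip it
        if segment == "..":
            # ".." goes up one level, but never above the root (the leading "")
            if len(output) > 1:
                output.pop()
            continue
        output.append(segment)
    # A trailing "." or ".." names a directory, so keep the trailing slash
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


print(remove_dot_segments("/a/./b/../c"))  # /a/c
```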

Encoding Special and Non-ASCII Characters

Within a URL, only a small set of characters is allowed, such as letters, digits and some special characters like an underscore or a hyphen. Everything else has to be encoded using a percent sign and a two-digit hexadecimal code for each byte. This also holds for a space. Audisto does this automatically.

  • http://www.example.com/ümlaut becomes http://www.example.com/%C3%BCmlaut
  • http://www.example.com/some space becomes http://www.example.com/some%20space
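
In Python, the standard library function urllib.parse.quote performs this kind of encoding. The sketch below assumes UTF-8 (which matches the ümlaut example) and treats / and % as safe so that path separators and existing escapes are left alone; the wrapper name encode_path is hypothetical.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode everything in a path that is not allowed as-is."""
    # safe="/%" keeps slashes and already-encoded sequences such as %20 untouched.
    # Limitation of this sketch: a stray "%" that is not part of an escape is not repaired.
    return quote(path, safe="/%")


print(encode_path("/ümlaut"))      # /%C3%BCmlaut
print(encode_path("/some space"))  # /some%20space
```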

Removing Default Ports

Both the HTTP and the HTTPS protocol have a default port assigned: 80 for HTTP and 443 for HTTPS. Audisto strips these ports if they are part of a URL.

  • http://www.example.com:80/ becomes http://www.example.com/
  • https://www.example.com:443/ becomes https://www.example.com/
  • http://www.example.com:8080/ stays http://www.example.com:8080/
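
A minimal sketch of this rule, using Python's urllib.parse. The names DEFAULT_PORTS and strip_default_port are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Remove the port from the URL if it is the default for the scheme."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # Cut off the ":port" suffix; the rest of the authority stays as it is
        netloc = parts.netloc.rsplit(":", 1)[0]
        return urlunsplit(parts._replace(netloc=netloc))
    return url
```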

Non-empty Path

Audisto always sets / as the path for HTTP and HTTPS URLs if no path was given.

  • http://www.example.com becomes http://www.example.com/

Lowercase Host and Protocol

Audisto always turns the host and the protocol of a URL to lower case.

  • http://Www.Example.Com becomes http://www.example.com/
  • HTTP://www.example.com becomes http://www.example.com/
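
The last two rules (non-empty path, lowercase host and protocol) could be sketched together like this. The function name is hypothetical, and lowercasing the whole authority is a simplification that would also affect user information if a URL contained any.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_and_root_path(url: str) -> str:
    """Lowercase protocol and host, and use "/" when no path is given."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()  # urlsplit already lowercases the scheme
    netloc = parts.netloc.lower()  # simplification: lowercases the whole authority
    path = parts.path
    if not path and scheme in ("http", "https"):
        path = "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```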

URL-Normalization: Full

If the URL normalization is set to Full, many more rules are used to normalize URLs. Some of these rules change the semantics of a URL, so that the normalized URL is, in a strict sense, not equivalent to its source. However, given a moderately sane backend implementation, in real life the normalized URLs can still be considered equal.

Decoding Encoded Allowed Characters

Characters that are allowed within a URL, such as letters and digits, should not be encoded at all. If they are percent-encoded anyway, Audisto decodes them.

  • http://www.example.com/%41 becomes http://www.example.com/A
  • http://www.example.com/%7euser becomes http://www.example.com/~user
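
One way to sketch this rule is a regular expression that looks at every percent escape and decodes only those that stand for an unreserved character (letters, digits, hyphen, dot, underscore, tilde, as listed in RFC 3986). The names below are chosen for this example.

```python
import re
import string

# Unreserved characters according to RFC 3986
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(url: str) -> str:
    """Decode percent escapes that represent unreserved characters."""
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Everything else (e.g. %20 or %C3%BC) is left exactly as it is
        return char if char in UNRESERVED else match.group(0)

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, url)
```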

Removing Double Slashes In Path

A double slash usually is interpreted as a single slash, so Audisto strips it. This, however, is a semantic change!

  • http://www.example.com/some//path becomes http://www.example.com/some/path
  • http://www.example.com/some/path// becomes http://www.example.com/some/path/
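
A small sketch of this rule: only the path component is touched, so the // after the protocol and anything in the query stay unchanged. The function name collapse_slashes is an assumption for this example.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Replace runs of slashes in the path with a single slash."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=path))
```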

Sorted Query Parameters

Audisto sorts all query parameters by name and value. This is a semantic change.

  • http://www.example.com/?b=1&a=2 becomes http://www.example.com/?a=2&b=1
  • http://www.example.com/?a=2&a=1 becomes http://www.example.com/?a=1&a=2

Space Becomes Plus Sign In Query

Within a query, a space can be encoded both as %20 (like in the path) and as +. The latter is used for HTML form submission, like in POST requests using the MIME type application/x-www-form-urlencoded. Audisto will always turn %20 and spaces into a + within queries.

  • http://www.example.com/?q=hello world becomes http://www.example.com/?q=hello+world
  • http://www.example.com/?q=hello%20world becomes http://www.example.com/?q=hello+world

Within a path, %20 is left untouched.

  • http://www.example.com/some%20path stays http://www.example.com/some%20path

Drop = In Query Parameter If Empty

For query parameters, Audisto drops the equals sign if there is no value. This is a semantic change.

  • http://www.example.com/?q= becomes http://www.example.com/?q

Drop ? If Query Is Empty

For an empty query Audisto drops the question mark. This is a semantic change.

  • http://www.example.com/? becomes http://www.example.com/
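
The four query rules (sorting, plus sign for spaces, dropping = and dropping ?) can be combined in one short sketch. It works on the raw query string and is an illustration only, not Audisto's code; skipping empty parameters such as in a&&b is an assumption made here, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_query(url: str) -> str:
    """Apply the query rules: + for spaces, sorted parameters, no empty = or ?."""
    parts = urlsplit(url)
    pairs = []
    for param in parts.query.split("&"):
        if not param:
            continue  # assumption: empty parameters are dropped
        param = param.replace("%20", "+").replace(" ", "+")
        name, _, value = param.partition("=")
        pairs.append((name, value))
    pairs.sort()  # sorts by name first, then by value
    # Write "name" instead of "name=" when the value is empty
    query = "&".join(f"{name}={value}" if value else name for name, value in pairs)
    # urlunsplit omits the "?" when the query is empty
    return urlunsplit(parts._replace(query=query))


print(normalize_query("http://www.example.com/?b=1&a=2"))  # http://www.example.com/?a=2&b=1
```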

Similar URLs

Audisto can detect similar URLs during crawling. This is enabled by default. A URL is considered similar to another if both result in the same URL after full normalization.

Therefore, similar URL detection will not work if the URL normalization mode is Full: in that mode every URL has already been fully normalized, so no distinct variants remain to be compared.
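
Conceptually, this detection amounts to grouping crawled URLs by their normalized form. The sketch below shows the idea with a deliberately tiny stand-in normalizer that only drops a trailing ?; group_similar and toy_normalize are hypothetical names, and a real implementation would apply all rules of the Full mode.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_similar(urls: Iterable[str], normalize: Callable[[str], str]) -> list:
    """Group URLs that end up identical after normalization."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    # Only groups with more than one member contain similar URLs
    return [group for group in groups.values() if len(group) > 1]


def toy_normalize(url: str) -> str:
    # Stand-in for full normalization: only drops an empty query's "?"
    return url[:-1] if url.endswith("?") else url


crawled = ["http://www.example.com/?", "http://www.example.com/", "http://www.example.com/a"]
print(group_similar(crawled, toy_normalize))
```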