URL Normalization
Single Resource - Single URL
About URL Normalization
A URL generally points to a web page, an image, a CSS file, or anything else on your site. All these different kinds of content are called resources.
While we crawl your websites we have to handle a lot of different URLs, and this is far from easy. For the geeks: It means handling everything according to RFC 3986. For the rest of you: It means understanding your URLs, especially being able to compare URLs and determine whether two URLs can be considered equivalent.
Let's start with some simple examples:
https://www.example.com/
https://www.example.com/index.html
https://www.example.com/path?a
https://www.example.com/path?a=
https://www.example.com/path?a&b
https://www.example.com/path?b&a
When you crawl as many websites as we do, you'll notice that the first and last pairs are quite common. Each example pairs two different URLs. However, in most cases those URLs lead to exactly the same resource and show exactly the same content. The problem is "in most cases". It is quite easy to write code that handles the two URLs differently, so you need to access both versions to be sure they identify the same resource.
So why is this a problem?
If you can't be sure that two URLs are actually showing the same resource, you'll have to crawl both. And that is exactly what search engines like Google do.
We set up tests for equivalent URLs and studied what major search engines crawled. The result: They crawled every single URL!
However: Once they have verified that the equivalent URLs show the same resource, they handle them as a single resource.
But this of course consumes crawl rate, which means crawlers spend less time on your other, considerably more important URLs. So you might end up with only a portion of the available URLs being crawled and indexed.
So how bad is it?
The worst case scenario is a combination of the last two examples. A "=" could be applied to every parameter and all the parameters could additionally be reordered. Here is the number of URLs you can get for a single URL depending on the number of parameters:
1 => 2
2 => 8
3 => 42
4 => 264
5 => 1,920
6 => 15,840
7 => 146,160
8 => 1,491,840
9 => 16,692,480
For geeks: This is a problem of permutations and calculates as n!(Σ(1..n) + 1) for n parameters.
Whenever you use URLs with parameters, you should make sure those URLs are only accessible in a single normalized version, e.g. with alphabetically ordered parameters. Whenever you detect a request for a non-normalized version, you should redirect to the normalized version.
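The redirect rule above can be sketched in a few lines of Python. This is only an illustration, not how any particular server does it: the function names canonical_url and redirect_target are made up for this example, and it assumes that "normalized" means alphabetically sorted parameters with the "=" dropped from empty values.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return the single normalized form of a URL (assumed rules, see above)."""
    parts = urlsplit(url)
    params = []
    for param in parts.query.split("&"):
        name, _, value = param.partition("=")
        # "a=" and "a" collapse to "a"; parameters with a value are kept as-is.
        params.append(param if value else name)
    # Drop empty leftovers (e.g. from "a&&b") and sort alphabetically.
    query = "&".join(sorted(p for p in params if p))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def redirect_target(url: str):
    """Return the URL to redirect to, or None if the URL is already normalized."""
    canonical = canonical_url(url)
    return canonical if canonical != url else None
```

A request handler could call redirect_target for each incoming URL and answer with a permanent redirect whenever it returns something other than None.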
How Audisto deals with this
While duplicate content is sometimes created accidentally, often it exists just because the same URL may be expressed in more than one way. For example, http://www.example.com:80/path will always refer to the same resource as http://www.example.com/path.
This is where URL normalization kicks in. It defines a set of rules to transform one URL into another, making them easily comparable. URL normalization is defined in RFC 3986. These rules are implemented by Audisto, but some of them can be turned on or off.
Audisto URL normalization modes
Project, crawl and monitor settings offer two URL normalization modes:
- None: Just basic normalization rules are applied, semantics are not changed. Not all of RFC 3986 is implemented.
- Full: RFC 3986 is applied completely and, in addition, some semantic changes are made, such as ordering query parameters.
URL-Normalization: None
If the URL normalization is set to None, only the most basic normalization rules are applied:
- Resolving Paths
  Within a URL, /./ and /../ have special meanings, because they refer to the current or parent directory. Audisto correctly resolves these path elements.
  http://www.example.com/a/./b/../c becomes http://www.example.com/a/c
- Encoding Special and Non-ASCII Characters
  Within a URL, only a small set of characters is allowed, such as letters, digits and some special characters like an underscore or a hyphen. Everything else has to be encoded using a percent sign and a two-digit hexadecimal code. This also holds for a space. Audisto does this automatically.
  http://www.example.com/ümlaut becomes http://www.example.com/%C3%BCmlaut
  http://www.example.com/some space becomes http://www.example.com/some%20space
- Removing Default Ports
  Both the HTTP and the HTTPS protocol have a default port assigned, which is 80 in the case of HTTP and 443 in the case of HTTPS. Audisto strips these ports if they are part of a URL.
  http://www.example.com:80/ becomes http://www.example.com/
  https://www.example.com:443/ becomes https://www.example.com/
  http://www.example.com:8080/ stays http://www.example.com:8080/
- Non-empty Path
  Audisto always sets / as the path for the HTTP and HTTPS protocols if no path was given.
  http://www.example.com becomes http://www.example.com/
- Lowercase Host and Protocol
  Audisto always turns the host and the protocol of a URL to lower case.
  http://Www.Example.Com becomes http://www.example.com/
  HTTP://www.example.com becomes http://www.example.com/
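The five basic rules can be combined into one small function. The following Python sketch is not Audisto's implementation; normalize_basic, resolve_dot_segments and the SAFE character set are names and choices made up for this illustration. It assumes plain ASCII host names (no IDN handling) and relies on the standard library's urlsplit, which already lowercases the protocol and the host name.

```python
from urllib.parse import quote, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Characters left untouched when encoding (assumed set for this sketch).
# "%" is included so that existing escapes are not encoded twice.
SAFE = "/%:@!$&'()*+,;=?"


def resolve_dot_segments(path: str) -> str:
    """Resolve /./ and /../ without touching other segments."""
    segments = path.split("/")
    resolved = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            if len(resolved) > 1:  # never pop the leading root segment
                resolved.pop()
            continue
        resolved.append(segment)
    if segments[-1] in (".", ".."):
        resolved.append("")  # keep the trailing slash: /a/b/.. -> /a/
    return "/".join(resolved)


def normalize_basic(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme        # already lowercased by urlsplit
    host = parts.hostname or ""  # already lowercased by urlsplit
    if ":" in host:              # IPv6 literal: restore the brackets
        host = f"[{host}]"
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += f":{parts.port}"
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + netloc
    # Non-empty path for HTTP and HTTPS only.
    path = parts.path or ("/" if scheme in DEFAULT_PORTS else "")
    result = f"{scheme}://{netloc}{quote(resolve_dot_segments(path), safe=SAFE)}"
    if "?" in url.split("#", 1)[0]:  # keep "?" even if the query is empty
        result += "?" + quote(parts.query, safe=SAFE)
    if parts.fragment:
        result += "#" + quote(parts.fragment, safe=SAFE)
    return result
```

Note that the sketch deliberately leaves double slashes, parameter order and an empty query untouched, since those changes belong to the Full mode.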
URL-Normalization: Full
If the URL normalization is set to Full, a lot more rules are used to normalize URLs. Some of these rules change the semantics of a URL, so that the normalized URL is, in a strict sense, not equivalent to its source. However, given a moderately sane backend implementation, in real life the normalized URLs can still be considered equal.
- Decoding Encoded Allowed Characters
  Characters that are allowed within a URL, such as letters and digits, should not be percent-encoded at all. If they are, Audisto decodes them.
  http://www.example.com/%41 becomes http://www.example.com/A
  http://www.example.com/%7euser becomes http://www.example.com/~user
- Removing Double Slashes In Path
  A double slash is usually interpreted as a single slash, so Audisto strips it. This, however, is a semantic change!
  http://www.example.com/some//path becomes http://www.example.com/some/path
  http://www.example.com/some/path// becomes http://www.example.com/some/path/
- Sorted Query Parameters
  Audisto sorts all query parameters by name and value. This is a semantic change.
  http://www.example.com/?b=1&a=2 becomes http://www.example.com/?a=2&b=1
  http://www.example.com/?a=2&a=1 becomes http://www.example.com/?a=1&a=2
- Space Becomes Plus Sign In Query
  Within a query, a space can be encoded both as %20, like in the path, and as +. The latter is used for HTML form submission, like in POST requests using the MIME type application/x-www-form-urlencoded. Audisto will always turn %20 and spaces into a + within queries.
  http://www.example.com/?q=hello world becomes http://www.example.com/?q=hello+world
  http://www.example.com/?q=hello%20world becomes http://www.example.com/?q=hello+world
  Within a path, %20 is left untouched.
  http://www.example.com/some%20path stays http://www.example.com/some%20path
- Drop = In Query Parameter If Empty
  For query parameters, Audisto drops the equal sign if there is no value. This is a semantic change.
  http://www.example.com/?q= becomes http://www.example.com/?q
- Drop ? If Query Is Empty
  For an empty query, Audisto drops the question mark. This is a semantic change.
  http://www.example.com/? becomes http://www.example.com/
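To make the additional Full rules more concrete, here is a minimal Python sketch of them. It is an illustration under assumptions, not Audisto's code: normalize_full and decode_unreserved are hypothetical names, the input is assumed to have already passed the basic rules, and empty parameters (as in a&&b) are simply dropped, which the text above does not specify.

```python
import re
from urllib.parse import urlsplit

# Characters that never need percent-encoding (RFC 3986 "unreserved").
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def decode_unreserved(text: str) -> str:
    """Decode %XX escapes of allowed characters, uppercase all other escapes."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)


def normalize_full(url: str) -> str:
    parts = urlsplit(url)
    # Decode allowed characters and collapse runs of slashes in the path.
    path = re.sub(r"/{2,}", "/", decode_unreserved(parts.path))
    params = []
    for param in parts.query.split("&"):
        if not param:
            continue  # assumption: empty parameters are dropped
        # In the query, both %20 and a literal space become "+".
        param = decode_unreserved(param.replace("%20", "+").replace(" ", "+"))
        name, _, value = param.partition("=")
        params.append((name, value))
    params.sort()  # sort by name, then by value
    # Drop "=" for empty values.
    query = "&".join(f"{n}={v}" if v else n for n, v in params)
    result = f"{parts.scheme}://{parts.netloc}{path}"
    if query:  # drop "?" if the query is empty
        result += "?" + query
    if parts.fragment:
        result += "#" + parts.fragment
    return result
```

Sorting tuples of (name, value) gives the "by name and value" order described above without any custom comparison logic.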
Similar URLs
Audisto can detect similar URLs during crawling. This is enabled by default. A URL is considered similar to another if both result in the same URL after full normalization.
Therefore, similar URL detection will not work if URL Normalization Mode is Full.
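The idea behind similar URL detection can be sketched as grouping crawled URLs by their normalized form. The Python below is a rough illustration only: group_similar is a hypothetical helper, and toy_normalize is a deliberately tiny stand-in that only sorts query parameters, where a real setup would apply the complete set of Full rules.

```python
from collections import defaultdict


def toy_normalize(url: str) -> str:
    """Stand-in normalizer for this example: sorts query parameters only."""
    base, sep, query = url.partition("?")
    return base + sep + "&".join(sorted(query.split("&"))) if sep else url


def group_similar(urls, normalize):
    """Map each normalized URL to the crawled URLs that collapse onto it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    # Only groups with more than one member contain similar URLs.
    return {key: members for key, members in groups.items() if len(members) > 1}


crawled = [
    "https://www.example.com/path?a&b",
    "https://www.example.com/path?b&a",
    "https://www.example.com/",
]
similar = group_similar(crawled, toy_normalize)
```

If the crawled URLs were already fully normalized before this step, every group would have exactly one member, which is why the detection has nothing to report in Full mode.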