New Advanced Settings for URL Normalization

Our latest release introduces a new setting for URL Normalization to give you even more insights. This is a big change, so we want to explain in detail what we did and why we did it.

While we crawl your websites, we have to handle a lot of different URLs, and this is far from easy. For the geeks: It means handling everything according to RFC 3986. For the rest of you: It means understanding your URLs, especially being able to compare URLs and determine whether two URLs can be considered equivalent.

Let's start with some simple examples:

http://www.example.com/path?a
http://www.example.com/path?a=

http://www.example.com/path?a&b
http://www.example.com/path?b&a

In both examples we have two different URLs. However, in most cases those URLs lead to exactly the same resource and show exactly the same content. The problem lies in "in most cases": it is quite easy to write code that handles the two URLs differently, so the only way to be sure they identify the same resource is to actually access both versions.
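To illustrate the point: a plain string comparison treats the two URLs as different, and only fetching both and comparing the responses can confirm that they serve the same resource. Here is a minimal Python sketch of that idea, using the placeholder URLs from above (which of course won't return real content):

from hashlib import sha256
from urllib.request import urlopen

def body_hash(url: str) -> str:
    """Fetch a URL and return a hash of its body (sketch only, no error handling)."""
    with urlopen(url) as response:
        return sha256(response.read()).hexdigest()

url_a = "http://www.example.com/path?a"
url_b = "http://www.example.com/path?a="

print(url_a == url_b)                        # False: as strings they differ
print(body_hash(url_a) == body_hash(url_b))  # usually True, but only a fetch can prove it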

When you crawl as many websites as we do, you'll notice that the second example is quite common. We recently stumbled across a shop where the order of the parameters was defined by the order of filters applied when filtering products in a category.

So why is this a problem?

If you can't be sure that two URLs are actually showing the same resource, you'll have to crawl both. And that is exactly what search engines like Google do.

We set up tests for equivalent URLs and studied what major search engines crawled. The result: They crawled every single URL!

However, once they have verified that the equivalent URLs show the same resource, they handle them as a single resource.

But this of course consumes crawl budget, which means Google spends less time on your other, considerably more important URLs. So you might end up with only a portion of your available URLs being crawled and indexed.

So how bad is it?

The worst-case scenario is a combination of the two examples. A "=" could be applied to every parameter, and all the parameters could additionally be reordered. Here is the number of URL variants you can get for a single URL, depending on the number of parameters:

1 => 2
2 => 8
3 => 42
4 => 264
5 => 1,920
6 => 15,840
7 => 146,160
8 => 1,491,840
9 => 16,692,480

For the geeks: This is a permutation problem, and the count works out to n! × (Σ(1..n) + 1) for n parameters.
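Here is a short Python sketch that reproduces the table above from that formula:

from math import factorial

def url_variants(n: int) -> int:
    """Number of equivalent URL variants for n parameters: n! * (sum(1..n) + 1)."""
    return factorial(n) * (n * (n + 1) // 2 + 1)

for n in range(1, 10):
    print(f"{n} => {url_variants(n):,}")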

So what did we change in our behaviour?

Because we were aware of the problem, Audisto used to normalize the URLs without looking at the content. We just assumed that the URLs show the same resource. All possible equivalent URLs were transformed to a single URL by ordering the parameters and removing the "=" characters.
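In code, that kind of full normalization looks roughly like the following simplified sketch. It only covers parameter ordering and empty "=" values, not the rest of RFC 3986 (case, percent-encoding, default ports, dot-segments and so on):

from urllib.parse import urlsplit, urlunsplit, parse_qsl

def normalize_url(url: str) -> str:
    """Sort the query parameters and drop the '=' of empty values (simplified sketch)."""
    parts = urlsplit(url)
    # keep_blank_values=True makes "?a" and "?a=" both parse as ('a', '')
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(k if v == "" else f"{k}={v}" for k, v in params)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

# Both examples from above collapse to a single URL:
print(normalize_url("http://www.example.com/path?a="))   # http://www.example.com/path?a
print(normalize_url("http://www.example.com/path?b&a"))  # http://www.example.com/path?a&b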

By doing so we were able to combine all the links for the different URLs and came close to the way major search engines deal with this kind of problem. The downside: this also made us unable to point out this kind of problem to you.

We kept thinking about a really good solution for this scenario, and as of today you can choose between a strict mode that does not normalize URLs and a full normalization mode that normalizes all equivalent URLs to a single one. Additionally, we made sure you can see all the equivalent URLs when crawling in strict mode.

Even though this is a change in behaviour, we made the strict mode the new default.

How can you prevent this?

Whenever you use URLs with parameters, you should make sure those URLs are only accessible in a single normalized version. Whenever you detect an access to a non-normalized version, you should redirect to the normalized version.
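How you implement that depends on your stack. As one possible sketch, assuming a Flask application (the hook and helper names are made up for illustration), the redirect could look like this:

from flask import Flask, redirect, request

app = Flask(__name__)

def _canonical_pair(pair: str) -> str:
    """Turn 'a=' into 'a'; leave 'a' and 'a=b' untouched."""
    name, _, value = pair.partition("=")
    return name if value == "" else pair

@app.before_request
def redirect_to_normalized_url():
    """301-redirect any request whose query string is not in its normalized form."""
    raw = request.query_string.decode("utf-8")
    if not raw:
        return None
    # Sort the raw parameter pairs and drop empty '=' without decoding them,
    # so percent-encoding stays untouched; the transform is idempotent.
    normalized = "&".join(sorted(_canonical_pair(p) for p in raw.split("&")))
    if normalized != raw:
        return redirect(request.path + "?" + normalized, code=301)
    return None

The same rule can usually also be enforced directly in your web server or CDN configuration instead of in the application.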
