Rel Canonical Tag - Avoid Duplicate Content - SEO Guide

Canonicalization is a tool to address issues with internal and external duplication of content. It clarifies which URL is the preferred version of the content, which is especially useful when you syndicate a lot of content or run a GET-parameter-heavy environment.

This guide will point out common mistakes and show when to use rel=canonical - and when not to use it.

What is rel=canonical

Rel=canonical is a way to specify the preferred version of resources with duplicate content. It will not prevent your duplicates from being crawled and it will not prevent search engines from indexing your content. Instead it serves as a suggestion to search engines on which URL they should prefer in the search results when two or more documents are identical or very similar.

To understand how rel=canonical is used by search engines, imagine the following:

The search engine has an unfiltered set of results for a search query. Now the search engine tries to eliminate the duplicates. At this point the suggestion made with the rel=canonical will be processed and in most cases be used as a directive to show your preferred URL for the content while filtering duplicate content.

There are quite a few cases in which the rel=canonical is commonly misused. The effect can be ranking loss or wasted ranking potential.

How to use rel=canonical

Canonical in HTML markup vs HTTP header

There are two ways to include a canonical in your site. One is in the markup, inside the <head> element, which is the best-known method. In fact, many CMSs now do this on their own.

<html>
<head>
<link rel="canonical" href="http://example.com/page.html">
</head>
<body>
...

The other way to include a canonical is sending it with the HTTP header.

HTTP/1.1 200 OK
Content-Type: application/pdf
Link: <http://example.com/page.html>; rel="canonical"
Content-Length: 4223
...

The HTTP header canonical is usually used for documents that are not HTML. This way you can set canonical URLs for images, PDFs or any other document type. In practice this is often applied to print versions of the content, downloadable PDF versions and similar non-HTML duplicates.

When to use rel=canonical

The most common and reasonable case of canonical usage is a self reference in every unique document. The statement is basically: "Hey there, I am the original document. Get me indexed and list me in search results if different versions of this content are found fitting for the search query."

This approach usually prevents problems with identical copies of your content. Duplicate content can occur internally and externally.

The following scenarios can be prevented with proper canonical usage:

  • Problems with GET-Parameters
    • Tracking Parameters
    • Session Parameters
    • Unwanted / Unverified Parameters
    • Unsorted Parameters
  • Problems with multiple URLs for the same content
    • CMS has more than one version for the content (e.g. version with ID, and speaking URL)
  • Problems due to accessibility on different hosts / protocols / ports
    • HTTP / HTTPS
    • Port 80 / 8080
    • www / without www
    • different domains
  • Duplicate content from external content syndication
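As a minimal sketch of the tracking-parameter case, assume the same page is reachable at its clean URL and with a campaign parameter appended. The parameter name utm_source and the URLs are illustrative assumptions, not taken from a real setup. Both variants would then carry the same self-referencing canonical:

```html
<!-- Served identically at:
       http://example.com/page.html
       http://example.com/page.html?utm_source=newsletter   (hypothetical tracking parameter) -->
<html>
<head>
<title>Example page</title>
<!-- Same absolute canonical on every variant, pointing at the clean URL -->
<link rel="canonical" href="http://example.com/page.html">
</head>
<body>
</body>
</html>
```

The same pattern should carry over to session parameters and unsorted parameters: whatever the requested URL looks like, the canonical names the single preferred version.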

How and when not to use rel=canonical

Choose full URLs over shortened URLs for rel=canonical

One possible source of problems with the canonical tag is using shortened (relative) URLs instead of full (absolute) URLs as the canonical URL.

There is always a good chance your website has content that is available using different protocols or hosts but with the same directory and file name.

In that case, each version can contain exactly the same relative rel=canonical markup, yet each one resolves to a different URL.

Suppose you have two pages that differ only in protocol, set up like this:

http://example.com/page.html

<link rel="canonical" href="page.html">

https://example.com/page.html

<link rel="canonical" href="page.html">

Resolving those canonical links will result in the following different URLs:

  • http://example.com/page.html
  • https://example.com/page.html

The more complete your canonical URL is, the less error prone it is.

<link rel="canonical" href="page.html">

This can result in problems with

  • directories
  • hosts
  • protocols

<link rel="canonical" href="/page.html">

This can result in problems with

  • hosts
  • protocols

<link rel="canonical" href="//example.com/page.html">

This can result in problems with

  • protocols

<link rel="canonical" href="http://example.com/page.html">

This version is not affected by any of the problems.

While there are legitimate reasons to use short or relative URLs, and RFC 6596 allows them, absolute canonical URLs avoid a number of possible issues. Keep in mind:

  • Relative URLs are shorter, but they are more error prone, harder to maintain, and harder for third-party crawlers to evaluate if they are not applied 100% correctly.
  • If your site gets cloned to an external website, you will not benefit from relative URLs in the canonical tag - in fact you might be helping the content thief.

We strongly suggest that you always use absolute URLs when using rel=canonical.

Do not use rel=canonical for localization

If your website targets more than one country or more than one language you should help search engines to properly identify the correct URLs for the specific languages or target countries.

It is not very common but there are websites with multiple languages where the rel=canonical element points to a preferred language version.

The misconception is that you can use the rel=canonical element to specify a preferred localized version.

Instead, you should use rel=alternate together with the hreflang attribute to clarify which version belongs to which target market. It will then be easier for search engines to index all versions properly and show the right results to the right target group.

The hreflang tag can be used to make a connection between all distinct language versions of your content. You can also use x-default to specify a URL as default for users outside your focus regions or targeted languages.
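A minimal sketch of such a setup is shown here, assuming an English and a German version plus a language-neutral default. All URLs and the choice of languages are hypothetical.

```html
<!-- Head of the English version, http://example.com/en/page.html (hypothetical URL) -->
<head>
<!-- Each language version lists all versions, including itself -->
<link rel="alternate" hreflang="en" href="http://example.com/en/page.html">
<link rel="alternate" hreflang="de" href="http://example.com/de/page.html">
<!-- Fallback for users outside the targeted languages or regions -->
<link rel="alternate" hreflang="x-default" href="http://example.com/page.html">
<!-- The canonical stays self-referencing; it does not point to another language -->
<link rel="canonical" href="http://example.com/en/page.html">
</head>
```

The German version would carry the same set of rel=alternate links, with its own URL as the canonical.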

More information about how to set up multilanguage websites can be found in the Google Search Console Help Center.

We strongly suggest not to use rel=canonical for localization.

Do not use rel=canonical for PageRank Sculpting

In some cases we found canonical usage for PageRank Sculpting. PageRank Sculpting is a method to shape PageRank flow on a website. Some of a website's pages are not supposed to rank in the search results. That's why some webmasters try to channel the PageRank past these pages in order to strengthen the actual landing pages. This usually applies to functional pages like "Imprint" or "About us" - or even to pagination and category pages.

In this example the "Imprint" or "About us" page would be set up with a canonical tag with an important landing page as a target URL.

The attempt is based on the assumption that a canonical will channel all link juice to the canonical URL, regardless of the page's content. Our tests have shown that this is not the case.

In our experience it isn't helpful and might result in search engine spiders ignoring the website's canonical tags altogether.

Don't use canonical tags for PageRank Sculpting!

Canonical usage in pagination

Pagination is a technique for dividing content into discrete pages. Pagination is used when the content is too large to show on just one URL. In this case the content can be split into multiple pages.

If your website uses pagination, you might want to prevent search engines from crawling or indexing the pagination beyond the first page. This applies especially if

  • the pagination pages do not add any indexation value over page one
  • the website has a very deep pagination level, so the crawler would spend a considerable amount of your crawl budget on crawling your pagination
  • the pagination pages can be considered thin content

With pagination you are usually not dealing with duplicate content. The content of the individual pages is not even considered to be similar. However, rel=canonical is meant for duplicate or similar content. Defining a non-self-referencing canonical URL on pagination pages is in most cases a misuse of the rel=canonical element.

With a non self referencing canonical URL on pagination pages, you basically tell search engines to ignore your content on the specific page. If this is your goal you should use the noindex robots directive to prevent indexing of those pages or block the crawling with a robots.txt.
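The noindex option could look like the following sketch on a deeper pagination page. The URL is a hypothetical example.

```html
<!-- Head of a pagination page, e.g. http://example.com/category?page=2 (hypothetical URL) -->
<head>
<!-- Asks search engines not to index this page -->
<meta name="robots" content="noindex">
</head>
```

Note that the two options do not combine well: a crawler can only see a noindex directive on a page it is allowed to fetch, so a page blocked in robots.txt will not have its noindex read.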

Check out our Ultimate Pagination Guide to learn more about proper pagination of websites.

Canonical usage for similar products

A more common mistake - especially for online shops - is the usage of rel=canonical for very similar products (minor differences, different colours or product versions).

The positive effect: Your similar products won't get recognized as duplicate content.

The negative effect: Your canonical URL product version gets preferred in the search results. So if a customer searches for the alternate, the wrong product version is likely to show up in search.

Example:
You are selling T-shirts and you offer the same shirt in the colour versions blue, red and yellow on different URLs. To prevent duplicate content issues you point your canonical to the best-selling version of the product - say, blue - to have it preferred in the search results. The default product gets a self-referencing rel=canonical while the alternates all get a canonical pointing to the default product URL.

If a search user now searches for "red shirt" Google might remember it once found a red shirt in your shop but it also remembers you've told it (by using the rel=canonical) to show the blue shirt page.

This leads to a search result that is less fitting to the user search query and therefore a lower click through rate from the search results.

What you really want is to get all of your alternates into the Google index, showing up at the right moment. You can achieve this by using schema.org markup.

There is a neat little property called isSimilarTo. With the help of this structured data property you can tell Google that your blue, red and yellow shirts are all of the same importance and differ in only one small aspect. With this in place, you can use self-referencing canonical URLs on all your product alternates.

And that's exactly what you want to tell the search engines. Usually Google will get it right and show the right product version for the right query.
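As a rough sketch, the red shirt page from the example could be marked up as follows. The use of JSON-LD syntax, the product names and all URLs are assumptions made for illustration; the same schema.org properties could also be expressed in another supported syntax.

```html
<!-- Head of the red shirt page, http://example.com/shirt-red.html (hypothetical URL) -->
<head>
<!-- Self-referencing canonical: the red shirt stays eligible for the index -->
<link rel="canonical" href="http://example.com/shirt-red.html">
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "T-Shirt, red",
  "color": "red",
  "url": "http://example.com/shirt-red.html",
  "isSimilarTo": [
    { "@type": "Product", "name": "T-Shirt, blue", "url": "http://example.com/shirt-blue.html" },
    { "@type": "Product", "name": "T-Shirt, yellow", "url": "http://example.com/shirt-yellow.html" }
  ]
}
</script>
</head>
```

The blue and yellow pages would be set up the same way, each with its own self-referencing canonical and the other two colours listed under isSimilarTo.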

Canonical usage for mobile website versions

Sometimes a website offers a distinct mobile version that is hosted on a separate host like m.example.com.

Now the question is how to prevent duplicate content issues between your main website and that extra mobile version of your website.

Google suggests using a combination of rel=alternate and rel=canonical.

Rel=alternate tells the search engines there are other versions of this content available that may be a perfect fit for the search user, depending on device and user agent.

In some cases we've seen that the mobile version is set up with a canonical pointing to the main version of the website but the rel=alternate was missing. The result is that the mobile website version will likely not be shown in the search results. Mobile users will find your desktop-optimized website in search results. Usually the desktop-optimized website will not be mobile friendly, and if there is no autodetection and redirect to the mobile website, users will have a hard time using your website on their mobile devices.

What you really want is to have both versions of the site to show up in the right situation. That's what you achieve using the combination of rel=alternate and rel=canonical.

example desktop version http://example.com

<html>
<head>
<link rel="canonical" href="http://example.com/">
<link rel="alternate" href="http://m.example.com/" media="only screen and (max-width: 640px)">
</head>
<body>

example mobile version http://m.example.com

<html>
<head>
<link rel="canonical" href="http://example.com/">
</head>
<body>

Common unintended mistakes with the rel=canonical

Using multiple canonical tags with different target URLs

If you put a second canonical tag in your site and the two carry conflicting canonical URLs, search engines are likely to ignore both canonical tags.

Usually this happens unintentionally as a result of SEO plugin usage, leading to weird search engine behaviour.

Take care: This also applies to HTTP header canonical and HTML canonical combinations. If you don't look closely it can be tricky to find.
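A conflicting pair in the markup might look like the following sketch. The theme and plugin roles and the URLs are hypothetical.

```html
<head>
<!-- Emitted by the theme (hypothetical) -->
<link rel="canonical" href="http://example.com/page.html">
<!-- Emitted a second time by an SEO plugin (hypothetical), with a different target -->
<link rel="canonical" href="http://example.com/page.html?id=42">
</head>
```

The same conflict arises when one of the two targets is sent in a Link HTTP header instead of the markup, which is why checking the source alone may not reveal it.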

Using canonical outside of <head> area

Another common mistake in canonical usage is to place it outside the <head> element, especially inside the <body>. Most search engines will ignore canonical tags outside the <head>. Google in particular will not interpret any canonical tag placed outside the <head>.
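For illustration, this is what the misplaced variant looks like. The URL is a hypothetical example.

```html
<html>
<head>
<title>Example page</title>
</head>
<body>
<!-- Misplaced: a canonical inside the body is likely to be ignored -->
<link rel="canonical" href="http://example.com/page.html">
</body>
</html>
```

Moving the link element up into the head is all that is needed to fix it.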

Canonical pointing to target URL with other status code than 200

If your canonical URL delivers a non-200 status code, it might cause your suggestion to get ignored by search engines.

If the canonical URL target delivers a 30x redirect, it forces the search engine spider to crawl one additional URL. Across many pages these extra requests add up quickly and waste your crawl budget.

The worst case is a rel=canonical target that returns a 4xx or 5xx status code. Such status codes make the canonical link fail and are therefore likely to make the search engine ignore your canonical altogether. As a result the index might get stuffed with duplicates.

Always make sure you check your canonical URL for proper functionality and server response code 200.

 
