Isolation Level of HTML Documents

Audisto calculates the Isolation Level of all HTML documents. This feature allows you to quickly analyze how many documents may be negatively affected by long-term noindex and long-term canonical.

Figure: The Audisto Isolation Level report.

About isolated pages

When crawling the web, search engines try to spend their resources as efficiently as possible. Modern search engines use various techniques to identify important and unimportant documents and continuously adjust their crawling. In this process, even important documents can become isolated from the rest of the link graph and stop being crawled and indexed.

Figure: Three pages A, B, and C, where C has no incoming links and is therefore isolated.

In this article we will discuss how important documents can become isolated.

Methods to help search engines with crawl efficiency

A webmaster has a number of ways to help search engines with the crawling process. The following methods are well known for preventing crawling (see the examples below):

  • Disallow in robots.txt
  • Nofollow links
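
For illustration, a Disallow rule in robots.txt and a nofollow link could look like this (the path and URL are hypothetical):

    # robots.txt -- forbids crawling of everything below /internal-search/
    User-agent: *
    Disallow: /internal-search/

    <!-- A nofollow link: asks search engines not to follow this link -->
    <a href="https://www.example.com/some-page" rel="nofollow">Some page</a>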

To optimize crawling, you can provide machine-readable information about changes (examples below) using:

  • XML sitemaps
    • lastmod
    • changefreq
  • Robots directive
    • unavailable_after
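
For example, a sitemap entry with lastmod and changefreq, and an unavailable_after directive for Googlebot, could look like this (URL and dates are made up):

    <!-- XML sitemap entry: tells crawlers when and how often the page changes -->
    <url>
      <loc>https://www.example.com/some-page</loc>
      <lastmod>2018-01-15</lastmod>
      <changefreq>weekly</changefreq>
    </url>

    <!-- Robots directive: the page may be dropped from the index after this date -->
    <meta name="googlebot" content="unavailable_after: 15-Jan-2019 00:00:00 GMT">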

In addition, certain characteristics of the documents are known to affect crawling:

  • Freshness
  • Change frequency
  • Indexability
    • Robots directive
    • Canonical

While freshness and change frequency do not isolate documents, the indexability of a document can.
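
In HTML, these two indexability signals are expressed as follows (the canonical URL is illustrative):

    <!-- Robots directive: exclude this page from the index, but follow its links -->
    <meta name="robots" content="noindex, follow">

    <!-- Canonical: declares another URL as the preferred version of this page -->
    <link rel="canonical" href="https://www.example.com/preferred-page">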

How Google handles Long-Term Noindex

From a search engine's perspective, crawling documents that are excluded from indexing is inefficient. In late 2017, Google's Webmaster Trends Analyst John Müller explained that the valuation of long-term noindex pages can change over time and therefore has an effect on the link graph.

Afterwards he also answered a number of questions on Twitter:

Question on Twitter to John Müller: So noindex, follow after a time is treated as noindex, nofollow? How long would you say before that happens? Answer: It depends :-)

John Müller on Twitter: If we end up dropping a page from the index, we end up dropping everything from it. Noindex pages are sometimes [..] like 404s.

All of this leaves us with some SEO risks. To get a better understanding, let's take a more detailed look.

How the indexability of a document isolates other documents

Let's look at a number of scenarios with three documents (A, B, C) and see what happens to document C depending on the indexability of document B. In all scenarios, links connect A ⇆ B and B ⇆ C. Document A is always considered to be crawled and set to "index, follow". Document C is always set to "index, follow" and is only linked from document B.
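
To make the scenarios easy to verify, here is a minimal Python sketch (our own illustration, not Audisto's actual implementation) that models the pages as a directed graph and computes which documents are reachable from document A:

    # Minimal model: the link graph as a dict of outgoing links per document.
    def reachable(graph, start):
        """Return the set of documents reachable from `start` by following links."""
        seen = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, []))
        return seen

    # Base setup: A <-> B and B <-> C are linked.
    graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    print(sorted(reachable(graph, "A")))  # ['A', 'B', 'C'] -- nothing is isolated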

Normal Indexation Scenario

Setup

Document B has an "index, follow" robots directive.

Figure: Three pages A, B, and C, where C is linked from B, which is set to "index, follow".

Result

All documents can be crawled and indexed. All links are followed.

Figure: Three pages A, B, and C, where C is linked from B, which is set to "index, follow". C is not isolated.

Long-Term Noindex Scenario

Setup

Document B has a "noindex, follow" robots directive.

Figure: Three pages A, B, and C, where C is linked from B, which is set to "noindex, follow".

Result

In the beginning all documents will be crawled and document C will be indexed. All links are followed.

Figure: Three pages A, B, and C, where C is linked from B, which is set to "noindex, follow". C is indexed at first.

After some time, the valuation of document B changes due to long-term noindex. The outgoing links from document B are removed from the link graph, and document C becomes isolated.

Figure: Three pages A, B, and C, where C is linked from B, which is set to "noindex, follow". C becomes isolated once the links of B have been removed from the link graph.
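
In terms of the sketch above, the long-term noindex effect amounts to emptying document B's list of outgoing links; C then drops out of the reachable set:

    # Long-term noindex scenario, reusing reachable() from the sketch above:
    # B is still linked from A, but B's own outgoing links have been removed.
    graph = {"A": ["B"], "B": [], "C": ["B"]}
    print(sorted(reachable(graph, "A")))  # ['A', 'B'] -- C is isolated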

Long-Term Rel-Canonical Scenario

Even though John Müller has not made any explicit comments regarding long-term rel-canonical, following the same "use resources efficiently" logic, it is very likely that it is interpreted similarly to long-term noindex:

Setup

Document B has an "index, follow" robots directive but a canonical link pointing to document A.

Figure: Three pages A, B, and C, where C is linked from B, which is set to "index, follow", while B references A as canonical.

Result

In the beginning all documents will be crawled and document C will be indexed. Document B will most likely not show up in search results due to the canonical pointing to another page. All links are followed.

Figure: Three pages A, B, and C, where C is linked from B, which is set to "index, follow", while B references A as canonical. C is indexed at first.

After some time, the valuation of document B changes due to the long-term rel-canonical to another URL. The outgoing links from document B are removed from the link graph, and document C becomes isolated.

Figure: Three pages A, B, and C, where C is linked from B, which is set to "index, follow", while B references A as canonical. C becomes isolated once the links of B have been removed from the link graph.
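
Under the assumption above, the canonical scenario behaves identically in the sketch: a page whose canonical points elsewhere eventually loses its outgoing links in the link graph.

    # Long-term canonical scenario, reusing reachable() from the earlier sketch.
    # Assumption (see above): a canonicalized page is eventually treated like a
    # long-term noindex page, so its outgoing links are dropped.
    graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    canonicals = {"B": "A"}          # B declares A as its canonical
    for page in canonicals:
        graph[page] = []             # drop the outgoing links of B
    print(sorted(reachable(graph, "A")))  # ['A', 'B'] -- C is isolated again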

What this means for pagination and other complex scenarios (e.g. HTML sitemaps)

Pagination

Setup

We have a category page C with an "index, follow" robots directive and pagination pages P1, P2, and P3 with a "noindex, follow" robots directive. All of these pages link to a number of item pages with an "index" robots directive.

Figure: A pagination setup where the pagination pages are set to "noindex, follow".

Result

In the beginning all documents will be crawled and the category and item pages will be indexed. All links are followed.

Figure: A pagination setup where the pagination pages are set to "noindex, follow". At first, all pagination and item pages are indexed.

After some time, the valuation of P1 changes due to long-term noindex. The outgoing links from P1 are removed from the link graph. P2, P3, and all item pages linked only from the pagination pages become isolated.

Figure: A pagination setup where the pagination pages are set to "noindex, follow". Once the first pagination page's links are removed from the link graph, all other pagination pages and all item pages linked from the pagination pages become isolated.
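
The same helper from the earlier sketch shows the cascade. The item pages and their distribution are illustrative; here the category links to one item directly and each pagination page links to further items:

    # Pagination scenario, reusing reachable() from the earlier sketch.
    graph = {
        "C":  ["P1", "item1"],   # the category links page 1 and one item
        "P1": ["P2", "item2"],
        "P2": ["P3", "item3"],
        "P3": ["item4"],
    }
    print(sorted(reachable(graph, "C")))  # all pagination and item pages are reachable

    # Long-term noindex: the pagination pages lose their outgoing links.
    for page in ("P1", "P2", "P3"):
        graph[page] = []
    print(sorted(reachable(graph, "C")))  # ['C', 'P1', 'item1'] -- the rest is isolated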

Complex scenarios (e.g. HTML sitemaps)

Setup

We have a number of documents A to I. Documents D and F have a "noindex, follow" robots directive. Document B has a rel-canonical pointing to document A.

Figure: A complex link graph with noindex, index, and canonical links.

Result

In the beginning, all documents will be crawled, all links are followed, and all documents except D and F will be indexed. Document B will most likely not show up in search results due to the canonical pointing to another page.

Figure: A complex link graph with noindex, index, and canonical links. At first, all pages (except the noindex pages) are indexed.

After some time, the valuation of D and F changes due to long-term noindex. The outgoing links from D and F are removed from the link graph. Document E becomes isolated.

Figure: A complex link graph with noindex, index, and canonical links after the links of the noindex pages have been removed from the link graph.

If we assume the same effect for long-term canonical, the valuation of B changes as well. The outgoing links from B are removed from the link graph. In this case, C, D, and E become isolated.

Figure: A complex link graph with noindex, index, and canonical links after the links of the noindex pages and of the pages with a canonical pointing elsewhere have been removed from the link graph.

It is also noteworthy that the shortest path to G changes from A → B → G to the longer path A → I → H → G.
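
The exact link structure is only given in the figure, but the following continuation of the earlier sketch uses one plausible edge set that is consistent with all outcomes described above, including the changed shortest path to G:

    # Complex scenario, reusing reachable() from the earlier sketch.
    # Hypothetical edge set, chosen to match the outcomes described in the text.
    graph = {
        "A": ["B", "I"],
        "B": ["C", "G"],   # B has a canonical pointing to A
        "C": ["D"],
        "D": ["E"],        # noindex
        "E": [],
        "F": ["E"],        # noindex
        "G": [],
        "H": ["G", "F"],
        "I": ["H"],
    }
    print(sorted(reachable(graph, "A")))  # A through I -- everything is reachable

    # Long-term noindex: D and F lose their outgoing links.
    graph["D"] = []
    graph["F"] = []
    print(sorted(reachable(graph, "A")))  # E is missing -- it is isolated

    # Long-term canonical: B loses its outgoing links as well.
    graph["B"] = []
    print(sorted(reachable(graph, "A")))  # C, D and E are missing
    # The shortest path to G is now A -> I -> H -> G instead of A -> B -> G.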

While there is unfortunately no explicit statement about the exact timeframe after which the long-term effects take place, the risk described above is significant if this effect is not analyzed properly. If you find larger parts of your project affected, it is a good idea to start with a proper structural analysis.