Page Depth and Levels in the Audisto Crawler

If a user starts browsing a website, some pages are easy to reach, since they are only a few clicks away, while others require quite some effort to reach.

Page Depth is a measure of the distance from one resource to another. It counts how many clicks it takes a user to move from the start page to any given page.

The number of clicks necessary to reach a given resource is called its level. The start page resides on level zero, since no clicks are needed. All pages that are accessible from the start page have a level of one. Pages that are linked from pages on the first level have a level of two, and so on.
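
In graph terms, this is simply a breadth-first search over the link graph, starting at the start page. The following sketch illustrates the idea; the mapping of URLs to linked URLs and the function name are assumptions made for the example, not the Audisto Crawler's actual implementation.

    from collections import deque

    def assign_levels(start_url, links):
        # links is a hypothetical mapping of each URL to the URLs it links to
        levels = {start_url: 0}              # the start page resides on level zero
        queue = deque([start_url])
        while queue:
            url = queue.popleft()
            for target in links.get(url, []):
                if target not in levels:     # the first (shortest) path found wins
                    levels[target] = levels[url] + 1
                    queue.append(target)
        return levels

    print(assign_levels("/", {"/": ["/a", "/b"], "/a": ["/c"]}))
    # {'/': 0, '/a': 1, '/b': 1, '/c': 2}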

User and Bot View

The Audisto Crawler provides two different perspectives on page depth:

  • User View: How many clicks it takes an actual user to reach a page
  • Bot View: How many requests it takes for a bot to reach a resource

The User View

Users do not see all references to other resources on a page. Their interaction with a site is mainly limited to

  • Clickable anchor links (the <a> tag)
  • Redirects

The Audisto Crawler therefore considers only these kinds of references when building the User Graph. It ignores references such as <link>, <img>, <script> etc.
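
A minimal sketch of that filtering, using Python's standard html.parser: only <a> tags contribute targets, while <link>, <img> and <script> are never handled. The class and its usage are illustrative, not the Audisto Crawler's parser.

    from html.parser import HTMLParser

    class AnchorCollector(HTMLParser):
        # Collects href targets of <a> tags only.
        def __init__(self):
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":                       # anchor links are user-visible
                for name, value in attrs:
                    if name == "href" and value:
                        self.hrefs.append(value)
            # <link>, <img>, <script> etc. are simply ignored here

    collector = AnchorCollector()
    collector.feed('<a href="/shop">Shop</a><img src="/logo.png"><script src="/app.js"></script>')
    print(collector.hrefs)                       # ['/shop']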

The Audisto Crawler regards redirects as invisible to the user, so they do not increment the level. For example:

  • Page A has a user related level of 4
  • Page A links to page B
  • Page B redirects to page C

Figure: Link graph of three pages linking and redirecting, from a user's perspective

Both B and C would be assigned a user level of 5:

  • Page B gets a user level of 5, which is the level of page A plus one click
  • Page C also gets a user level of 5, since no click is needed to move from page B to page C
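
One way to express this rule in code is to treat a link as an edge that costs one click and a redirect as an edge that costs none, and to take the cheapest path from the start page. The edge-labelled graph and the function name below are assumptions made for the sketch.

    from collections import deque

    def user_levels(start, edges):
        # edges: {url: [(target, "link" or "redirect"), ...]}  (hypothetical format)
        levels = {start: 0}
        queue = deque([start])
        while queue:
            url = queue.popleft()
            for target, kind in edges.get(url, []):
                cost = 0 if kind == "redirect" else 1    # redirects are invisible to the user
                if target not in levels or levels[url] + cost < levels[target]:
                    levels[target] = levels[url] + cost
                    if cost == 0:
                        queue.appendleft(target)         # settle free edges first
                    else:
                        queue.append(target)
        return levels

    edges = {"A": [("B", "link")], "B": [("C", "redirect")]}
    print(user_levels("A", edges))   # {'A': 0, 'B': 1, 'C': 1} -> B and C share a level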

The Bot View

From a bot's perspective, things are different. A bot can see and understand meta references such as <link>, and takes them into consideration when building the Bot Graph. Which kinds of references the crawler should follow is configurable when creating a crawl.

In contrast to the user view, the bot view is centered around requests, not clicks: the number of requests necessary to reach a given resource is counted. These requests must be legal, i.e. the bot must not violate the rules given by

  • robots.txt
  • robots directives
  • nofollow attributes
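
Of these rules, robots.txt is the one most easily checked in isolation. A minimal sketch using Python's standard urllib.robotparser follows; the user agent and URLs are placeholders, and robots directives and nofollow attributes would additionally have to be read from each fetched page itself.

    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://www.example.com/robots.txt")
    parser.read()                                 # fetch and parse robots.txt

    url = "https://www.example.com/some/page"
    if parser.can_fetch("ExampleBot", url):       # placeholder user agent
        print("request is allowed")
    else:
        print("request would violate robots.txt")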

Given the above example of a redirect:

  • Page A has a bot related level of 4
  • Page A links to page B
  • Page B redirects to page C

Figure: Link graph of three pages linking and redirecting, from a bot's perspective

Page B would get a bot level of 5, and page C a bot level of 6.

  • Page B gets a bot level of 5, which is the level of page A plus one request
  • Page C gets a bot level of 6, which is the level of page B plus one request
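
Reusing the hypothetical edge-labelled graph from the user view sketch, the bot level treats every edge as one request, no matter whether it is a link or a redirect.

    from collections import deque

    def bot_levels(start, edges):
        # edges: {url: [(target, "link" or "redirect"), ...]}  (hypothetical format)
        levels = {start: 0}
        queue = deque([start])
        while queue:
            url = queue.popleft()
            for target, _kind in edges.get(url, []):   # the kind does not matter here
                if target not in levels:
                    levels[target] = levels[url] + 1   # every reference costs one request
                    queue.append(target)
        return levels

    edges = {"A": [("B", "link")], "B": [("C", "redirect")]}
    print(bot_levels("A", edges))   # {'A': 0, 'B': 1, 'C': 2} -> C is one level deeper than B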

Nofollow links

An edge case is the handling of pages that

  • are first linked with nofollow, but
  • are later also linked with follow.

In this case, the bot level is set one below the first resource that links to the page with follow, i.e. one level deeper than that resource. For example:

  • Page A has a bot level of 4
  • Page A links to page B using follow
  • Page A links to page C using nofollow
  • Page B links to page C using follow
  • Page C links to page D using follow

Figure: Link graph of four pages linking with follow and nofollow

Page D would be assigned a bot level of 7, not the 6 one might expect.

  • Page B gets a bot level of 5, which is one below page A
  • Since page B links to page C with follow, page C gets a bot level of 6, which is one below page B
  • Page D gets a bot level of 7, which is one below page C

This is because the shortest legal path to page D runs via pages B and C; the direct nofollow link from page A to page C must not be used.
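
In graph terms, nofollow edges are simply left out of the legal Bot Graph before the shortest paths are computed. A sketch under that assumption (the rel labels and page names are illustrative):

    from collections import deque

    def bot_levels_follow_only(start, edges):
        # edges: {url: [(target, "follow" or "nofollow"), ...]}  (hypothetical format)
        levels = {start: 0}
        queue = deque([start])
        while queue:
            url = queue.popleft()
            for target, rel in edges.get(url, []):
                if rel == "nofollow":             # an illegal edge never shortens a path
                    continue
                if target not in levels:
                    levels[target] = levels[url] + 1
                    queue.append(target)
        return levels

    edges = {
        "A": [("B", "follow"), ("C", "nofollow")],
        "B": [("C", "follow")],
        "C": [("D", "follow")],
    }
    print(bot_levels_follow_only("A", edges))
    # {'A': 0, 'B': 1, 'C': 2, 'D': 3} -> D ends up three levels deeper than A, as in 7 vs. 4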

Crawled vs. Discovered Resources

Regarding the distribution of resources over levels, the Audisto Crawler distinguishes between

  • Crawled Resources: Resources that were downloaded and processed by the crawler. These are HTML pages.
  • Discovered Resources: Every resource the crawler knows about, because it was linked in one way or another.
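
A minimal sketch of this distinction as two sets maintained during a crawl; the helper names are made up for the example.

    discovered = set()   # every resource that was referenced in some way
    crawled = set()      # HTML pages that were actually downloaded and processed

    def record_reference(url):
        discovered.add(url)

    def record_crawl(url):
        discovered.add(url)      # a crawled page has necessarily been discovered as well
        crawled.add(url)

    record_reference("https://www.example.com/logo.png")   # e.g. an <img> reference
    record_crawl("https://www.example.com/")               # an HTML page that was processed
    print(len(discovered), len(crawled))                   # prints: 2 1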