How many pages do I need to crawl? - Crawling scenarios

For a website analysis, it makes a difference whether you crawl the whole site or just parts of it. Whenever a crawl hits its limits, you might miss important insights. There are, however, some circumstances where crawling only sections is enough. This guide points out the difference between incomplete crawls and crawls that cover a whole site.

This guide covers:

Which type of crawl is right?

Incomplete Crawls

Incomplete crawls are a great way to get started. If you crawl your site for the first time, it is very likely that you will find some major structural and on-page problems.

Once you've spotted a problem, you can fix it. Fixing structural problems usually reduces the number of pages that will be found in the next crawl. Fixing on-page problems usually means fixing a template, which in turn fixes all pages that use it.

This is great for getting started. However, some scenarios require complete crawls!

Complete Crawls

Complete crawls give you many more insights. Once you've fixed some major problems, you should aim to crawl your whole website.

The right type of crawl for different problems

Structure of your Website

Only a complete crawl lets you make precise statements about the structure of the site. A complete crawl can tell you how many pages you have, how they are distributed across the levels, and how the pages are connected. With a complete crawl, we can calculate metrics like PageRank and CheiRank to identify your most important pages.
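To make the idea concrete, here is a minimal sketch of how level distribution and a PageRank-style score could be derived from a crawled link graph. The `links` data, the function names, and the damping factor of 0.85 are assumptions chosen for illustration; this is not the calculation any particular crawler uses.

```python
from collections import Counter, deque

# Hypothetical link graph from a complete crawl: page -> pages it links to.
links = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/a/1", "/"],
    "/a/1": ["/"],
}

def levels(links, start="/"):
    """Breadth-first search: a page's level is its minimum click distance from start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

def pagerank(links, damping=0.85, iterations=50):
    """Plain power iteration; pages without known outlinks spread their rank evenly."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = [t for t in links[page] if t in links]
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Pages per level: one page on level 0, two on level 1, one on level 2.
distribution = Counter(levels(links).values())
ranks = pagerank(links)
most_important = max(ranks, key=ranks.get)
```

Both results depend on every link being known. If the crawl stops early, pages are missing from `links`, and both the level counts and the scores shift. CheiRank can be sketched the same way by running the calculation on the graph with all links inverted.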

With an incomplete crawl, none of this is possible. There is a good chance that you have many more pages than you expect. We've seen infinite paginations and calendars and lots of problems with relative links. With an incomplete crawl you can only tell that you have at least the number of discovered levels, and whenever you calculate a metric you already know that the data is not precise.
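One way a relative link can inflate a site is sketched below. The example assumes a hypothetical shop page whose template emits `shop/page/` without the leading slash; the URLs and the `follow` helper are made up for illustration.

```python
from urllib.parse import urljoin

def follow(start_url, href, steps):
    """Resolve the same href again and again, as a crawler does when every page repeats it."""
    urls = [start_url]
    for _ in range(steps):
        urls.append(urljoin(urls[-1], href))
    return urls

# Relative href without a leading slash: every hop yields a new, deeper URL.
broken = follow("https://example.com/shop/", "shop/page/", 3)

# Root-relative href: every hop resolves to the same URL.
fixed = follow("https://example.com/shop/", "/shop/page/", 3)

for url in broken:
    print(url)
```

With the broken link the crawler keeps discovering new URLs on ever deeper levels, so a crawl that stops at its limit cannot tell how many real pages exist.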

Always do complete crawls when you try to work on the structure of your site!

Problems in Templates

With a complete crawl you will usually find all the problems in all your relevant templates.

With an incomplete crawl, it is likely that some templates weren't covered. We've often seen sites where a specific type of page only occurs on lower levels. The templates used for those pages are only covered when the crawl can reach the pages.

Problems with your Data

With a complete crawl you can do effective testing of your data. The crawler will usually see all the displayed data at least once.

With an incomplete crawl you might miss errors in your data. We've seen large sites where important data was missing in a fairly large set of pages. Sometimes the pages were even unusable or links to those pages weren't clickable because of a missing title.
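A check for the last case, links that cannot be clicked because their text is empty, might look like the following sketch. It uses Python's standard `html.parser`; the class name, helper, and sample markup are assumptions, and links that contain only an image would also be reported, so treat it as a starting point.

```python
from html.parser import HTMLParser

class EmptyLinkFinder(HTMLParser):
    """Collects the href of every <a> element whose text is empty or whitespace."""

    def __init__(self):
        super().__init__()
        self.empty_links = []
        self._href = None   # href of the link currently being read, if any
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = ""

    def handle_data(self, data):
        if self._href is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if not self._text.strip():
                self.empty_links.append(self._href)
            self._href = None

def find_empty_links(html):
    finder = EmptyLinkFinder()
    finder.feed(html)
    return finder.empty_links

# Hypothetical listing page where two products have no title in the database.
sample = (
    '<ul>'
    '<li><a href="/p/1">Blue shirt</a></li>'
    '<li><a href="/p/2"></a></li>'
    '<li><a href="/p/3">  </a></li>'
    '</ul>'
)
print(find_empty_links(sample))
```

Such a check only reports the pages it is run on, which is why data problems of this kind call for a crawl that reaches every page.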

If you want to identify problems within your data, you have to analyze all your pages!

Programming Problems

With a complete crawl you also test the programming of your site. You might not find new problems, but you can at least tell that your code is capable of handling and displaying the data you have right now.

With an incomplete crawl you might miss errors in your programming. We've seen sites that could not handle some of their data and showed detailed error reports, including database passwords, instead of the content. With an incomplete crawl you miss the chance to detect this kind of problem.

Problems with Response Times

With a complete crawl you'll get a detailed analysis of the response times of all crawled pages. This data is very useful for spotting performance issues.

With an incomplete crawl, there is a good chance that you miss a number of performance issues. We've seen a number of sites where only cached pages were fast and uncached pages on deeper levels performed very poorly. In fact, we've even seen sites where the performance was so bad that a crawl with 24 parallel requests was enough to produce results similar to a DDoS attack.
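The effect of slow pages hiding on deeper levels can be illustrated with a small sketch that groups response times by level. The records, the function names, and the 1000 ms threshold are invented for illustration and do not come from a real crawl.

```python
from collections import defaultdict
from statistics import median

# Hypothetical crawl records: (level, response time in milliseconds).
crawl = [
    (0, 120),
    (1, 140), (1, 160),
    (2, 150),
    (3, 2400), (3, 3100), (3, 2800),
]

def median_by_level(records):
    """Group response times by level and return the median for each level."""
    grouped = defaultdict(list)
    for level, ms in records:
        grouped[level].append(ms)
    return {level: median(times) for level, times in sorted(grouped.items())}

def slow_levels(records, threshold_ms=1000):
    """Levels whose median response time exceeds the threshold."""
    return [level for level, m in median_by_level(records).items() if m > threshold_ms]

# A crawl that stopped after level 2 sees none of the slow pages.
incomplete = [record for record in crawl if record[0] <= 2]

print(slow_levels(crawl))
print(slow_levels(incomplete))
```

In this made-up data set the slow, uncached pages all sit on level 3, so only the complete crawl flags a problem.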

If you want to check for this kind of performance issue, and for the security issues that come with it, you have to aim for complete crawls as well!

 
