Crawl Budget and Crawl Rate Optimization

Crawl budget and crawl rate optimization is about getting your content crawled by search engines and refreshed in their index more often. Both can be very important levers for technical search engine optimization, especially for large websites.

This article will discuss how to detect crawl-related problems and provide helpful techniques to improve your website.

Technical Background

Search engines have changed a lot over the years to keep providing the best search results to their users. However, one thing has not changed: they still only crawl a small portion of the web. The web is simply too big to crawl and index in its entirety.

So the challenge for any search engine is to keep a fresh index of important documents despite its limited resources. This means making decisions like:

  • Prioritizing one document over another
  • Crawling one page and ignoring another
  • Re-crawling a resource more often, less often, or never

These limitations affect all websites. However, large websites need to deal with them to a larger extent.

What is Crawl Budget?

The crawl budget is set by search engines and determines the maximum number of requests the crawler will make. Think of it, for example, as the number of requests a search engine is willing to make on your site per month.

Search engines usually use a number of parameters to assign a crawl budget to a given host. Here are some potential parameters:

  • Relevancy
  • Performance
  • Size
  • Freshness
  • Links

For obvious reasons, the number of crawlable URLs on a site should be lower than the crawl budget, so that search engines can fully crawl the site and maintain a fresh index of all its pages.

What is Crawl Rate?

The crawl rate is the number of crawl requests within a specific time frame. Think of it as how often a search engine crawls a specific page or directory per hour or day.

You want to maintain a sufficient crawl rate for each URL, because:

  • If the content of a URL has changed, you want search engines to notice the change as soon as possible. A sufficient crawl rate for the URL is the foundation for getting the change indexed quickly.
  • If you have new URLs, you want them to be crawled as soon as possible. A sufficient crawl rate on the documents linking to the new URLs is the foundation for getting them crawled quickly.

Goals for Crawl Budget and Crawl Rate Optimization

The goals for crawl budget and crawl rate optimization are to achieve:

  • A full crawl of your site within a reasonable amount of time
  • A fast recognition of changes by search engines
  • A fast discovery of new content by search engines
  • An optimized crawl that saves resources (server and traffic cost)

In short: we want to achieve deeper crawling with fewer requests and higher freshness.

Identify Crawl Budget Issues

A website might have crawl budget issues if it has so many URLs that search engine spiders need a long time to crawl the entire website.

If one of the following issues exists for a website, you might consider it a crawl budget issue:

  • Website does not get fully crawled
  • A full crawl takes longer than a month
  • New or changed URLs are crawled with a long delay

To see whether a search engine crawls the entire site or whether parts of the site do not get crawled, you need to compare data on search engine requests - usually a web server logfile - with data on the site's structure, for example from a full Audisto crawl.
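
As a rough illustration, the following sketch compares the two data sets. It assumes a combined-format access log and a plain-text URL export with one path per line; the file names, the "Googlebot" user agent check, and the log format are placeholder assumptions you would adapt to your setup.

    import re

    LOGFILE = "access.log"            # placeholder: your web server logfile
    URL_EXPORT = "crawled_urls.txt"   # placeholder: one path per line, e.g. from a site crawl

    # Collect all paths requested by the crawler (simplified user agent check).
    requested = set()
    request_pattern = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')
    with open(LOGFILE, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if "Googlebot" not in line:
                continue
            match = request_pattern.search(line)
            if match:
                requested.add(match.group(1))

    # Compare against the full set of URLs known from the site structure.
    with open(URL_EXPORT, encoding="utf-8") as fh:
        known = {line.strip() for line in fh if line.strip()}

    never_crawled = known - requested
    print(f"{len(never_crawled)} of {len(known)} known URLs were never requested")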

Identify Crawl Rate Issues

There are two types of crawl rate issues:

  • Low crawl rate
  • High crawl rate

Crawl rate issues can be detected by examining a web server's logfile.

Low Crawl Rate Issues

If the website's content is changed or new content is released, it might go unnoticed by search engines for too long. In this situation it can take weeks even for the big search engines to reflect content changes in their search results.

This issue can be detected with a simple logfile analysis. A look into the logfiles reveals when a page was crawled. The timestamp in the logfile can be compared to the release date or modification date of the content.
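
As a minimal sketch, the last crawl timestamp of a single URL can be extracted from the logfile and compared to its modification date. The file name, URL, modification date, and the combined log timestamp format are placeholder assumptions.

    import re
    from datetime import datetime, timezone

    LOGFILE = "access.log"                                   # placeholder path
    URL = "/blog/new-article/"                               # URL to check
    MODIFIED = datetime(2024, 1, 10, tzinfo=timezone.utc)    # known content change

    last_crawl = None
    # Typical combined log timestamp: [10/Jan/2024:13:55:36 +0000]
    ts_pattern = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")
    with open(LOGFILE, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if "Googlebot" not in line or URL not in line:
                continue
            match = ts_pattern.search(line)
            if match:
                ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
                if last_crawl is None or ts > last_crawl:
                    last_crawl = ts

    if last_crawl is None or last_crawl < MODIFIED:
        print("The change has not been crawled yet")
    else:
        print(f"Last crawl after the change: {last_crawl.isoformat()}")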

If one of the following issues exists for a website, you might consider it a low crawl rate issue:

  • A new URL has not been crawled within one day
  • A URL with content changes has not been crawled within one day

High Crawl Rate Issues

A high crawl rate can cause issues as well. A high crawl rate on documents without frequent changes consumes crawl budget and can cause low crawl rate issues for recently changed or new documents.

This type of issue can be detected with a logfile analysis as well. For each URL, the number of requests within a given time frame should be counted and compared with the rate of actual content changes.
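
A minimal sketch for counting crawler requests per URL and day could look like the following; the logfile path, the user agent check, and the threshold of three requests per day are placeholder assumptions.

    import re
    from collections import Counter

    LOGFILE = "access.log"  # placeholder path

    # Count (day, path) pairs for crawler requests.
    line_pattern = re.compile(r'\[(\d{2}/\w{3}/\d{4}):.*?"(?:GET|HEAD) (\S+) ')
    counts = Counter()
    with open(LOGFILE, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if "Googlebot" not in line:
                continue
            match = line_pattern.search(line)
            if match:
                day, path = match.groups()
                counts[(day, path)] += 1

    # URLs crawled several times on the same day are candidates for review.
    for (day, path), hits in counts.most_common(20):
        if hits > 3:
            print(f"{day} {path}: {hits} requests")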

If one of the following issues exists for a website, you might consider it a high crawl rate issue:

  • A URL with a low change frequency gets crawled several times a day
  • A URL with a high change frequency gets crawled several times a day because it lists new content, but the new content itself is not crawled within a day
  • The server's response times degrade because of a large number of simultaneous requests

How to optimize Crawl Budget and Crawl Rate

Basic principles of Crawl Budget and Crawl Rate Optimization

The crawl budget and the crawl rate can be optimized by following these principles:

  • Reduce the number of crawlable URLs
  • Remove the performance bottlenecks
  • Provide information about changed and unchanged content
  • Shift priority to relevant URLs

Let's have a look at the principles in detail.

Reduce the number of crawlable URLs

There are a number of optimization opportunities that you can apply to reduce the number of URLs on a website.

Server Status 30x - Redirected URLs

30x redirects on a website get crawled by search engines. This consumes crawl budget. To minimize the crawl budget consumption, you should only keep necessary redirects.

From an SEO perspective, these are redirects that:

  • Collect link juice
  • Collect traffic

Internal links to redirected URLs should be changed so they point directly to the redirect target URLs.
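
A minimal sketch using the third-party requests library can resolve the final target of each redirected link so the internal links can be updated; the URLs below are placeholders.

    import requests

    # Internal links that currently point to redirecting URLs (placeholders).
    redirecting_links = [
        "https://www.example.com/old-category/",
        "https://www.example.com/product-123",
    ]

    for url in redirecting_links:
        # Follow the redirect chain and report the final target URL.
        response = requests.head(url, allow_redirects=True, timeout=10)
        print(f"{url} -> {response.url} (status {response.status_code})")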

As soon as the internal link graph is cleaned, a look into the server logfiles reveals all redirects triggered by external factors. Those might be incoming links from external websites, bookmarks, or simply the fact that the URLs are already known to search engines.

If no external links point to a redirect and it does not receive traffic, it is safe to remove it and send a 410 status code instead. For most search engines, the 410 status code will minimize the number of recurring crawls after it is first detected.
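
As a sketch only, retired URLs could be answered with a 410 status like this, using Python's built-in http.server; the URL list is hypothetical, and in production you would configure this in your web server or application instead.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical list of retired URLs that should no longer be crawled.
    GONE = {"/old-campaign/", "/discontinued-product/"}

    class GoneHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in GONE:
                self.send_response(410)   # Gone - discourages recurring crawls
                self.end_headers()
            else:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"OK")

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8000), GoneHandler).serve_forever()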

Server Status 4xx - URLs that are no longer active

The older a website, the higher the risk of having internal links that point to URLs with a 4xx status code. These internal links add no value for the user, while search engine spiders will still follow them and crawl the target pages. This consumes your crawl budget.

The crawl budget usage can be optimized by removing these links.

URLs for outdated documents or with no individual value

Crawling URLs for temporarily relevant content or content with no individual value also consumes your crawl budget, while providing no value to users or search results.

Crawl budget can be freed by removing outdated content, duplicate or highly similar content, and thin content.

URLs for documents that should not be indexed

URLs that should not be indexed are usually marked with noindex or have a canonical element that points to another page. From an SEO perspective they don't generate any traffic for the website - but they still consume crawl budget.

Crawl budget can be freed by removing the URLs or by excluding them from crawling.

URLs for faceted navigation

Faceted navigation can generate a very large number of URLs. Allowing users to filter by attributes, categories, price, or availability and to sort by relevance, rating, or popularity can easily generate millions or even billions of URLs.

A smart faceted navigation can preserve the functionality for the user and help to free crawl budget by removing URLs that are not relevant from an SEO perspective.

URLs due to GET-Parameters

Another classic cause of crawl budget waste and a poor crawl rate on important pages is a website with a lot of GET-parameters. Issues with GET-parameters often originate in the faceted navigation. Other common causes are:

  • GET parameters used in random order
  • Tracking parameters

To reduce the number of crawlable URLs, GET-parameters should always be ordered, e.g., alphabetically.
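
A minimal sketch using Python's standard library shows how a URL's query parameters could be normalized into a stable, alphabetical order; the URLs are placeholders.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def normalize_query(url):
        """Return the URL with its query parameters sorted alphabetically."""
        parts = urlsplit(url)
        params = sorted(parse_qsl(parts.query, keep_blank_values=True))
        return urlunsplit(parts._replace(query=urlencode(params)))

    # Both parameter orders collapse to one canonical URL.
    print(normalize_query("https://www.example.com/list?size=m&color=red"))
    print(normalize_query("https://www.example.com/list?color=red&size=m"))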

Instead of using GET-parameters for tracking, you might consider using other tracking solutions or at least redirecting the user after a tracking URL has been accessed.

Note: If you are using Google Analytics with GET-parameters for campaign tracking, you should consider using anchors (URL fragments) for tracking instead of GET-parameters.

URLs due to references in XML Sitemaps

If you have outdated XML sitemaps, these might contain URLs that are no longer active. Search engine spiders will still crawl these URLs, and crawl budget is wasted.

Keep in mind that search engines discover XML sitemaps in different ways and might continue crawling the sitemaps even after all references to them have been removed.

Most search engines list discovered XML sitemaps in their webmaster tools. A logfile analysis can also reveal this kind of problem.
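
A minimal sketch can fetch an XML sitemap and flag listed URLs that no longer answer with a 200 status; the sitemap URL is a placeholder, and redirects are followed silently in this simplified check.

    import urllib.error
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urllib.request.urlopen(SITEMAP_URL) as response:
        tree = ET.parse(response)

    for loc in tree.findall(".//sm:url/sm:loc", NS):
        url = loc.text.strip()
        try:
            with urllib.request.urlopen(url) as page:
                status = page.status
        except urllib.error.HTTPError as err:
            status = err.code
        if status != 200:
            print(f"{url} answers with status {status}")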

URLs for non-HTML documents

Search engines usually index a limited set of file types. To determine the file type of a document, the URL of the document has to be crawled. This means: if a website has links pointing to a URL, the URL will get crawled even if the document's file type cannot be indexed. This wastes crawl budget and might generate a lot of traffic, e.g. for downloads.

We suggest not linking to non-indexable file types. We also suggest blocking non-indexable files from crawling, e.g. via robots.txt (read our article about robots.txt files for more information).
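
A minimal sketch using the standard library's urllib.robotparser can verify that such files are actually blocked for crawlers; the rules and paths below are only examples.

    from urllib.robotparser import RobotFileParser

    # Example rules blocking typical non-indexable downloads (adjust to your site).
    rules = [
        "User-agent: *",
        "Disallow: /downloads/",
        "Disallow: /media/archives/",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    for path in ("/downloads/manual.pdf", "/media/archives/press-kit.zip", "/products/"):
        allowed = parser.can_fetch("Googlebot", f"https://www.example.com{path}")
        print(f"{path}: {'allowed' if allowed else 'blocked'}")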

Remove Performance Bottlenecks

Crawling can be very performance intensive. The priority for search engines is to achieve a full crawl of a website as fast as possible without hurting the site's performance.

Search engines measure response times. If response times indicate a performance problem with the website, the crawler slows down and the number of requests within a specific time frame is lowered. Even single slow URLs can be interpreted by search engines as a site-wide performance problem and might result in a lower crawl rate.

We suggest measuring the response time of every document in your tracking software to detect performance issues. URLs with bad performance should then be optimized.
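
For spot checks, a minimal sketch can time a few requests; the URL list is a placeholder, and continuous monitoring belongs in your tracking software.

    import time
    import urllib.request

    # Placeholder URLs - replace with pages you suspect to be slow.
    urls = [
        "https://www.example.com/",
        "https://www.example.com/search?q=shoes",
        "https://www.example.com/category/page/50/",
    ]

    for url in urls:
        start = time.perf_counter()
        with urllib.request.urlopen(url) as response:
            response.read()
        elapsed = time.perf_counter() - start
        print(f"{elapsed:.3f}s  {url}")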

When searching for performance bottlenecks, you might look for:

  • Pages deep within a pagination (high page numbers)
  • Internal search result pages
  • Pages that use data from slow resources
    • External data
    • Slow databases
  • Pages that are only fast when cached

Provide Information About Changed and Unchanged Content

Search engines try to optimize their crawl behaviour to achieve higher index freshness with fewer requests. Part of this optimization is using information about content freshness to decide whether a document needs to be crawled or not. The webmaster can provide information about changes on the website in several ways:

XML Sitemaps

XML Sitemaps can be used to deliver additional information to the search engine crawlers about changed content on URLs.

The following information can be provided to influence the crawl behaviour:

  • Change frequency (changefreq)
  • Last modification date (lastmod)

If you have the opportunity to set the "last modification date", we suggest using this information exclusively, because it is the most precise information you can supply to a search engine crawler.

If you have URLs that aggregate or list content, you can use the last modification date of the most recently changed document.

If you don't have a last modification date but you are able to estimate a change frequency, you can provide a change frequency.

If you find it difficult to estimate a change frequency, we suggest not setting one at all. Inaccurate values might cause the search engine to ignore the provided information. We also suggest not using change frequency and last modification date for the same URL at the same time.

On large sites we suggest providing a "hot sitemap" that lists only URLs which have changed recently. Search engines tend to access frequently changing sitemaps more often.
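
A minimal sketch for generating such a "hot sitemap" with last modification dates could look like this; the URLs and dates are hypothetical.

    from datetime import date
    import xml.etree.ElementTree as ET

    # Hypothetical recently changed URLs and their last modification dates.
    recently_changed = [
        ("https://www.example.com/blog/new-article/", date(2024, 1, 10)),
        ("https://www.example.com/products/widget/", date(2024, 1, 9)),
    ]

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in recently_changed:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()

    ET.ElementTree(urlset).write("sitemap-hot.xml",
                                 encoding="utf-8", xml_declaration=True)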

HTTP Status 304 - Not Modified

Search engines use conditional requests when crawling the web to minimize the resources spent on unchanged content. HTTP headers in conditional requests such as "If-Modified-Since" and "If-None-Match" allow the website to answer with HTTP status 304 "Not Modified". The 304 status tells the client that a given document has not changed since the last time the client accessed it.

If the client has performed a conditional GET request and access is allowed, but the document has not been modified, the server SHOULD respond with this status code. The 304 response MUST NOT contain a message-body, and thus is always terminated by the first empty line after the header fields.

Answering requests with a 304 status code can be significantly faster than supplying the unchanged content. In addition, the search engine does not need to reindex the content. This can help to increase the crawl rate.
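
A minimal sketch shows how a conditional request with an If-Modified-Since header can be used to check whether a server answers unchanged content with a 304; the URL and date are placeholders.

    import urllib.error
    import urllib.request

    URL = "https://www.example.com/blog/article/"  # placeholder

    request = urllib.request.Request(
        URL,
        headers={"If-Modified-Since": "Wed, 10 Jan 2024 00:00:00 GMT"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            print(f"Status {response.status} - content was re-delivered")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print("Status 304 - content unchanged, no body transferred")
        else:
            raise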

Ping Search Engines on Content Refresh

A good way to tell search engines that your content has changed is to use their ping services. A ping is a push notification telling a service or a search engine that content has changed or new content has been published. From an SEO point of view, two ping mechanisms are especially interesting: XML-RPC push and automated sitemap submission.

XML-RPC-based push mechanism for single documents on blogs

You can send a POST request as an XML-RPC based notification to the ping services of the search engines to notify them about recent content changes on a single document. Most blog systems have a ping feature built in. You can find more information about XML-RPC pings at Wikipedia.org.
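
A minimal sketch using Python's xmlrpc.client could look like this; the ping endpoint, site name, and URL are placeholders, and most blog systems send this kind of ping automatically.

    import xmlrpc.client

    # Placeholder ping endpoint and site data - use the service of your choice.
    PING_ENDPOINT = "http://rpc.example-ping-service.com/"
    SITE_NAME = "Example Blog"
    SITE_URL = "https://www.example.com/"

    server = xmlrpc.client.ServerProxy(PING_ENDPOINT)
    # Classic weblogUpdates.ping signature: blog name and blog URL. Many services
    # also accept weblogUpdates.extendedPing with the changed document's URL.
    result = server.weblogUpdates.ping(SITE_NAME, SITE_URL)
    print(result)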

Submitting a sitemap via an HTTP request

The sitemap protocol specifies that search engines can be notified about sitemap changes using an HTTP request. The mechanism is described in the sitemaps.org protocol.

Whenever one or more documents have changed, you can ping the search engines with the corresponding sitemaps.

We suggest working with a "hot sitemap" and notifying search engines about changes in the sitemap with an HTTP request.
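
A minimal sketch for such a notification follows the ping URL pattern described by the sitemaps.org protocol; the ping endpoint and the sitemap URL are placeholders, and some search engines have meanwhile retired their ping endpoints.

    import urllib.parse
    import urllib.request

    SITEMAP_URL = "https://www.example.com/sitemap-hot.xml"   # placeholder
    PING_ENDPOINT = "https://searchengine.example.com/ping"   # placeholder

    # The sitemaps.org protocol describes a simple GET request with the
    # sitemap location passed as a URL-encoded parameter.
    ping_url = PING_ENDPOINT + "?sitemap=" + urllib.parse.quote(SITEMAP_URL, safe="")
    with urllib.request.urlopen(ping_url) as response:
        print(f"Ping answered with status {response.status}")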

Shift Priority to Relevant URLs

If you have relevant URLs that have a low crawl rate, you might be able to optimize the crawl rate by shifting priority to these URLs.

Shifting priority can be done by changing the internal link structure of your site. This can be done by:

  • Adding links
  • Removing links

If you find that your relevant documents don't have enough internal weight, you can add more links to them. You can also consider removing links to irrelevant documents instead.

Metrics like PageRank and CheiRank can help to identify problems with your internal site structure.
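
As an illustration, a minimal sketch computes PageRank on a tiny hypothetical internal link graph with a few power iterations; on a real site you would feed in a link graph exported from a crawl.

    # Hypothetical internal link graph: page -> pages it links to.
    links = {
        "/": ["/category/", "/blog/", "/about/"],
        "/category/": ["/", "/product-a/", "/product-b/"],
        "/blog/": ["/", "/blog/new-article/"],
        "/blog/new-article/": ["/"],
        "/about/": ["/"],
        "/product-a/": ["/category/"],
        "/product-b/": ["/category/"],
    }

    damping = 0.85
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}

    # A few power iterations are enough for a graph of this size.
    for _ in range(50):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    for page, value in sorted(rank.items(), key=lambda item: -item[1]):
        print(f"{value:.3f}  {page}")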

How we help - Server Logfile Analysis and Website Audits

We can provide assistance in analyzing your server's logfiles. We can extract all relevant data for you, and can also compare it to one of the crawls from the Audisto Crawler. The crawler checks every page for a number of hints to help identify structure issues that could lead to crawl budget waste.

Do you need personal assistance in improving your crawl budget usage? Just contact us, we will be happy to help.


Author