Errors created by the Audisto Crawler
Table Of Contents
- 5xx Server Error
- Connection Abort
- Content Size Too Large
- Error Executing Scripting Condition
- Error Processing Content
- Error Rewriting
- HTML Parser Error
- HTML is too large to be parsed
- HTTP Header Too Large
- Has Unicode Byte Order Mark (BOM)
- Illegal Charset
- Internal Proxy Problems
- Internal error
- Invalid Compression
- Invalid HTTP Request Header
- Invalid HTTP Response Header
- Invalid Location Header
- Invalid MIME Type
- Invalid Transfer Encoding
- Location Ignored By Rewrite
- Malformed URL
- Missing Content-Type HTTP Header
- No Connection
- No HTTP Headers Sent
- Problems Parsing robots.txt
- Problems with SSL
- Redirect Loop
- Redirect to self
- Redirect without valid location header
- Redirects for robots.txt
- Rendering Timed Out
- Rendering: Error Rendering Content
- Rendering: Incomplete Resources
- Required HTTP Header Missing
- Response download took too long
- Retrying After Errors
- SSL: Misconfigured Server Name Indication (SNI)
- Strict and Relaxed Parsing Differ
- Timed Out
- Too Many Redirects
- Too Many Requests
- URL too long to be stored
- Uncategorized Exception
- Unexpected Content Type
- Unexpected HTTP Status Code
- Unknown Host
- XML Sitemap: Error Parsing Content
- XML: Error Parsing Content
Errors
5xx Server Error
Description
This error is triggered if we encounter a 5xx HTTP status code when fetching a URL. 5xx HTTP status codes indicate server errors that might be temporary. We will perform up to two additional tries to download URLs with a 5xx server error.
Importance
If you see this error we managed to establish a successful connection to the server but our request was answered with an HTTP status code indicating a server error.
This usually indicates server problems of the following kind:
- Programming errors
- Servers that are temporarily down or in maintenance mode
- Bad or slow responses from backend servers behind a load balancer or CDN service
- Exceeded resources like storage or bandwidth
- A firewall blocks the traffic from our servers to your servers, but falsely uses an HTTP status code that indicates server errors instead of a status code that indicates access restriction
Not fixing these issues usually has the following effects:
- Search engines often treat server errors as temporary service interruptions and slow down or even pause their crawling. This can lead to crawl budget and indexing issues and outdated content in search results.
- Users abandon your website when seeing server errors. This decreases page views and conversion rates and has a negative impact on your business.
- Search engines will deindex outdated content if multiple tries to update failed. In the long term this will have a negative impact on your rankings and traffic received from search engines.
Note: Within Audisto we handle this type of error as a load indicator as part of our Overload Protection Through Throttling.
Operating Instruction
Try to narrow down the cause. Here is some guidance:
- If multiple tries fail with a 500 HTTP status code it is usually a programming error. Check your server and application logs for more details.
- If only the first try fails with a 502, 503 or 504 HTTP status code and a subsequent request completes, it is likely that you have performance or timeout issues within your infrastructure.
- If multiple tries fail you usually have general availability issues within your infrastructure e.g. due to slow code execution, backend errors, overloaded servers or networking issues.
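The retry behavior described above can be sketched in Python. This is a minimal illustration, not Audisto's implementation; `fetch` is a hypothetical callable that performs the request and returns an HTTP status code:

```python
def fetch_with_retries(fetch, url, max_tries=3):
    """Retry a fetch while the server answers with a 5xx status.

    `fetch` is any callable taking a URL and returning an HTTP status
    code. Up to two additional tries (three in total) mirror the retry
    policy described above.
    """
    status = None
    for _ in range(max_tries):
        status = fetch(url)
        if not 500 <= status <= 599:
            return status  # success or a non-retryable status code
    return status  # still a 5xx after all tries
```

A URL that answers 503, 503, 200 across three tries would thus end up without an error, while one that answers 500 on every try keeps the 5xx status.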
Connection Abort
Description
This error is triggered if the connection to the server was aborted before the request could complete. If we encounter this error we schedule a retry. If the retry is successful previous errors will be marked as resolved. We will perform up to two additional tries to download URLs with a connection abort error.
Importance
If you see this error, we successfully established a connection to the server and requested a resource. The connection then broke before the request could complete, so we could not download the content for the URL.
This usually indicates networking problems of the following kind:
- The server went down and can't be reached anymore
- A backend service had a malfunction and the request could not be answered
- Overloaded servers with insufficient resources to handle the request
- Internal timeouts (e.g. the load balancer stopped waiting for the response from a backend server)
- Network issues like overloaded network interfaces or unstable connections anywhere between your and our servers or within your own infrastructure.
- A firewall blocks the traffic from our servers to your servers
Not fixing these issues usually has the following effects:
- Search engines will treat connection errors as load indicators and slow down their crawling. This can lead to crawl budget and indexing issues and outdated content in search results.
- Users abandon your website because it is not loading. This decreases page views and conversion rates and has a negative impact on your business.
Operating Instruction
Try to narrow down the cause. Here is some guidance:
- Try to access the website in your browser to see if there are connection problems for you as well.
- To test the connectivity from various locations start with using one of the many free services that allow website reachability tests from multiple locations worldwide.
- If only some of your requests show connection abort errors and other requests complete, it is likely that you have overloaded servers, an overloaded or unstable network or internal timeouts.
- If the first requests of a crawl are fine and at some point in time it switches to connection abort errors, you might have a firewall that started to block the traffic.
- If all requests show a connection abort error, a routing problem or firewall block is most likely the issue.
Note: We monitor our network and if we detect general connectivity issues we pause our crawling until the issues are resolved. If you suspect issues for the networking path between our and your servers that can't be resolved on your end feel free to contact us.
Content Size Too Large
Description
This error is thrown if we encounter a file that is too large.
We enforce several limits in content size:
- 100 MiB for textual content, like HTML, XML, CSS or JavaScript
- 100 MiB for images
- 500 MiB for any other kind of content
The content size is the size of the uncompressed HTTP response body, e.g. the size of the HTML, not the size of the compressed HTTP response.
Importance
If you see this error we successfully started downloading a document but canceled the download because the file was too large. The file's content is incomplete and will probably not be analyzed.
Note: Browsers might still be able to handle and display documents of large size.
Operating Instruction
Very large files pose a problem for most users, especially on mobile devices. Ensure everything you serve to users is as small as possible.
Error Executing Scripting Condition
Description
This error is thrown if we encounter problems when executing scripting conditions like clusters or monitoring checks during the crawl.
Importance
This is an internal error that occurred when attempting to execute a scripting condition. This can happen when executing the scripting condition is extremely slow (e.g. inefficient regex, or xpath condition) and we therefore stopped executing it. In that case we mark the URL with the error.
Operating Instruction
Have a look at the scripting issues report of the corresponding crawl. Try to optimize the performance of your scripting conditions.
Feel free to contact us for further assistance.
Error Processing Content
Description
This error is thrown if we encounter problems analyzing a document.
Importance
If you see this error we successfully downloaded and parsed a document, but encountered errors when analyzing it, e.g. when extracting hints.
Note: Browsers might still be able to handle and display the document.
Operating Instruction
This error is rare. It mostly occurs if we process content as HTML, which is not HTML - for example a PDF document that was sent with a content type of HTML. Check if the content type of the document is correct.
If the reasons for this error are not obvious, feel free to contact us.
Error Rewriting
Description
This error is thrown if we encounter problems when applying rewrite rules to URLs we found during the crawl.
Importance
This is an internal error that occurred when attempting to rewrite a URL. This can happen when a regex within your rewrite rules is extremely slow, and we therefore stopped executing it. In that case we mark the URL with the error.
Operating Instruction
Have a look at the scripting issues report of the corresponding crawl. Try to optimize the performance of your rewrite rules e.g. by minimizing the step count.
Feel free to contact us for further assistance.
HTML Parser Error
Description
This error is triggered if parsing a downloaded document as HTML failed. In this case the Content-Type HTTP header indicated that the document is HTML or XHTML, however the downloaded content was not.
Example
HTTP header that indicates an HTML document
Content-Type: text/html; charset=utf-8
HTTP header that indicates an XHTML document
Content-Type: application/xhtml+xml; charset=utf-8
Importance
If you see this error our HTML parser could not identify the downloaded content as HTML.
This usually indicates problems of the following kind:
- Binary data (e.g. an executable file or image) was returned instead of an HTML document
- The HTML document was encoded with a compression (gzip, deflate) but the Content-Encoding HTTP header was missing
- The HTML document was compressed twice e.g. by the application and in addition by the webserver
Note: If you can properly access the affected URL in your browser this often means that your browser managed to guess the Content-Encoding correctly. Bots usually don't have the capability to detect the encoding.
Operating Instruction
Make sure that you send the proper Content-Type HTTP header and the proper Content-Encoding HTTP header.
HTML is too large to be parsed
Description
This error is thrown if we encounter an HTML page that is too large to be parsed. In this case our HTML parser is unable to handle the request.
Importance
If you see this error our HTML parser could not handle the response of the requested URL due to its size.
This usually indicates problems of the following kind:
- The filesize of the document is unusually large
- The DOM tree of the document is unusually complex
Since our servers have significantly more resources than normal home computers, it can be assumed that processing problems can also occur for users of the website.
Users might encounter the following:
- The page can't be loaded at all
- Loading the page takes a very long time
Operating Instruction
Optimize the page, so it can be parsed and rendered with fewer resources. Use tools like PageSpeed Insights to start your optimization.
HTTP Header Too Large
Description
This error is triggered if the HTTP response header was larger than 16 KiB.
Importance
While HTTP does not define any limit for the size of HTTP headers, most web servers limit the size of the headers they accept and process. Servers will return an HTTP status code 413 (Request Entity Too Large) error if the header size exceeds that limit.
The following size limits are common:
| Webserver | Size Limit |
|---|---|
| Apache | 8 KiB |
| Nginx | 4 KiB - 8 KiB |
| IIS | 8 KiB - 16 KiB |
| Tomcat | 8 KiB – 48 KiB |
Sending HTTP headers above 8 KiB can cause problems when passed through load balancers, proxy servers or firewalls and can result in resources being inaccessible to users and bots. Large HTTP headers can also cause significant delays when loading the resource, especially when transmitted using HTTP/1.x, where HTTP headers cannot be compressed during transfer.
Note: Our bot accepts HTTP headers up to 16 KiB, larger headers result in an error "HTTP Header Too Large". If we encounter HTTP response headers larger than 8 KiB we mark the corresponding URLs with the hint "HTTP headers more than 8 KiB in size".
Operating Instruction
We suggest that you drastically reduce the size of the HTTP header by omitting headers that are irrelevant for the receiving client.
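To estimate how close your response headers come to the limits above, you can approximate their on-the-wire HTTP/1.x size. This is a rough sketch that ignores the status line and assumes one value per header name:

```python
def header_size(headers):
    """Approximate the serialized size of HTTP/1.x response headers:
    each header is sent as 'Name: value\r\n' (name, ': ', value, CRLF).
    """
    return sum(len(name) + 2 + len(value) + 2
               for name, value in headers.items())
```

Comparing the result against the 8 KiB hint threshold and the 16 KiB error threshold mentioned above shows which responses are at risk.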
Has Unicode Byte Order Mark (BOM)
Description
This error is thrown if we encounter a robots.txt file with a Unicode Byte Order Mark (BOM).
Importance
There are different documents that specify how to control web crawlers with a robots.txt file. The 1997 Internet Draft specification A Method for Web Robots Control allows only US-ASCII characters. The Internet-Draft Robot Exclusion Protocol from July 01, 2019, also allows UTF8 characters, but no control characters like the BOM.
While most crawlers detect and simply ignore the BOM, some crawler implementations might treat it as part of the first line of the robots.txt file. The directives in the first line might then be interpreted incorrectly.
Operating Instruction
Encode your robots.txt file without a Unicode Byte Order Mark (BOM). Use only characters that are US-ASCII, which is a true subset of UTF-8, to maximise compatibility with older crawlers.
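Detecting and removing a UTF-8 BOM from a robots.txt payload is straightforward; a minimal sketch:

```python
import codecs

def strip_bom(raw):
    """Detect and remove a leading UTF-8 BOM (EF BB BF) from a
    robots.txt payload. Returns (body, had_bom)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):], True
    return raw, False
```

Run this against the raw bytes of your robots.txt file; if `had_bom` is True, re-save the file without a BOM.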
Illegal Charset
Description
This error is triggered if fetching a URL failed due to an invalid or unsupported charset within the HTTP header. In this case our HTTP client is unable to handle the request.
Example
HTTP header that indicates an HTML document with an incorrect charset
Content-Type: text/html; charset=foo
Importance
If you see this error, our HTTP client could not handle the response of the requested URL due to an invalid or unsupported charset. If no charset is given, the default charset is considered to be ISO-8859-1.
This usually indicates problems of the following kind:
- There is a typo or syntax error in the Content-Type HTTP header
- The charset in the Content-Type HTTP header is uncommon and not supported by our HTTP client library
In addition to this error we mark the corresponding URL with the hint "Charset invalid in header"
Operating Instruction
Make sure that you send the proper Content-Type HTTP header including the MIME type and charset for all URLs. Also make sure to have the proper Content-Encoding HTTP header.
In general, you should configure your HTTP server to send a default charset for the whole server, specific directories or MIME types.
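Whether a charset name would be accepted by an HTTP client can be approximated by checking it against a codec registry. This sketch uses Python's registry, which does not match our HTTP client's supported set exactly but catches typos like the `charset=foo` example above:

```python
import codecs

def charset_is_supported(charset):
    """Check whether a charset name from a Content-Type header is
    known to Python's codec registry. Unknown or misspelled names
    are what typically triggers an 'Illegal Charset' style error."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False
```

Common values like `utf-8` and `ISO-8859-1` pass; a typo such as `foo` does not.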
Internal Proxy Problems
Description
This error is displayed if a fetch of a URL failed due to internal proxy problems. We schedule a retry for this type of error. If the retry is successful previous errors will be marked as resolved. We will perform up to two additional tries to download URLs that failed due to internal proxy problems.
Importance
This is an internal error that should not occur, and we monitor the occurrence of this error.
Operating Instruction
If you see this type of error you can expect us to take actions to fix the underlying problem in our infrastructure within a short period of time.
Contact us, if you experience major crawl problems in combination with this error.
Internal error
Description
This error is displayed if processing a URL failed in an unexpected way that is currently not properly handled by us. In this case we did not anticipate such an error, and we currently have no way to recover from it or to give additional information.
Importance
Internal errors can happen due to the complexity of the software involved, but are very rare. We monitor the occurrence of these internal errors and add proper error handling to our software as soon as possible.
Operating Instruction
If you see this type of error you can expect us to add proper error handling within our software in a short period of time. Once we change our software the error should be gone in all new crawls.
Contact us, if you experience major crawl problems in combination with this error, so we can prioritize adding a proper error handling for this over other development tasks.
Invalid Compression
Description
This error is thrown if we encounter a response that can't be decompressed. We currently support brotli, gzip and deflate compression and state this in the Accept-Encoding HTTP header.
Importance
If you see this error we successfully established a connection to the server and requested a resource. We received content but can't decompress it.
There are multiple possible reasons:
- The response has a Content-Encoding HTTP header indicating a compression different from the supported encodings in our Accept-Encoding HTTP request header.
- The response does not have the encoding stated in the Content-Encoding HTTP header and therefore the decompression fails. The content is uncompressed or compressed with another encoding.
Operating Instruction
Make sure that compressed content is encoded in a format supported by the client as stated in the Accept-Encoding HTTP request header and indicate that encoding by using a proper Content-Encoding HTTP header in the response.
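The failure mode can be reproduced with a small decoder that trusts the Content-Encoding header, as a client does. This is an illustrative sketch covering gzip and deflate (brotli is omitted as it is not in the standard library):

```python
import gzip
import zlib

def decompress_body(body, content_encoding):
    """Decompress an HTTP body according to its declared
    Content-Encoding. Raises an error when the declared encoding does
    not match the actual bytes -- the 'Invalid Compression' case."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        return zlib.decompress(body)
    if encoding in ("identity", ""):
        return body
    raise ValueError("unsupported Content-Encoding: " + content_encoding)
```

A body that is declared `gzip` but actually uncompressed fails immediately with a decompression error, which is exactly what a strict client sees.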
Invalid HTTP Request Header
Description
This error is thrown if we encounter an invalid HTTP request header when fetching a URL.
Importance
Within Audisto we allow modifying HTTP request headers. This can be useful to test different versions of the website e.g. by sending a cookie on each request or different language accept headers or user-agents.
If the user input results in invalid HTTP request headers that are rejected by our HTTP client, we throw this error.
Operating Instruction
Make sure to check the HTTP headers added to the crawl. Feel free to contact us, if you need further assistance.
Invalid HTTP Response Header
Description
This error is thrown if we encounter an invalid HTTP response header when fetching a URL. If we encounter this error we schedule a retry. If the retry is successful previous errors will be marked as resolved.
Importance
With HTTP headers there are limitations on which characters or values are allowed. Using characters that are not allowed or using invalid header values triggers this error.
This error is also triggered if a header occurs multiple times but should be unique, e.g. a Content-Length header.
Operating Instruction
Make sure to check your HTTP response headers for duplicate headers, characters that are not allowed and invalid values.
Invalid Location Header
Description
This error is thrown if we encounter a redirect with an invalid location header. This means the location header does not exist, is empty or contains an invalid URL.
Examples
Location:
Location: httttps:://foo.bar/
Importance
If you see this error we managed to establish a connection to the server, got a response with a redirect HTTP status code but the location header was invalid, and we therefore can't determine a target for the redirect.
This means the redirect is a dead end for bots and users. Users will see a blank screen and likely leave the website.
This usually indicates programming errors.
Operating Instruction
Make sure to change the redirect to send a valid HTTP location header.
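A basic validity check for a Location header can be sketched as follows. Note that RFC 7231 also permits relative Location values; this simplified check only accepts absolute http(s) URLs, which is the safest form to send:

```python
from urllib.parse import urlparse

def location_is_valid(location):
    """Check that a Location header exists, is non-empty and parses
    into an absolute http(s) URL with a host. Missing, empty or
    garbled values (like 'httttps:://foo.bar/') fail."""
    if not location or not location.strip():
        return False
    parsed = urlparse(location.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Both examples above (an empty header and the malformed `httttps:://foo.bar/`) fail this check.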
Invalid MIME Type
Description
This error is thrown if we encounter a MIME Type that does not follow the naming convention defined in section 4.2 of RFC 6838 or where the top-level type is not within the following list:
- application
- audio
- font
- example
- image
- message
- model
- multipart
- text
- video
Importance
If you see this error we successfully downloaded a document but the MIME Type is invalid, and we therefore don't know how to handle the document.
Note: Browsers might still be able to handle and display documents with an invalid MIME type as they usually have algorithms for MIME type sniffing. MIME type sniffing can however be disabled with an X-Content-Type-Options: nosniff HTTP header.
Operating Instruction
Make sure to use a valid MIME type for all documents.
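The validation described above can be sketched with a simplified version of the restricted-name grammar from RFC 6838 section 4.2 (parameters such as `charset` are out of scope here):

```python
import re

# Top-level types accepted, as listed in the section above.
TOP_LEVEL_TYPES = {
    "application", "audio", "example", "font", "image",
    "message", "model", "multipart", "text", "video",
}

# Simplified restricted-name grammar from RFC 6838 section 4.2:
# first char alphanumeric, then up to 126 chars from the allowed set.
_NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}"
_MIME_RE = re.compile(r"^(" + _NAME + r")/(" + _NAME + r")$")

def mime_type_is_valid(mime):
    """Check a MIME type string against the grammar and the list of
    accepted top-level types."""
    match = _MIME_RE.match(mime)
    return bool(match) and match.group(1).lower() in TOP_LEVEL_TYPES
```

For example, `text/html` and `application/xhtml+xml` pass, while a made-up top-level type like `foo/bar` or a type without a subtype fails.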
Invalid Transfer Encoding
Description
This error is obsolete and we no longer throw it. It may still appear in old crawls.
Location Ignored By Rewrite
Description
This error is thrown if we encounter a URL with a redirect status code and the target location is set to be ignored by a rewrite rule.
Importance
Within Audisto we allow rewriting URLs to perform structural simulations or to ignore parts of the crawled websites. While ignoring links within HTML documents is fine, a redirect should always have a Location header field.
Operating Instruction
Consider also ignoring the URL with the redirect status code or change the rewrite from "Ignore" to "Do not Crawl".
Malformed URL
Description
This error is triggered if fetching a URL failed due to a malformed URL. In this case we did not anticipate that this URL would be a problem for our crawler and tried downloading it anyway.
Importance
Malformed URLs are quite common, and usually we don't store them and don't try to download them. Instead, we mark all URLs that contain links to malformed URLs with corresponding hints. We also show the malformed URLs in the live analysis of the marked URLs.
We monitor the occurrence of this error, and we will adjust our URL handling to handle the URLs that triggered this.
Operating Instruction
If you see this type of error you can expect us to take actions to handle this properly within a short period of time. After that this type of error should not show up in new crawls anymore.
Contact us, if you experience major crawl problems in combination with this error, so we can prioritize adding a proper handling for this over other development tasks.
Missing Content-Type HTTP Header
Description
This error is thrown if the Content-Type HTTP response header is missing when fetching a URL.
Importance
The Content-Type HTTP response header indicates the media type of the response, which is usually used by the client to process the content. Without this header it is unclear how the response should be processed.
Some HTTP clients (especially web browsers) might perform content sniffing to detect the media type but this might also be forbidden by an X-Content-Type-Options header with a nosniff value.
Operating Instruction
Make sure to set a proper Content-Type HTTP response header for all affected URLs.
No Connection
Description
This error is triggered if we can't establish a connection to the server within 15 seconds. If we encounter this error we schedule a retry. If the retry is successful, previous errors will be marked as resolved. We will perform up to two further tries to download this URL.
Importance
If you see this error we failed to establish a connection to the server within 15 seconds and therefore could not request the content for the URL.
This usually indicates networking problems of the following kind:
- The server is down and can't be reached
- Overloaded servers with insufficient resources to handle the request
- Network issues like overloaded network interfaces or unstable connections anywhere between your and our servers or within your own infrastructure.
- A firewall blocks the traffic from our servers to your servers
Not fixing these issues usually has the following effects:
- Search engines will treat connection errors as load indicators and slow down their crawling. This can lead to crawl budget and indexing issues and outdated content in search results.
- Users abandon your website because it is not loading. This decreases page views and conversion rates and has a negative impact on your business.
Note: Within Audisto we handle this error as a load indicator as part of our Overload Protection Through Throttling.
Operating Instruction
Try to narrow down the cause. Here is some guidance:
- Try to access the website in your browser to see if there are connection problems for you as well.
- To test the connectivity from various locations start with using one of the many free services that allow website reachability tests from multiple locations worldwide.
- If only some of your requests show no connection errors and other requests complete, it is likely that you have overloaded servers or an overloaded or unstable network.
- If the first requests of a crawl are fine and at some point in time it switches to no connection errors, you might have a firewall that started to block the traffic.
- If all requests show a no connection error a routing problem or firewall block is most likely the issue.
Note: We monitor our network and if we detect general connectivity issues we pause our crawling until the issues are resolved. If you suspect issues for the networking path between our and your servers that can't be resolved on your end feel free to contact us.
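A first diagnostic step is a plain TCP connection attempt with a timeout, independent of HTTP. This sketch mirrors the 15 second connect limit described above; for a real check you would point it at your host and port 443 or 80:

```python
import socket

def can_connect(host, port, timeout=15.0):
    """Try to open a plain TCP connection within `timeout` seconds.
    Returns False on refusal, unreachable routes or a connect timeout
    -- the situations behind a 'No Connection' error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this fails from one network but succeeds from another, a firewall or routing problem on the failing path is the most likely cause.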
No HTTP Headers Sent
Description
This error is triggered if fetching a URL failed because no HTTP headers were sent. In this case we received neither an HTTP status code nor any HTTP response headers.
Importance
If you see this error we managed to establish a connection to the server but did not receive a proper response.
This usually indicates problems of the following kind:
- Fundamental server misconfiguration
- Fundamental programming errors
- A firewall blocks the traffic from your servers to our servers
Operating Instruction
Try to narrow down the cause. Here is some guidance:
- Check your server and application error logs
- Check your firewall configuration and consider allowing traffic from our bots
Problems Parsing robots.txt
Description
This error is triggered if we encountered problems parsing the robots.txt file.
Importance
If you see this error we were able to request the robots.txt file and the HTTP status code was 200, but we were unable to process the robots.txt file due to parser errors.
This usually indicates problems of the following kind:
- Instead of text/plain the response was HTML or a binary file
- The content was compressed but a Content-Encoding HTTP header indicating the compression was missing
We handle this as an unreachable status for the robots.txt, and therefore we will assume that crawling within the authority of that robots.txt file is completely denied.
Operating Instruction
Make sure to return a proper, parsable robots.txt file for the protocol/host combination with a status code 200, a correct Content-Type HTTP header and if compressed also a correct Content-Encoding HTTP header.
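The requirements above can be checked before parsing with a small sanity check on the response metadata. This hypothetical helper ignores the body itself and only validates status code and media type:

```python
def robots_txt_is_parsable(status, content_type):
    """Sanity-check a robots.txt response before parsing: HTTP status
    200 and a text/plain media type, as required above. Parameters
    after the media type (e.g. charset) are allowed."""
    if status != 200 or not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"
```

A robots.txt served as `text/html`, or with any status other than 200, fails this check and would be treated as unparsable.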
Problems with SSL
Description
This error is thrown if we encounter SSL problems when accessing the URL. This type of error can partially be turned off if SSL handling is set to Relaxed within the advanced crawl settings.
Importance
Our HTTP client encountered an SSL exception when connecting to the server.
This usually indicates SSL problems of the following kind:
- The SSL Certificate has expired
- The SSL Certificate is self-signed
- The SSL Certificate chain is broken
- The SSL Certificate is issued to wrong host
- The TLS protocol version is deprecated (TLSv1.1 and earlier)
Browsers usually block requests with SSL problems and show a security warning to the user.
Operating Instruction
Consider setting the SSL handling of our crawler to relaxed if you need to crawl a host without valid SSL certificates.
Check your SSL certificates and security settings with specialized tools like Qualys SSL Labs and correct all problems.
Redirect Loop
Description
This error is obsolete and we no longer throw it. It may still appear in old crawls.
Redirect Loops are marked through hints instead.
Redirect to self
Description
This error is thrown if we encounter a URL with a redirect status code that redirects to itself.
Importance
There is an obvious problem with the target location of the redirect resulting in an endless loop. Browsers will show an error message indicating this error to the users and the user will most likely bounce.
Operating Instruction
Correct the location of the redirect to point to a valid target with a 200 status code. Consider changing the incoming links of the redirecting URL to avoid unnecessary latencies for users.
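Detecting a redirect to self amounts to resolving the Location header against the requested URL and comparing the result. A minimal sketch:

```python
from urllib.parse import urljoin

def redirects_to_self(url, location):
    """Resolve the Location header (absolute or relative) against the
    requested URL and check whether the redirect points back to the
    same URL -- an endless loop for any client that follows it."""
    return urljoin(url, location) == url
```

This catches both absolute self-references and relative ones like `Location: /a` sent for the URL `https://example.com/a`.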
Redirect without valid location header
Description
This error is obsolete and we no longer throw it. It may still appear in old crawls.
Redirects for robots.txt
Description
This error is obsolete and we no longer throw it. It may still appear in old crawls.
Rendering Timed Out
Description
This error is triggered if the rendering of an HTML page did not produce a result within 40 seconds. After 40 seconds we abort the rendering and schedule a retry. If the retry is successful, previous errors will be marked as resolved. We will perform up to two additional tries to download URLs with a timeout.
Importance
If you see this error we managed to request an HTML page from the server and handed it over to the renderer. During the rendering additional resources might be requested or intense computation tasks need to be performed and those tasks did not finish within 40 seconds.
This usually indicates performance problems within your infrastructure or JavaScript code of the following kind:
- Slow code execution for additional requests e.g. caused by inefficient code or database queries
- Overloaded servers with insufficient resources to handle the request
- Network issues like overloaded network interfaces or unstable connections
- Poor performance of JavaScript code
Note: If used, 3rd party infrastructure like ad servers, tracking services or content delivery networks can also cause this issue.
Not fixing these issues usually has the following effects:
- Search engines will treat timeouts as load indicators and slow down their crawling. This can lead to crawl budget and indexing issues and outdated content in search results.
- Users abandon your website while it is still loading. This decreases page views and conversion rates and has a negative impact on your business.
- Search engines will use the poor performance metrics measured as a field metric in the browser of your users to adjust your rankings. In the long term this will have a negative impact on your rankings and traffic received from search engines.
Operating Instruction
Try to narrow down the cause. Here is some guidance:
- Check if you enforce any bandwidth limitations within your infrastructure, if so consider lifting them.
- Check if network interfaces within your infrastructure are overloaded, if so consider upgrading to a faster connection. If static files are affected consider using a CDN to host and/or deliver them.
- If only the first try times out and a subsequent request is fast and completes, it is likely that you have performance issues with uncached resources.
- If all the subsequent tries also time out or show a high response time you most likely have issues with slow code execution or overloaded servers.
- If you see slow response times only for certain content types you most likely have issues with slow code execution or specific services within your infrastructure. You can compare the response times of resources that are usually static like CSS, JS and images with the response times of dynamic HTML pages e.g. by creating clusters for those MIME types.
- If you assume networking issues outside your infrastructure, you should start with using one of the many free services that allow website reachability tests from multiple locations worldwide.
Note: We monitor our network and if we detect general connectivity issues we pause our crawling and rendering until the issues are resolved. If you suspect issues for the networking path between our and your servers feel free to contact us.
Rendering: Error Rendering Content
Description
This error is thrown if we encounter problems when rendering a URL.
Importance
This error can occur during our JavaScript execution and rendering process. During this process we hand over the downloaded content to a rendering instance with a headless web browser. If anything fails during rendering we mark the URL with this error. We will perform up to two additional tries to render URLs that failed.
This usually indicates problems within our rendering infrastructure of the following kind:
- The connection to the rendering instance was lost
- The renderer ran into an error when processing the request
Operating Instruction
If you see this type of error you can expect us to take actions to fix the underlying problem in our infrastructure within a short period of time.
Contact us if you experience major crawl problems in combination with this error.
Rendering: Incomplete Resources
Description
This error occurs if a resource has a recoverable error (HTTP status 5xx, timeouts etc.) during direct rendering. This error is always recoverable and appears only temporarily while we retry requesting the resource. If the problem persists, the URL is marked with the hint "Rendering: Incomplete Resources" and its status is set to "Crawled".
The link to the resource is marked as "Forces Rendering Retry", and the resource is downloaded up to two more times until the error either recovers or becomes permanent. After that, another rendering is tried.
The number of renderings is limited. If the last rendering still contains previously unknown resources that produce recoverable errors, the site is treated as successfully crawled, and is assigned the hint "Rendering: Incomplete Resources" instead of this error.
Examples
Temporary error fetching a resource:
- An HTML page contains JavaScript that is executed during rendering
- During the execution a resource URL is fetched and returns an HTTP Status of 500 - Server Error
- The HTML page is set to error "Rendering: Incomplete Resources"
- The resource URL is retried and turns to HTTP status 200 - OK
- The HTML page is rendered again, now successful, and set to status "Crawled"
Permanent error fetching a resource:
- An HTML page contains JavaScript that is executed during rendering
- During the execution a resource URL is fetched and returns an HTTP Status of 500 - Server Error
- The HTML page is set to error "Rendering: Incomplete Resources"
- The resource URL is retried multiple times but still returns HTTP status 500
- The HTML page is rendered again, now successfully, since the error is now treated as permanent, and set to status "Crawled"
Temporary errors fetching a resource with cache buster:
- Rendering 1: An HTML page contains JavaScript that is executed during rendering
- During the execution a resource URL is fetched and returns an HTTP Status of 500 - Server Error
- The HTML page is set to error "Rendering: Incomplete Resources"
- The resource URL is retried and turns to HTTP status 200 - OK
- Rendering 2: The HTML page is rendered again, but now another resource URL is fetched, which also returns an HTTP status of 500
- The HTML page is set to error "Rendering: Incomplete Resources" again
- The new resource URL is retried and turns to HTTP status 200 - OK
- Rendering 3: The HTML page is rendered for the last time, and yet another resource URL is fetched, which also returns an HTTP status of 500
- The HTML page is set to status "Crawled" again, a hint "Rendering: Incomplete Resources" is added
- The last resource URL is retried and turns to HTTP status 200 - OK
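The retry behaviour in the examples above can be sketched roughly as follows. This is a simplified model, not the crawler's actual implementation; `fetch` is a placeholder returning an HTTP status code:

```python
def is_recoverable(status):
    """5xx responses (and, in the real crawler, timeouts) count as recoverable."""
    return status >= 500

def resolve_resource(fetch, max_retries=2):
    """Fetch a resource and retry a recoverable error up to two more times.
    The final status is then either recovered or treated as permanent."""
    status = fetch()
    for _ in range(max_retries):
        if not is_recoverable(status):
            break
        status = fetch()  # the link is marked "Forces Rendering Retry"
    return status

# Temporary error: 500 on the first try, 200 on the retry -> recovered,
# the page is rendered again and set to status "Crawled".
responses = iter([500, 200])
assert resolve_resource(lambda: next(responses)) == 200

# Permanent error: still 500 after all retries -> the page is rendered
# again anyway, since the error is now permanent.
responses = iter([500, 500, 500])
assert resolve_resource(lambda: next(responses)) == 500
```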
Importance
Resources that have temporary or permanent errors are a problem, since they prevent the HTML page from being rendered as expected.
Operating Instruction
Fix the cause of temporary or permanent errors.
Required HTTP Header Missing
Description
This error is thrown if a required HTTP header was missing in the response.
Importance
Whenever a server answers with an HTTP status code 401 a WWW-Authenticate header field must be sent in the response, as specified in RFC 7235.
Operating Instruction
Make sure to include all required headers in your HTTP response.
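A minimal way to spot this particular case from your side, a 401 response without the WWW-Authenticate challenge required by RFC 7235, is a small diagnostic sketch using only the standard library:

```python
import urllib.request
import urllib.error

def missing_auth_challenge(url, timeout=15):
    """Return True if the URL answers 401 without a WWW-Authenticate header."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            # RFC 7235 requires a WWW-Authenticate header on every 401 response
            return err.headers.get("WWW-Authenticate") is None
    return False

# Example usage (placeholder URL):
# print(missing_auth_challenge("https://example.com/protected"))
```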
Response download took too long
Description
This error is thrown if the download of a requested file started, but took too long to finish. After 15 seconds we abort the request and schedule a retry. If the retry is successful previous errors will be marked as resolved. We will perform up to 2 additional tries to download URLs with a timeout.
Importance
If you see this error we managed to establish a successful connection to the server, requested the file, started getting data from your server, but downloading the content took a long time and did not finish within 15 seconds. Although we enforce limitations for content size, those limits were not reached.
This usually indicates performance problems of the following kind:
- Bandwidth limitations for individual requests or clients enforced by your infrastructure
- Saturation of network interfaces between your infrastructure and our servers
- Unstable connections anywhere within your infrastructure or between your and our servers
- Slow code execution e.g. caused by inefficient code or database queries
- Bandwidth limitations on our end due to saturation of the network interface on our crawl server
Not fixing these issues usually has the following effects:
- Search engines will treat slow downloads as load indicators and slow down their crawling. This can lead to crawl budget and indexing issues and outdated content in search results.
- Users abandon your website while it is still loading. This decreases page views and conversion rates and has a negative impact on your business.
- Search engines will use the poor performance metrics measured as a field metric in the browser of your users to adjust your rankings. In the long term this will have a negative impact on your rankings and traffic received from search engines.
Operating Instruction
Try to narrow down the cause. Here is some guidance:
- Check if you enforce any bandwidth limitations within your infrastructure, if so consider lifting them.
- Check if network interfaces within your infrastructure are overloaded, if so consider upgrading to a faster connection. If static files are affected consider using a CDN to host and/or deliver them.
- If only the first try times out and a subsequent request is fast and completes, it is likely that you have performance issues with uncached resources.
- If all the subsequent tries also time out or show a high response time you most likely have issues with slow code execution or overloaded servers.
- If you see slow response times only for certain content types you most likely have issues with slow code execution or specific services within your infrastructure. You can compare the response times of resources that are usually static like CSS, JS and images with the response times of dynamic HTML pages e.g. by creating clusters for those MIME types.
- If you assume networking issues outside your infrastructure, you should start with using one of the many free services that allow website reachability tests from multiple locations worldwide.
Note: We monitor our network and if we detect general connectivity issues we pause our crawling until the issues are resolved. If you suspect issues for the networking path between our and your servers feel free to contact us.
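The guidance above distinguishes slow generation from slow delivery. One way to separate the two is to split a request into time to first byte and the remaining download time. This is an approximation (client-side buffering can blur the split) and the URL in the usage example is a placeholder:

```python
import time
import urllib.request

def response_profile(url, timeout=15, chunk_size=65536):
    """Return (time_to_first_byte, download_time) in seconds for a URL.
    A slow first byte points at slow code execution; a slow download
    afterwards points at bandwidth limits or network saturation."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read(1)  # block until the first body byte arrives
        first_byte = time.monotonic() - start
        while response.read(chunk_size):
            pass  # drain the rest of the body
    total = time.monotonic() - start
    return first_byte, total - first_byte

# Example usage (placeholder URL):
# ttfb, download = response_profile("https://example.com/large-file")
```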
Retrying After Errors
Description
This error is displayed when a previous fetch of a URL failed, and it is currently scheduled to be fetched again. This error is an intermediate state that changes after re-downloading the page.
Importance
It is expected that a certain number of requests fail with errors due to various reasons. However, the percentage of failed requests should be very low. Most errors can be recovered from by just trying again. If you see this error we scheduled the URL to be fetched again. If you see this error at the end of a crawl, the crawl terminated because of crawl limits before all URLs could be retried; in this case your project size is too small to crawl all URLs.
Operating Instruction
Have a look at the error history for URLs marked with this error. Make sure to fix the underlying problems.
Consider upgrading the project if you see URLs marked with this error at the end of a crawl.
SSL: Misconfigured Server Name Indication (SNI)
Description
This error is thrown if we encounter a misconfigured Server Name Indication (SNI) when accessing the URL. This type of error can be turned off if SSL handling is set to Relaxed within the advanced crawl settings.
Importance
Our HTTP client encountered an SSL exception with a misconfigured Server Name Indication (SNI) when connecting to the server.
This usually indicates that the SSL certificate is issued to the wrong host.
Browsers usually block requests with SSL problems and show a security warning to the user.
Operating Instruction
Consider setting the SSL handling of our crawler to relaxed if you need to crawl a host without valid SSL certificates.
Check your SSL certificates and security settings with specialized tools like Qualys SSL Labs and correct all problems.
Strict and Relaxed Parsing Differ
Description
This error is thrown when strict and relaxed parsing of the robots.txt return different results. The robots.txt can be interpreted differently.
Example
Original robots.txt
User-agent: *
Disallow: /
Relaxed parsing result
User-agent: *
Disallow: /
Strict parsing result
User-agent: *
Disallow:
Importance
The relaxed implementation of parsing and handling a robots.txt file is based on the Internet-Draft Robot Exclusion Protocol from July 01, 2019. The strict parsing mode is based on the original 1994 A Standard for Robot Exclusion document and the 1997 Internet Draft specification A Method for Web Robots Control.
With differing parsing results the robots.txt is ambiguous and different web crawlers can show completely different crawl behaviour as shown in the example above.
Operating Instruction
Rewrite the robots.txt file to produce identical results for the relaxed and strict parsing method. Take a look at our guide on How to write a good robots.txt to learn about common problems and learn to avoid them.
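The practical impact of the two parse results above can be demonstrated with Python's built-in robots.txt parser (which interpretation a given crawler matches is exactly the ambiguity at issue):

```python
from urllib import robotparser

def is_allowed(robots_lines, url="https://example.com/page"):
    """Check whether any user agent may fetch a URL under the given robots.txt."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch("*", url)

# The relaxed parsing result above blocks the whole site ...
assert not is_allowed(["User-agent: *", "Disallow: /"])
# ... while the strict parsing result (an empty Disallow) allows all of it.
assert is_allowed(["User-agent: *", "Disallow:"])
```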
Timed Out
Description
This error is triggered if downloading a requested file did not produce a response within 15 seconds. After 15 seconds we abort the request and schedule a retry. If the retry is successful previous errors will be marked as resolved. We will perform up to 2 additional tries to download URLs with a timeout.
Importance
If you see this error we managed to establish a successful connection to the server, requested the file, but did not receive any response from the server within 15 seconds.
This usually indicates performance problems within your infrastructure of the following kind:
- Slow code execution e.g. caused by inefficient code or database queries
- Overloaded servers with insufficient resources to handle the request
- Network issues like overloaded network interfaces or unstable connections
Not fixing these issues usually has the following effects:
- Search engines will treat timeouts as load indicators and slow down their crawling. This can lead to crawl budget and indexing issues and outdated content in search results.
- Users abandon your website while it is still loading. This decreases page views and conversion rates and has a negative impact on your business.
- Search engines will use the poor performance metrics measured as a field metric in the browser of your users to adjust your rankings. In the long term this will have a negative impact on your rankings and traffic received from search engines.
Operating Instruction
Try to narrow down the cause. Here is some guidance:
- If only the first try times out and a subsequent request is fast and completes, it is likely that you have performance issues with uncached resources.
- If all the subsequent tries also time out or show a high response time you most likely have issues with slow code execution or overloaded servers.
- If you see slow response times only for certain content types you most likely have issues with slow code execution or specific services within your infrastructure. You can compare the response times of resources that are usually static like CSS, JS and images with the response times of dynamic HTML pages e.g. by creating clusters for those MIME types.
- If you assume networking issues you should start with using one of the many free services that allow website reachability tests from multiple locations worldwide.
Note: We monitor our network and if we detect general connectivity issues we pause our crawling until the issues are resolved. If you suspect issues for the networking path between our and your servers feel free to contact us.
Too Many Redirects
Description
This error is thrown when we encounter a redirect that is preceded by a chain of 20 or more redirects. We do not follow more than 20 redirects in a row.
Example
The 21st redirect in a chain will be marked with the error.
1 -> 2 -> 3 -> 4 -> 5 -> 6 -> ... -> 18 -> 19 -> 20 -> 21 -> ...
Importance
If you see this error we found a redirect chain with more than 20 redirects in a row. We intentionally break the chain in this case.
Long chains like this usually indicate programming errors or conceptual errors and cause unnecessary requests and latencies with negative impacts on user experience.
In general there should not be any redirects within the internal link graph of a website. All links should be changed to point to the target URLs of the redirect chains.
Additionally, if you retrieve data via JavaScript's Fetch API, no more than 20 redirects are followed, as stated in the Fetch Living Standard, so your application may break.
Operating Instruction
Change all redirect targets within the chain to point directly to a URL with a 200 HTTP status code. In addition, change all links pointing towards URLs with redirect status codes to point to the target URLs with a 200 status code.
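The redirect limit described above can be sketched as follows; `follow` is a hypothetical helper that walks a chain of redirect targets and gives up once a 21st redirect would be needed:

```python
MAX_REDIRECTS = 20  # the crawler does not follow more than 20 redirects in a row

def follow(redirects, start, limit=MAX_REDIRECTS):
    """Walk a chain of redirects (a mapping of URL -> target URL) and return
    the final non-redirecting URL, or raise once the limit is exceeded."""
    url = start
    hops = 0
    while url in redirects:
        if hops == limit:
            raise RuntimeError("Too Many Redirects")
        url = redirects[url]
        hops += 1
    return url

# A chain of exactly 20 redirects is still followed to its target ...
chain_20 = {f"/page{i}": f"/page{i + 1}" for i in range(20)}
assert follow(chain_20, "/page0") == "/page20"

# ... but the 21st redirect in a longer chain triggers the error.
chain_21 = {f"/page{i}": f"/page{i + 1}" for i in range(21)}
try:
    follow(chain_21, "/page0")
except RuntimeError as err:
    print(err)  # Too Many Redirects
```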
Too Many Requests
Description
This error is triggered if we encounter a 429 HTTP status code when fetching a URL. A 429 HTTP status code indicates that the server is overloaded or the client exceeded the allowed number of requests within a timeframe. We will perform up to two additional tries to download URLs with a 429 HTTP status code.
Importance
If you see this error we managed to establish a successful connection to the server but our request was answered with an HTTP status code indicating too many requests within a timeframe.
This usually indicates problems of the following kind:
- The crawl speed settings conflict with the rate limiting setting enforced by the server or firewall
- Your servers are temporarily overloaded e.g. because they currently have to handle a high number of requests.
Not fixing these issues usually has the following effects:
- Search engines often treat 429 HTTP status codes as a directive to slow down or even pause their crawling. This can lead to crawl budget and indexing issues and outdated content in search results.
- Users abandon your website when seeing 429 error pages. This decreases page views and conversion rates and has a negative impact on your business.
- Search engines will deindex outdated content if multiple tries to update failed. In the long term this will have a negative impact on your rankings and traffic received from search engines.
Note: Within Audisto we handle this type of error as load indicator as part of our Overload Protection Through Throttling. We will slow down our crawl and try again.
Operating Instruction
If only the Audisto Crawler is affected, consider adjusting the crawl speed settings within Audisto or the rate limiting of your servers or firewall.
If all clients are affected consider scaling your infrastructure.
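The throttling reaction described in the note above can be sketched client-side as well: back off and retry when a 429 arrives. `fetch` is a placeholder for your own request function returning a status code and a header mapping:

```python
import time

def fetch_with_backoff(fetch, max_tries=3, base_delay=1.0):
    """Retry a request that answers 429 Too Many Requests, waiting either
    the server's Retry-After value or an exponentially growing delay."""
    status, headers = fetch()
    for attempt in range(max_tries - 1):
        if status != 429:
            break
        # Honor the server's Retry-After header if present, otherwise back off
        delay = float(headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
        status, headers = fetch()
    return status

# A fake server that recovers on the second try:
responses = iter([(429, {"Retry-After": "0"}), (200, {})])
assert fetch_with_backoff(lambda: next(responses)) == 200
```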
URL too long to be stored
Description
This error is thrown if we encounter a URL that is too long to be stored. We only store URLs up to a length of 60,000 characters. Longer URLs will be shortened and 3 dots will be appended. We will not crawl the URL but mark it with this error and we will also mark the URL with a hint "URL too long for some browsers".
Importance
While theoretically there is no limit on the length of a URL, not all clients and web applications can process long URLs. Some browsers are unable to handle URLs with more than 2,000 characters. Some web applications might not be able to resolve the URLs and/or shorten them automatically, causing issues with access to these URLs.
If you see this error we extracted a very long URL from the parsed content. This usually indicates problems of the following kind:
- The markup of the document is fundamentally broken and therefore content that should not be considered as a URL is extracted as a URL.
- A programming error or conceptual error causes your application to generate very long URLs
Operating Instruction
Make sure all your URLs stay below 2,000 characters to be accessible by a large number of clients and web applications.
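The two thresholds mentioned in this section can be checked with a small helper; the function name and return values below are illustrative, not part of any Audisto API:

```python
BROWSER_SAFE_LENGTH = 2_000  # some browsers cannot handle longer URLs
STORAGE_LIMIT = 60_000       # the crawler stores at most this many characters

def classify_url_length(url):
    """Classify a URL against the limits described above."""
    if len(url) > STORAGE_LIMIT:
        return "URL too long to be stored"
    if len(url) > BROWSER_SAFE_LENGTH:
        return "URL too long for some browsers"
    return "ok"

assert classify_url_length("https://example.com/") == "ok"
assert classify_url_length("https://example.com/?q=" + "a" * 3_000) \
    == "URL too long for some browsers"
assert classify_url_length("https://example.com/?q=" + "a" * 70_000) \
    == "URL too long to be stored"
```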
Uncategorized Exception
Description
This error is displayed if fetching a URL failed in an unexpected way that is currently not properly handled by us. In this case we did not anticipate such an error, and we currently have no way to recover from it or to give additional information.
Importance
Uncategorized exceptions can happen due to the complexity of the software involved, but are very rare. In principle, it is our aim to classify and properly handle all errors that occur and to provide guidance for their elimination in the form of documentation.
We monitor the occurrence of these uncategorized exceptions and add a classification and proper error handling to our software as soon as possible whenever we discover unknown new exceptions.
Operating Instruction
If you see this type of error you can expect us to provide a classification with additional information within a short period of time. Once added the classification and additional information will be part of all new crawls.
Contact us if you experience major crawl problems in combination with this error, so we can prioritize adding a proper exception for this over other development tasks.
Unexpected Content Type
The content was not of the expected content type, e.g. we got an HTML page where an XML sitemap was expected.
Unexpected HTTP Status Code
Description
This error is triggered if we see an unexpected HTTP status code when accessing robots.txt files.
Importance
If you see this error we were unable to process the robots.txt file due to an unexpected HTTP status code. We handle this as an unreachable HTTP status code, and therefore we will assume that crawling within the authority of that robots.txt file is completely denied.
Note: Older and newer specifications for handling robots.txt files differ in how they treat "Unavailable status codes". Depending on the implementation, the "Unavailable status codes" 401 and 403 are interpreted as access completely denied or access completely allowed.
Operating Instruction
Make sure that a valid robots.txt file is present in the root directory for every host and protocol variant. Make sure the HTTP status code of the robots.txt file is always 200 and requests are not blocked e.g. by a firewall or bot detection.
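To verify the status codes across your host and protocol variants, a small standard-library check can help; the origins in the usage example are placeholders:

```python
import urllib.request
import urllib.error

def robots_txt_status(origin, timeout=15):
    """Return the HTTP status code served for /robots.txt on an origin."""
    try:
        with urllib.request.urlopen(origin + "/robots.txt", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

# Example usage (placeholder origins); every variant should report 200:
# for origin in ("https://example.com", "https://www.example.com"):
#     print(origin, robots_txt_status(origin))
```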
Unknown Host
Description
This error is thrown if we were unable to resolve the host to an IP address. The URL can't be retrieved because we don't know which server to connect to.
Importance
This type of error usually indicates one of the following problems:
- There is a link on your website pointing to a URL on a host that does not exist
- The DNS server responsible for the host or domain cannot be reached
- The responsible DNS server can be reached, but there is no DNS entry for the host
- The DNS entry for the host is new and cannot yet be resolved everywhere
Operating Instruction
Make sure to correct broken links on your website. Check the DNS entries for the host. Consider using a service that offers DNS checks from various locations worldwide.
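A quick way to check whether a host resolves at all from your own machine (this only tests your local resolver, not worldwide DNS propagation):

```python
import socket

def host_resolves(host):
    """Return True if the host name resolves to at least one IP address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

assert host_resolves("localhost")
# The .invalid TLD is reserved and never resolves (RFC 2606):
assert not host_resolves("does-not-exist.invalid")
```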
XML Sitemap: Error Parsing Content
Description
This error is thrown if we encounter a response that states to be an XML sitemap but can't be parsed as such. In this case the Content-Type HTTP header indicated that the downloaded document is XML and the XML namespace indicated that the document is either an XML sitemap file or XML sitemap index file, however parsing the file as such failed.
Importance
If you see this error we successfully downloaded a document that states to be XML and has an XML namespace indicating that the document is either an XML sitemap file or an XML sitemap index file. When trying to parse the document we were unable to parse it as a valid XML sitemap or XML sitemap index file.
This usually indicates problems of the following kind:
- There are syntax errors in the file
- Not all required elements are present in the file (e.g. a <loc> is always required and other tags might be required for image, video or hreflang sitemaps)
Note: For extended sitemaps we only check the required URL tags, not other elements (e.g. title) as required by some search engines
Operating Instruction
Validate your XML document with an XML validator. If possible use an XML validator that supports validation against an XSD schema. See the sitemap specification for details.
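Before reaching for a full XSD validation, a minimal well-formedness and <loc> check can be sketched with the standard library (this is only a first-pass check, not a complete sitemap validation):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_errors(xml_text):
    """Return a list of problems found in a sitemap document."""
    errors = []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"syntax error: {err}"]
    for i, url in enumerate(root.findall(f"{{{SITEMAP_NS}}}url")):
        if url.find(f"{{{SITEMAP_NS}}}loc") is None:
            errors.append(f"<url> entry {i} is missing its required <loc>")
    return errors

good = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>https://example.com/</loc></url></urlset>')
bad = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       '<url></url></urlset>')
assert sitemap_errors(good) == []
assert len(sitemap_errors(bad)) == 1
```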
XML: Error Parsing Content
Description
This error is thrown if we encounter a response that states to be XML but can't be parsed as such. In this case the Content-Type HTTP header indicated that the downloaded document is XML, however the downloaded content was not XML, or at least not well-formed XML.
Importance
If you see this error, we successfully established a connection to the server and downloaded a document that states to be XML. When trying to parse the document we were unable to parse it as valid XML.
This usually indicates problems of the following kind:
- Another Content-Type was returned instead of an XML document as stated in the Content-Type HTTP header
- The returned document looks like XML but has syntax errors
- The XML document was encoded with a compression (gzip, deflate) but the Content-Encoding HTTP header was missing
- The XML document was compressed twice e.g. by the application and in addition by the webserver
Note: If you can properly access the affected URL in your browser this often means that your browser managed to guess the Content-Encoding correctly. Bots usually don't have the capability to detect the encoding.
Operating Instruction
Make sure that you send the proper Content-Type HTTP header and the proper Content-Encoding HTTP header. Validate your XML document with an XML validator. If possible use an XML validator that supports validation against an XSD schema.
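The double-compression case in particular can be detected and undone programmatically; the gzip magic bytes `1f 8b` reveal content that is still compressed. This is a diagnostic sketch, not something to rely on in production:

```python
import gzip

def decompress_fully(data, max_rounds=2):
    """Strip accidental layers of gzip compression and report how many were found."""
    rounds = 0
    while data[:2] == b"\x1f\x8b" and rounds < max_rounds:
        data = gzip.decompress(data)
        rounds += 1
    return data, rounds

payload = b'<?xml version="1.0"?><root/>'
once = gzip.compress(payload)
twice = gzip.compress(once)  # e.g. the application and the webserver both compress
assert decompress_fully(twice) == (payload, 2)
assert decompress_fully(payload) == (payload, 0)
```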