Rendering HTML pages using the Audisto Crawler

Main goals

Our main goals regarding rendering are:

  • Accuracy: We use a very recent Chrome browser to render pages the way search engines and your users see them.
  • Compliance: We respect the robots.txt exclusion protocol for every host, apply your URL rewrites and follow the same rules as during normal crawling.
  • Efficiency: To prevent your servers from being flooded by requests, we identify as many resources as possible in advance, prefetch them gracefully and cache them.

Before starting

Rendering a page in a browser is quite different from just downloading and parsing the HTML:

  • Rendering often takes notably more time than downloading - think seconds versus milliseconds. Expect rendering-enabled crawls to be slower by a factor of ten or more.
  • Rendering often discovers notably more URLs. Expect crawling limits to be reached earlier.
  • Rendering may request tracking URLs, distorting statistics your business decisions may rely on.
  • Rendering may request ads, distorting statistics, since ads are displayed to a bot. This can make your website less attractive for advertisers, leading to less revenue.
  • Rendering downloads all resources required to render - including external URLs, e.g. from your CDN. This creates additional traffic that you might be charged for.

Not all ad or tracking services shield themselves by disallowing bot access (e.g. through robots.txt). You can block them using URL rewrites.

To suppress requests to a host, use URL rewrites: to exclude the host from crawling, create a rewrite rule that matches URLs ending with the host's name and sets either the "Do Not Crawl" or the "Ignore" directive.

Example of a URL rewrite that excludes all URLs from the host myanalytics.service from crawling
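Conceptually, such a rule is a pattern match on the URL's host. A minimal sketch in Python (the pattern and directive names are illustrative; Audisto's actual rewrite syntax may differ):

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule: ignore every URL on myanalytics.service
# or one of its subdomains.
BLOCKED_HOST = re.compile(r"(^|\.)myanalytics\.service$")

def directive_for(url: str) -> str:
    """Return the rewrite directive to apply to this URL."""
    host = urlsplit(url).hostname or ""
    if BLOCKED_HOST.search(host):
        return "Ignore"  # or "Do Not Crawl"
    return "Crawl"

directive_for("https://www.myanalytics.service/track.js")  # "Ignore"
directive_for("https://www.example.com/page.html")         # "Crawl"
```

Matching on the parsed hostname rather than the raw URL string avoids false positives when the host's name appears in a path or query string.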

Enabling rendering

To use rendering, navigate to the "Advanced" tab in project or crawl settings and check the checkbox "Enable rendering and JavaScript execution".

Specify whether requests using non-GET HTTP request methods (e.g. POST, PUT or DELETE) should be performed. If this is not set, we will block these requests.

Enable rendering and configure non-GET HTTP request handling by checking the boxes in project or crawl settings

Rendering stages

Prefetching

The first step is to download the HTML page. For full control, this is done by our crawler, not by a browser. The downloaded HTML is cached afterward.

After downloading, we identify all possible resources - like images, scripts, CSS files etc. - and start downloading and caching them right away.

Every resource found in this phase creates a link between the HTML page and the resource. The link is assigned the context "Prefetching".

The HTML page is set to the status "Waiting to be Rendered". After all known resources are available, the page is handed over to rendering.

This is what happens during prefetching:

Status                  Activity
Unprocessed             Downloading HTML and putting it into cache; extracting all resources like images, scripts, videos, ...
Waiting to be Rendered  Downloading all previously identified resources and putting them into cache; sending the HTML page to the browser
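The prefetching step above can be sketched as follows (a simplified illustration, not Audisto's actual implementation): parse the downloaded HTML once and collect every resource URL a browser would request, so the cache can be filled before rendering starts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceExtractor(HTMLParser):
    """Collects URLs of resources a browser would request while rendering."""

    # Tag -> attribute that carries the resource URL (simplified subset).
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr = self.RESOURCE_ATTRS.get(tag)
        if attr:
            for name, value in attrs:
                if name == attr and value:
                    # Resolve relative URLs against the page URL.
                    self.resources.append(urljoin(self.base_url, value))

html = ('<html><head><link href="/style.css">'
        '<script src="app.js"></script></head>'
        '<body><img src="https://cdn.example.com/logo.png"></body></html>')
extractor = ResourceExtractor("https://www.example.com/")
extractor.feed(html)
# extractor.resources now lists every prefetch candidate; each entry also
# becomes a link with the context "Prefetching".
```

A real crawler handles many more tag/attribute pairs (srcset, video sources, CSS imports, ...), but the principle is the same: everything discoverable statically is fetched and cached before the browser runs.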

Rendering

After all known resources have been downloaded, the HTML page is sent to a browser, which starts to render it.

Every page is rendered in a fresh browser instance, to avoid side effects, e.g. from cookies set or browser history.

For every request issued by the browser we ensure that:

  • robots.txt directives are respected
  • URL rewrites are applied
  • Cached content is used if available
  • Restrictions for unverified hosts are respected

Every resource requested by the browser creates a link between the HTML page and the resource. The link is assigned the context "Rendering".

If a resource requested by the browser is not cached, it is downloaded and cached on the fly. The link between the HTML page and this resource is assigned the context "Direct Rendering".

After rendering is successfully finished, the rendered HTML is analyzed, and the page status becomes "Crawled".

If rendering failed, e.g. because the browser ran out of memory, the HTML page is set to the error "Rendering: Error Rendering Content".

This is what happens during rendering:

Status                  Activity
Waiting to be Rendered  Send HTML page to browser; start rendering; serve requests from cache or download directly; process rendered HTML
Crawled/Error           Set status of HTML page to either "Crawled" or "Error"
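The per-request checks listed above can be sketched as a small gatekeeper function. This is illustrative only (function names and return values are assumptions); robots.txt handling here uses Python's standard urllib.robotparser.

```python
from urllib import robotparser

def serve_browser_request(url, robots, rewrites, cache):
    """Decide how to answer a single request issued by the browser.

    robots   - a RobotFileParser with the host's robots.txt loaded
    rewrites - callable applying URL rewrite rules; returns None to block
    cache    - dict mapping URL -> previously downloaded body
    """
    url = rewrites(url)
    if url is None:
        return ("blocked", None)      # a rewrite rule excluded this URL
    if not robots.can_fetch("*", url):
        return ("blocked", None)      # robots.txt disallows it
    if url in cache:
        return ("cache", cache[url])  # link context "Rendering"
    body = download(url)              # link context "Direct Rendering"
    cache[url] = body                 # cache on the fly for later pages
    return ("direct", body)

def download(url):
    # Placeholder for the real HTTP fetch.
    return b""
```

The key design point is that the browser never talks to the network directly: every request passes through the same compliance checks as normal crawling, and only cache misses trigger a "Direct Rendering" download.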

Rendering with incomplete resources

During rendering, downloading a resource on the fly may result in a recoverable error. We call this an "incomplete resource".

Recoverable errors include, for example:

  • Timeouts
  • Server errors (5xx HTTP status codes)
  • Connection problems

The crawler's policy is to try to recover such errors by re-downloading the resources after some time.

Recovering from errors is not possible in the context of Direct Rendering without greatly distorting the rendering result. Therefore:

  • The HTML page is set to the error "Rendering: Incomplete Resources"
  • The link between the HTML page and an incomplete resource is marked by the property "Forces Rendering Retry"
  • Incomplete resources are downloaded again.

To indicate which resources were incomplete during rendering, the link between the HTML page and the resource is marked with a red "reload" icon.

A resource that forced a rendering retry shows a red reload icon next to the context "Direct Rendering"

If all erroneous resources are either recovered or permanently set to status "Error", the rendering is retried. The number of rendering retries for an HTML page is limited.

If the last rendering retry still contains incomplete resources:

  • Rendering uses the incomplete resources to determine the rendered HTML
  • The page status is set to "Crawled"
  • A hint "Rendering: Incomplete Resources" is added.

The error "Rendering: Incomplete Resources" will always recover. If you see this error in your crawl, also check for the hint of the same name.

This is an example of what happens when incomplete resources are encountered:

Status                  Activity
Waiting to be Rendered  Download requests directly; an image has a recoverable error
Error                   Set page to the error "Rendering: Incomplete Resources"; download the image again
Waiting to be Rendered  Send HTML page to browser again
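The retry policy described above can be sketched as a loop (illustrative only; MAX_RETRIES and the callback names are assumptions, the real limit is internal to Audisto):

```python
MAX_RETRIES = 3  # hypothetical limit; the actual value is internal

def render_with_retries(page, render, redownload):
    """Render a page, retrying while recoverable resource errors occur.

    render(page)      -> (html, incomplete) where incomplete lists the
                         resources that hit a recoverable error during
                         Direct Rendering
    redownload(items) -> re-fetches the incomplete resources into the cache
    """
    for attempt in range(1, MAX_RETRIES + 1):
        html, incomplete = render(page)
        if not incomplete:
            return html, "Crawled", None
        if attempt == MAX_RETRIES:
            # Last retry: keep the rendered HTML anyway, set the page to
            # "Crawled", and attach the hint instead of the error.
            return html, "Crawled", "Rendering: Incomplete Resources"
        # Otherwise: the page carries the error "Rendering: Incomplete
        # Resources", the affected links are marked "Forces Rendering
        # Retry", and the resources are downloaded again before the
        # next attempt.
        redownload(incomplete)
```

Once the incomplete resources either recover or end up as permanent errors, the next attempt renders from the cache as usual, which is why the error status always resolves eventually.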

Limitations

The Audisto Crawler currently has some limitations.

  • HTTP requests using an HTTP request method other than GET (POST, PUT, DELETE, etc.) are performed but not cached, and our user interface does not provide detailed information such as request payload, HTTP headers, or response times for these requests.
  • Cache-related HTTP headers are ignored.