Connection Handling

Set the protocol and connection handling method

The Audisto Crawler can be configured to use specific protocols and connection methods during crawling. This is done using the Connection Handling setting when configuring a project or a crawl.

Configuration dialog for connection handling

Connection methods

The following connection settings are available:

  • HTTP 1.1
  • HTTP 1.1 with connection reuse support

The default setting for new projects is HTTP 1.1.

With the default setting "HTTP 1.1", our crawler opens a new connection for every HTTP request and closes it after the request completes. This adds the overhead of establishing a connection before each HTTP request is sent. The latency for establishing the connection is included in the measured response time.

With the setting "HTTP 1.1 with connection reuse support", our crawler creates the number of connections specified as Parallel Downloads and uses them for HTTP requests. The connections are established with the first requests and kept open so they can be reused for subsequent requests. This reduces the overhead of establishing new connections. The latency for establishing a connection is only included in the measured response time of the first request that uses that connection.
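The difference between the two modes can be sketched with Python's standard library. This is only an illustration of the connection behavior, not Audisto's implementation; the throwaway local server stands in for the site being crawled.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Throwaway local server so the sketch is self-contained.
class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # required for keep-alive support

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# "HTTP 1.1": open a fresh TCP connection per request, close it afterwards.
bodies_per_request = []
for _ in range(3):
    conn = http.client.HTTPConnection(host, port)
    conn.request("GET", "/", headers={"Connection": "close"})
    bodies_per_request.append(conn.getresponse().read())
    conn.close()

# "HTTP 1.1 with connection reuse support": one persistent connection,
# reused for every subsequent request (keep-alive).
reused = http.client.HTTPConnection(host, port)
bodies_reused = []
for _ in range(3):
    reused.request("GET", "/")
    bodies_reused.append(reused.getresponse().read())
reused.close()

server.shutdown()
```

In the first loop, every iteration pays for a TCP handshake; in the second, only the first request does, because the same connection serves all three requests.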

Implications of the connection settings

Response time measurements

Since the overhead for establishing the connection is always measured with the "HTTP 1.1" setting, all measured response times are comparable. Including the connection overhead in the measurement also matches what a user experiences when visiting the page for the first time.

In comparison, with the "HTTP 1.1 with connection reuse support" setting, the overhead for establishing a connection is only measured for HTTP requests that open a new connection. This applies to the first requests, until the number of connections set as Parallel Downloads has been established. New connections are also opened if a connection breaks during the crawl, e.g. because the server actively closes open connections after a certain number of requests; the connection overhead is then measured again for the requests that re-establish them. With this setting the measured response times are not always comparable, but crawling the site is more efficient and uses fewer resources.
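This measurement effect can be demonstrated with a small timing sketch (standard library only, against a throwaway local server; this is an illustration, not how Audisto measures internally). On a reused connection, only the first request's time includes the TCP connection setup:

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Throwaway local server standing in for the crawled site.
class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enable keep-alive

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

conn = http.client.HTTPConnection(host, port)
timings = []
for _ in range(5):
    start = time.perf_counter()
    conn.request("GET", "/")  # the first call also opens the TCP connection
    conn.getresponse().read()
    timings.append(time.perf_counter() - start)
conn.close()
server.shutdown()

# timings[0] includes the connection setup; timings[1:] only measure
# the request/response exchange on the already-open connection.
```

If the server closed the connection mid-crawl, the next request would have to reconnect and its timing would again include the setup latency, which is exactly the comparability caveat described above.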

Number of open connections

Web servers have a setting for the maximum number of simultaneous connections they accept. While connections are repeatedly closed and re-established with the "HTTP 1.1" setting, the "HTTP 1.1 with connection reuse support" setting tries to keep the number of connections specified as Parallel Downloads open permanently. For the entire duration of the crawl, these connections are occupied by our crawler and are not available to other users.

If the value for the maximum number of simultaneous connections is too low, our crawler can occupy a significant share of the available connections. If the maximum is fully exhausted, further connection attempts are rejected by the server; the site is then inaccessible to new visitors until at least one connection is closed.

Firewalls

Some firewalls block IP addresses that establish a large number of new connections in a short time, as this behavior is often observed in Denial of Service attacks. If there are many errors in the crawl's error log that indicate connection problems, such as "No Connection" or "Too many Requests", it can be helpful to switch to the "HTTP 1.1 with connection reuse support" setting. This setting drastically reduces the number of newly established connections and avoids blocking by such firewall rules.