Scopes

Retrieving Data with Audisto Scripting Language

Scopes extract data from different sources and make it accessible for later matching. For example, scopes can target parts of the URL or the HTML of a document.

The following sources are supported:

  • URL: Accesses different parts of a URL
  • HTTP Response: Accesses information available from the HTTP response, such as the HTTP status or HTTP headers
  • HTML Document: Allows querying the HTML, if provided
  • Processed URLs: Allows querying against analysis results, like computed hints.

Scopes return data of different types:

  • Case Sensitive String: A text that is case sensitive - "a" and "A" are different.
  • Case Insensitive String: A text that is case insensitive - "a" and "A" are treated the same.
  • Number: A number, such as the HTTP status.

URL Related Scopes

A URL consists of different parts. URL related scopes correspond to these parts.

By setting a scope, you define which portion of the URL is considered for later matching.

Assuming a URL like http://www.example.com/some/path?a=b&c=d#top, the parts are (from left to right):

  • Protocol: http://
  • Host: www.example.com
  • Path: /some/path - note the leading slash
  • Query: a=b&c=d - without the question mark
  • Fragment: top - without the hash

More generally, a URL is built like this: [Protocol][Host][Path]{?}[Query]{#}[Fragment], where some parts may be omitted.

URL related scopes correspond to these parts either directly or indirectly. The following scopes are available:

  • URL: The URL as a whole - http://www.example.com/some/path?a=b&c=d#top
  • Host: The host - www.example.com
  • Path: The path - /some/path - note the leading slash
  • Query: The query - a=b&c=d - note the question mark is not part of the query
  • File Name: Both path and query - /some/path?a=b&c=d - note the question mark is part of the file name
  • Parameter: Single elements of the query - a=b and c=d - note that both the question mark and the ampersand are removed. This scope is countable.
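
The mapping from URL parts to scopes can be illustrated with a minimal Python sketch. This is not Audisto's implementation; the function name url_scopes and the returned dictionary are assumptions made purely for illustration, and only the standard library is used.

```python
from urllib.parse import urlsplit

def url_scopes(url):
    """Split a URL into values resembling the URL related scopes (illustrative only)."""
    parts = urlsplit(url)
    # File Name: path plus query, with the question mark kept when a query exists
    file_name = parts.path + ("?" + parts.query if parts.query else "")
    return {
        "URL": url,
        "Host": parts.netloc,  # assumption: no port or credentials in the URL
        "Path": parts.path,    # keeps the leading slash
        "Query": parts.query,  # without the question mark
        "File Name": file_name,
        # Parameter: one entry per query element, so the entries can be counted
        "Parameter": parts.query.split("&") if parts.query else [],
    }

scopes = url_scopes("http://www.example.com/some/path?a=b&c=d#top")
for name, value in scopes.items():
    print(name, "->", value)
```

For the example URL, the Parameter entry holds two elements, a=b and c=d, which is what makes that scope countable.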

Response Related Scopes

If a URL is downloaded, rules can be written against the HTTP response. The possible scopes are:

  • HTTP Header: A single HTTP header in the format name: value, like cache: no-cache. Note: there is always a space between the colon and the header value. The scope provides case insensitive comparison and can be counted.
  • HTTP Status: The HTTP status code returned by a request. Like 200 or 404. A numeric value.
  • Response Time: The time it took to download a resource in milliseconds. A numeric value.
  • Uncompressed Content Size: The size of the content in bytes. A numeric value.
  • Compressed Content Size: The size of the content in bytes after being compressed using gzip. A numeric value.
  • Mime Type: The mime type of the resource, as specified by the Content-Type HTTP header. Like text/html or image/png. The scope provides case insensitive comparison.
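
To make the HTTP Header format concrete, the following Python sketch renders headers as name: value entries and compares them without regard to case. The helper names header_entries and count_equal and the sample headers are hypothetical; how Audisto builds and compares these entries internally is not described here.

```python
def header_entries(headers):
    """Render (name, value) pairs as 'name: value' strings, one entry per header."""
    return [f"{name}: {value.strip()}" for name, value in headers]

def count_equal(entries, expected):
    """Count the entries that equal `expected`, ignoring case."""
    return sum(1 for entry in entries if entry.casefold() == expected.casefold())

# Hypothetical response headers, used only for illustration
raw_headers = [("Cache", "no-cache"), ("Content-Type", "text/html")]
entries = header_entries(raw_headers)

# Case insensitive comparison: differences in letter case do not matter
matches = count_equal(entries, "cache: NO-CACHE")
print(entries, matches)
```

Because every header becomes its own entry, the number of entries (or of matching entries) can be counted.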

HTML Related Scopes

Whenever an HTML page is successfully retrieved, the following HTML related scopes are available:

  • HTML: The full HTML document of the current page. This scope can be used with HTML related matchers.
  • Charset: The charset of the document, like UTF-8 or ISO-8859-1. The scope provides case insensitive comparison.
  • Language: The language of the document, if set, like en-US or de. The scope provides case insensitive comparison.
  • Meta Robots: <meta robots> directives. Returns computed robots directives. The scope provides case insensitive comparison. See below for details.
  • Title: The page's title, if set. This scope is case sensitive.
  • Meta Description: The page's meta description, if set. This scope is case sensitive.
  • H1: The page's <h1> headings, if set. This scope is case sensitive and can be counted.
  • Heading: The page's headings (<h1> to <h6>), if set. This scope is case sensitive and can be counted.

Checking against meta robots

The meta robots scope works against computed robots directives. This means missing directives are filled in with default values.

The following directives will always be added:

  • index: If neither index nor noindex is set
  • follow: If neither follow nor nofollow is set
  • archive: If neither archive nor noarchive is set

For example, a robots declaration like

<meta name="robots" content="nofollow">

would result in a computed robots directive of index, nofollow, archive.

Since a constant order of robots meta directives cannot be ensured, they must be queried one by one using AND, like so:

CLUSTER "noindex nofollow" IS:
  Meta Robots Equals "noindex" AND Meta Robots Equals "nofollow"
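
The default filling and the order-independent check can be sketched in Python as follows. This is only an illustration of the rules described in this section, not Audisto's implementation; the function name computed_robots is an assumption, and special values such as all or none are deliberately not handled.

```python
def computed_robots(content):
    """Return the computed directives for a robots content value as a set."""
    directives = {d.strip().lower() for d in content.split(",") if d.strip()}
    # Add the default for each pair only when neither variant is present
    for positive, negative in (
        ("index", "noindex"),
        ("follow", "nofollow"),
        ("archive", "noarchive"),
    ):
        if positive not in directives and negative not in directives:
            directives.add(positive)
    return directives

computed = computed_robots("nofollow")
print(sorted(computed))  # ['archive', 'index', 'nofollow']

# Each directive is checked on its own and the checks are combined with AND,
# so the order of directives in the source document does not matter.
is_noindex_nofollow = "noindex" in computed and "nofollow" in computed
print(is_noindex_nofollow)  # False, because index was filled in as the default
```

Using a set mirrors the reason for querying directives one by one: a set has no order, so nofollow, noindex and noindex, nofollow produce the same result.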

Processed Related Scopes

Whenever an HTML page is successfully retrieved and analyzed, the following scopes are available:

  • Hint: All triggered hints, if any.

Note that not all hints may be available here, since the scope is still evaluated during the crawling stage.