Scopes
Retrieving Data with the Audisto Scripting Language
Scopes extract data from different sources and make it accessible for later matching. For example, scopes can target parts of the URL or the HTML of a document.
The following sources are supported:
- URL: Accesses different parts of a URL
- HTTP Response: Accesses information available from the HTTP response, like HTTP status or HTTP headers
- HTML Document: Allows querying the HTML, if provided
- Processed URLs: Allows querying against analysis results, like computed hints.
Scopes return a list of results, which are all of the same type. Possible types are:
- Case Sensitive String: A text that is case sensitive - "a" and "A" are different.
- Case Insensitive String: A text that is case insensitive - "a" and "A" are treated the same.
- Number: A number, for example the HTTP status.
URL Related Scopes
A URL consists of different parts. URL related scopes correspond to these parts.
By setting a scope you define what portion of the URL should be considered for matching against later on.
Assuming a URL like https://www.example.com/some/path?a=b&c=d#top, the parts are (from left to right):
- Scheme: https - without "://"
- Host: www.example.com
- Path: /some/path - note the leading slash
- Query: a=b&c=d - without the question mark
- Fragment: top - without the hash
More generally, a URL is built like this: [Scheme]{://}[Host][Path]{?}[Query]{#}[Fragment], where
some parts may be omitted.
URL related scopes correspond to these parts either directly or indirectly. The following scopes are available:
- URL: The URL as a whole - https://www.example.com/some/path?a=b&c=d#top
- Scheme: The scheme - https
- Host: The host - www.example.com
- Path: The path - /some/path - note the leading slash
- Query: The query - a=b&c=d - note the question mark is not part of the query
- File Name: Both path and query - /some/path?a=b&c=d - note the question mark is part of the file name
- Parameter: Single elements of query - a=b and c=d - note both the question mark and the ampersand are gone. This scope is countable.
- Resource: Provides meta data on the URL and can currently be used only with the matchers IS INTERNAL and IS EXTERNAL
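The mapping from URL parts to scopes can be sketched in Python. This is only an illustration built on the standard library's urllib.parse, not Audisto's implementation; the function name url_scopes and the dictionary layout are assumptions made for this example.

```python
from urllib.parse import urlsplit


def url_scopes(url):
    """Split a URL into values resembling the URL related scopes (illustrative only)."""
    parts = urlsplit(url)
    # File Name: path plus query, including the question mark if a query exists.
    file_name = parts.path + ("?" + parts.query if parts.query else "")
    # Parameter: single elements of the query, without "?" and "&".
    parameters = [p for p in parts.query.split("&") if p]
    return {
        "URL": url,
        "Scheme": parts.scheme,          # without "://"
        "Host": parts.hostname or "",
        "Path": parts.path,              # keeps the leading slash
        "Query": parts.query,            # without the question mark
        "File Name": file_name,
        "Parameter": parameters,         # a list, hence countable
    }


scopes = url_scopes("https://www.example.com/some/path?a=b&c=d#top")
for name, value in scopes.items():
    print(f"{name}: {value}")
```

For the example URL, the File Name entry is /some/path?a=b&c=d and the Parameter entry holds the two elements a=b and c=d.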
Response Related Scopes
If a URL is downloaded, rules can be written against the HTTP response. The possible scopes are:
- HTTP Body: Body of response. See below for details.
- HTTP Header: Single element of HTTP header in the format name: value. Like cache: no-cache. Note: there is always a whitespace between the colon and the header value. The scope provides case insensitive comparison and can be counted.
- HTTP Status: The HTTP status code returned by a request. Like 200 or 404. A numeric value.
- Response Time: The time it took to download a resource in milliseconds. A numeric value.
- Time to First Byte: The time it took until the first byte of the response was received. A numeric value.
- Uncompressed Content Size: The size of the content in bytes. A numeric value.
- Compressed Content Size: The size of the content in bytes after being compressed using gzip. A numeric value.
- MIME Type: The MIME type of the resource, as specified by the Content-Type HTTP header. Like text/html or image/png. The scope provides case insensitive comparison.
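As a rough illustration of the HTTP Header format and of case insensitive comparison, the following Python sketch turns header pairs into name: value strings and compares them while ignoring case. The helper names header_entries and equals_ignoring_case are hypothetical and not part of the Audisto scripting language; the sketch only mirrors the behavior described in the list above.

```python
def header_entries(headers):
    """Format (name, value) pairs as 'name: value' with a single space after the colon."""
    return [f"{name.strip()}: {value.strip()}" for name, value in headers]


def equals_ignoring_case(left, right):
    """Case insensitive equality, so 'a' and 'A' are treated the same."""
    return left.casefold() == right.casefold()


entries = header_entries([
    ("Cache-Control", "no-cache"),
    ("Content-Type", "text/html"),
])

# The scope is countable: each header is one element of the result list.
header_count = len(entries)
has_no_cache = any(
    equals_ignoring_case(entry, "cache-control: NO-CACHE") for entry in entries
)
print(header_count, has_no_cache)
```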
HTTP Body
Scope HTTP Body delivers the content of an HTTP response. The content is in one of the following formats:
- Plain Text - if the HTTP response's MIME type is textual. Textual MIME types are all types of text/*, and all XML, HTML, JavaScript and JSON formats.
- Hexadecimal String - in case of binary formats. Each byte is converted to two lowercase hexadecimal digits, and the resulting string is prefixed with "0x".
For example, all PNG images start with one byte of value 137 (89 hex), followed by the letters P, N, and G from the ASCII charset,
which have code points 80 (50 hex), 78 (4E hex), and 71 (47 hex). The corresponding hexadecimal representation is 0x89504e47
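The PNG example can be reproduced with a short Python sketch. The function name body_as_hex is an assumption made for illustration; it only mimics the described conversion of binary content and is not taken from the product.

```python
def body_as_hex(data: bytes) -> str:
    """Render bytes as lowercase hexadecimal digits with a single leading '0x'."""
    return "0x" + data.hex()


# First four bytes of every PNG file: 137, then the ASCII letters P, N, G.
png_start = bytes([137]) + b"PNG"
hex_body = body_as_hex(png_start)
print(hex_body)
```

A rule that should detect PNG content could therefore test whether the HTTP Body starts with 0x89504e47.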
HTML Related Scopes
Whenever an HTML page is successfully retrieved, the following HTML related scopes are available:
- HTML: The full HTML document of the current page. This scope can be used with HTML related matchers.
- Text: The full content of HTML document, stripped of all tags. This scope is case sensitive.
- Charset: Charset of document. Like UTF-8 or ISO-8859-1. The scope provides case insensitive comparison.
- Language: Language of document, if set. Like en-US or de. The scope provides case insensitive comparison.
- Meta Robots: <meta robots> directives. Returns computed robots directives. The scope provides case insensitive comparison. See below for details.
- Title: The page's title, if set. This scope is case sensitive.
- Meta Description: The page's meta description, if set. This scope is case sensitive.
- H1: The page's <h1> headings, if set. This scope is case sensitive, and can be counted.
- Heading: The page's headings (<h1> to <h6>), if set. This scope is case sensitive, and can be counted.
Turning HTML into text
If HTML is turned into text, like with the Text, Title, or H1 scopes,
only the content inside tags is considered. All attribute values are ignored. The content of <script> is
also ignored.
For Text, the content of the <title> tag is therefore included, while the content of
<meta name="description"> is not.
After text is extracted, all whitespace is unified. This means that any whitespace, like tabs or line breaks, is turned into a space character. After that, repeated space characters are replaced by a single one. Leading and trailing whitespace is removed. This is similar to how a browser would display whitespace.
For example
<p>Hello World</p>
<p> Hello World </p>
<p> <span title="Say it">Hello</span> World </p>
would all result in the same extracted text of "Hello World".
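The extraction and whitespace rules above can be approximated with Python's standard html.parser module. This is a minimal sketch under the assumption that text nodes are simply concatenated; the class TextExtractor and the function html_to_text are hypothetical names, and the real extraction may differ in details.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text between tags, ignoring attribute values and <script> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        # Attribute values in attrs are deliberately not collected.
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Turn every run of whitespace into one space, then trim both ends.
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


samples = [
    "<p>Hello World</p>",
    "<p>  Hello\n\tWorld  </p>",
    '<p> <span title="Say it">Hello</span>   World </p>',
]
texts = [html_to_text(sample) for sample in samples]
print(texts)
```

All three samples yield the same text, Hello World, and the title attribute value "Say it" does not appear in the result.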
Checking against meta robots
The meta robots scope works against computed robots directives. This means missing directives are filled up with default values.
The following directives will always be added:
- index: If neither index nor noindex is set
- follow: If neither follow nor nofollow is set
- archive: If neither archive nor noarchive is set
For example, a robots declaration like
<meta name="robots" content="nofollow">
would result in a computed robots directive of index, nofollow, archive.
Since a constant order of robots meta directives cannot be ensured, they must be queried one by one using
AND, like so:
CLUSTER "noindex nofollow" IS: Meta Robots Equals "noindex" AND Meta Robots Equals "nofollow"
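How the default values are filled in can be sketched in Python. The function computed_robots and the constant DEFAULT_PAIRS are assumptions made for this illustration, not the actual implementation; the sketch only covers the three defaults listed above and compares directives one by one, independent of their order.

```python
# Each pair is (default directive, its opposite), as listed above.
DEFAULT_PAIRS = [
    ("index", "noindex"),
    ("follow", "nofollow"),
    ("archive", "noarchive"),
]


def computed_robots(content):
    """Parse a robots content attribute and add missing default directives."""
    directives = [d.strip().lower() for d in content.split(",") if d.strip()]
    for default, opposite in DEFAULT_PAIRS:
        if default not in directives and opposite not in directives:
            directives.append(default)
    return directives


directives = computed_robots("nofollow")
print(directives)

# Order is not guaranteed, so each directive is checked on its own,
# which corresponds to combining two Equals checks with AND.
strict = computed_robots("NOINDEX, nofollow")
is_noindex_nofollow = "noindex" in strict and "nofollow" in strict
print(is_noindex_nofollow)
```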
Processed Related Scopes
Whenever an HTML page is successfully retrieved and analyzed, the following scopes are available:
- Hint: All triggered hints, if any. Note that not all hints may be available here, since the scope is still evaluated during the crawling stage.
- Content Class: Content class (e.g. HTML, Image, Sitemap) of URL.
- Image Width: Width of an image. A numeric value, measured in pixels.
- Image Height: Height of an image. A numeric value, measured in pixels.
For rendered HTML pages some more scopes are available:
- First Contentful Paint: Web vitals first contentful paint metric. A numeric value, measured in milliseconds.
- Largest Contentful Paint: Web vitals largest contentful paint metric. A numeric value, measured in milliseconds.
- Cumulative Layout Shift: Web vitals cumulative layout shift metric. A numeric value.
- Total Blocking Time: The time the rendering process is blocked from accepting user interaction. A numeric value, measured in milliseconds.
- Finish Time: Time it took to render the page. A numeric value, measured in milliseconds.
- Rendered HTML Size: Size of rendered HTML. A numeric value, measured in bytes.
- Browser Console Message: List of messages from all browser console entries. A case insensitive string.
- Browser Console Type: List of types from all browser console entries. A case insensitive string. Type usually is one of "log", "warning", "error", "debug", or "info".
- Browser Console Source: List of sources from all browser console entries. A case insensitive string.
- JavaScript Error Message: List of messages from all JavaScript errors. A case insensitive string.
- JavaScript Error Type: List of types from all browser JavaScript errors. A case insensitive string.
- JavaScript Error Stack Trace: List of stack traces from all browser JavaScript errors. A case insensitive string.
- Browser Dialog Message: List of messages from all browser dialogs. A case insensitive string.
- Browser Dialog Type: List of types from all browser dialogs. A case insensitive string. Type usually is one of "alert", "confirm", "prompt", or "beforeunload".
- Failed Rendering Request Message: List of error messages from all failed rendering requests. A case insensitive string.
- Failed Rendering Request HTTP Method: List of HTTP methods from all failed rendering requests. A case insensitive string. The method usually is one of "GET", "POST", "PUT", "DELETE", "HEAD", "PATCH", "OPTIONS", "TRACE", or "CONNECT".
- Failed Rendering Request URL: List of failed rendering request URLs. A case insensitive string.
- Failed Rendering Request HTTP Status: List of HTTP status codes from all failed rendering requests. A numeric value.
Some values are categorized into the three categories "good", "improvable" and "poor". There is a corresponding scope for each of these values:
- First Contentful Paint Category: Category of web vitals first contentful paint metric.
- Largest Contentful Paint Category: Category of web vitals largest contentful paint metric.
- Cumulative Layout Shift Category: Category of web vitals cumulative layout shift metric.
- Total Blocking Time Category: Category of the time the rendering process is blocked from accepting user interaction.
- Finish Time Category: Category of the time it took to render the page.
- Time to First Byte Category: Category of the time it took until first byte of response was received.