History Data in Audisto API 2.0

The History API returns aggregated crawl data over time. The general path is

/2.0/histories/

This returns a list of history objects. Each object contains the history's crawl and scheduling settings and mostly resembles a crawl object.

To access a single history, use

/2.0/histories/{id_history}/

History data can be accessed using:

/2.0/histories/{id_history}/clusters/{reference_id}/{aggregate_id}

It returns a history data list, which is described below.

Building the Path

History Cluster References

The Audisto history is tracked per cluster. Since a cluster's ID may change from crawl to crawl, clusters are referenced using a fixed string. These reference IDs are user-defined. In addition, a few default clusters provide a fixed reference ID:

  • !root: The cluster "All Pages"
  • !external: The cluster containing all external pages
  • !internal: The cluster containing all internal pages

Aggregate IDs

History data is grouped by aggregates, which mostly but not necessarily correspond to properties of crawled URLs. A full list can be found below.
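
As a minimal sketch, the path template above can be filled in with a small helper. The function name history_data_path and the history ID 42 are assumptions made for illustration only; the reference ID !root and the aggregate ID 2 (HTTP Status) come from this document.

```python
def history_data_path(id_history, reference_id, aggregate_id):
    """Build the path for one aggregate of one cluster in a history.

    Hypothetical helper: it only fills in the path template from this
    document and does not contact the API.
    """
    return f"/2.0/histories/{id_history}/clusters/{reference_id}/{aggregate_id}"


# 42 is a made-up history ID; "!root" and aggregate 2 are taken from the text.
print(history_data_path(42, "!root", 2))  # /2.0/histories/42/clusters/!root/2
```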

Return Format

The JSON format returned consists of the following members:

  • type: An extended enumeration value
  • items: An array of data points

The JSON format is not fixed and may be extended at any time as new features are added. It will, however, always remain backward compatible.

Type

The type is an extended enumeration value with the following properties:

  • id: The aggregate ID
  • value: Name of aggregate
  • description: A description or explanation for this aggregate
  • unit_value: The unit of the value, e.g. "Level" for all level based aggregates
  • unit_aggregated: The unit of the aggregated value, e.g. "URLs" for counters
  • member_name: The name of a single data point value. This is mostly the same as the name, or a singular form of it.

The extended properties are only returned for deep responses. If deep=0, only a standard enumeration is returned.

Data Points

Data points have the following properties:

  • id: A unique ID for this data point
  • value: The value of this data point. The format depends on the aggregate type: it can be an integer, a floating point value, or an enumeration. For the HTTP status, for example, the value would be 200, 301, 404 or similar.
  • aggregated: The aggregated value, usually a counter of affected URLs. Numeric, but not necessarily an integer. For averages, for example, these are floating point values.

If a deep response is requested, each data point is extended by

  • execution_time: Date and time of the originating crawl
  • id_crawl: ID of originating crawl

Filtering

History data can be filtered.

The following filters are supported:

  • value: Filter by value. Values are generally integers, but their meaning may differ (e.g. enumeration ID), depending on the aggregate.
  • range: A range. See below.
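
A hedged sketch of how the query string could be assembled on the client side. The helper names value_filter and history_query are hypothetical. The list syntax value:[a,b,c] follows the examples in this document. Leaving the filter punctuation unescaped is an assumption made to match those examples; whether the API also accepts percent-encoded filters is not stated here.

```python
from urllib.parse import urlencode


def value_filter(values):
    """Format a value filter: 'value:2' for one value, 'value:[a,b]' for several."""
    values = list(values)
    if len(values) == 1:
        return f"value:{values[0]}"
    return "value:[" + ",".join(str(v) for v in values) + "]"


def history_query(filter_expr=None, deep=False):
    """Build the query string (without the leading '?')."""
    params = {}
    if deep:
        params["deep"] = 1
    if filter_expr is not None:
        params["filter"] = filter_expr
    # Assumption: keep ':', brackets, parentheses and commas unescaped,
    # as they appear in this document's examples.
    return urlencode(params, safe=":[](),")


print(history_query(value_filter([301, 302, 303, 307, 308]), deep=True))
# deep=1&filter=value:[301,302,303,307,308]
```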

Ranges

History data can be filtered by range. A range can be

  • offset- or date-based
  • inclusive or exclusive

The general syntax of a range is

{opening boundary}{start}..{end}{closing boundary}

A boundary marks its side of the range as inclusive (the start or end value is part of the selection) or exclusive (it is not). Allowed boundaries are:

  • (: Start of exclusive range
  • ): End of exclusive range
  • [: Start of inclusive range
  • ]: End of inclusive range

Inclusive boundaries are the default and can be omitted.

For range start and end, integers and dates are supported.

Integers are treated as offsets, counting from the latest (that is, the most recent) data source. Offsets are zero-based.

Dates must be provided in ISO 8601 format.

Calendar dates are supported in the format YYYY-MM-DD. The shortened format YYYYMMDD is not supported.

Time and timezone are optional. A time may only be specified up to the level of seconds; milliseconds are not supported. Hours, minutes and seconds must be separated by colons, as in HH:MM:SS. The shortened format HHMMSS is not supported.
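
The range syntax can be sketched as a small formatter. The names format_bound and build_range are hypothetical. Integers are passed through as offsets, and dates are written in the extended ISO 8601 format with sub-second precision dropped, as required above.

```python
from datetime import date, datetime


def format_bound(bound):
    """Format one range bound: an integer offset or an ISO 8601 date/time."""
    if isinstance(bound, int):
        return str(bound)
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(bound, datetime):
        # Sub-second precision is not supported, so drop it.
        return bound.replace(microsecond=0).isoformat()
    if isinstance(bound, date):
        return bound.isoformat()
    raise TypeError(f"unsupported range bound: {bound!r}")


def build_range(start, end, include_start=True, include_end=True):
    """Return a range expression such as '[2019-02-01..2019-03-01)'."""
    opening = "[" if include_start else "("
    closing = "]" if include_end else ")"
    return f"{opening}{format_bound(start)}..{format_bound(end)}{closing}"


print(build_range(0, 10, include_end=False))
# [0..10)
print(build_range(date(2019, 2, 1), date(2019, 3, 1), include_end=False))
# [2019-02-01..2019-03-01)
```

The sketch always writes both boundaries explicitly, although inclusive boundaries could be omitted.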

Examples

Return counters for all HTTP statuses over time:

/2.0/histories/{id}/clusters/!root/2?deep=1

Return number of crawled URLs over time:

/2.0/histories/{id}/clusters/!root/1?deep=1&filter=value:2

Return aggregated counters for all redirects:

/2.0/histories/{id}/clusters/!root/2?deep=1&filter=value:[301,302,303,307,308]

Get data for last ten crawls:

?filter=range:0..10)

Get data for February 2019:

?filter=range:[2019-02-01..2019-03-01)

Get data from 4th February 2019, 8:00 to 18th February 2019, 9:15:

?filter=range:[2019-02-04T08..2019-02-18T09:15]

Get last ten crawls from February 2019:

?filter=range:[2019-02-01..9]

or

?filter=range:2019-02-01..9

Note that an inclusive end offset of 9 is the same as an exclusive end offset of 10.
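
The equivalence of an inclusive offset of 9 and an exclusive offset of 10 can be illustrated with a client-side model of offset selection. This is only a sketch of the semantics described in this document, not the server's implementation. The function name select_by_offset and the crawl labels are made up.

```python
def select_by_offset(crawls_newest_first, start, end,
                     include_start=True, include_end=True):
    """Select items by zero-based offsets, where offset 0 is the latest crawl."""
    first = start if include_start else start + 1
    stop = end + 1 if include_end else end
    return crawls_newest_first[first:stop]


# Made-up labels; crawl-0 stands for the most recent crawl.
crawls = [f"crawl-{n}" for n in range(12)]

inclusive = select_by_offset(crawls, 0, 9)                     # models [0..9]
exclusive = select_by_offset(crawls, 0, 10, include_end=False)  # models [0..10)

print(inclusive == exclusive)  # True
print(len(inclusive))          # 10
```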

Appendix

List of Aggregate IDs

The following aggregate IDs are currently in use:

  • 1: Status
  • 2: HTTP Status
  • 3: Response Time
  • 4: Discovered URLs per Level
  • 5: Crawled URLs per Level
  • 6: Hints
  • 7: Links
  • 8: Content Size Uncompressed
  • 9: Content Size Compressed
  • 10: Duplicate Content Groups Counter
  • 11: Duplicate Content Pages Counter
  • 12: Totals
  • 13: Language
  • 14: Charset
  • 15: Content Type
  • 16: PageRank per Level
  • 17: CheiRank per Level
  • 18: Total Ranks
  • 19: Indexable
  • 20: Discovered URLs per User-Level
  • 21: Crawled URLs per User-Level
  • 22: Level Relation
  • 23: Averages
  • 24: Check Passed URLs
  • 25: Check Failed URLs
  • 26: Isolation Levels
  • 27: Errors
  • 28: Check Results
  • 29: Requirement Results
  • 30: Check Overall Results
  • 31: PageRank per Status
  • 32: CheiRank per Status
  • 33: PageRank per HTTP Status
  • 34: CheiRank per HTTP Status
  • 35: PageRank per Indexability
  • 36: CheiRank per Indexability
  • 37: PageRank per Isolation Level
  • 38: CheiRank per Isolation Level
  • 39: PageRank per User-Level
  • 40: CheiRank per User-Level
  • 41: PageRank per Host
  • 42: CheiRank per Host
  • 43: PageRank per Internal Host
  • 44: CheiRank per Internal Host
  • 45: PageRank per External Host
  • 46: CheiRank per External Host
  • 47: URL Rewrites
  • 48: PageRank crawled URLs per Level
  • 49: CheiRank crawled URLs per Level
  • 50: PageRank crawled URLs per User-Level
  • 51: CheiRank crawled URLs per User-Level
  • 52: Hreflang Groups
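
To avoid bare numbers in client code, the IDs can be given names. The sketch below covers only a handful of entries from the list above. The class name Aggregate and its member names are assumptions; the numeric IDs are taken from this appendix.

```python
from enum import IntEnum


class Aggregate(IntEnum):
    """Named constants for a subset of the aggregate IDs listed above."""
    STATUS = 1
    HTTP_STATUS = 2
    RESPONSE_TIME = 3
    HINTS = 6
    LINKS = 7
    HREFLANG_GROUPS = 52


# Use .value explicitly so the numeric ID ends up in the path.
path = f"/2.0/histories/{{id}}/clusters/!root/{Aggregate.HTTP_STATUS.value}"
print(path)  # /2.0/histories/{id}/clusters/!root/2
```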