Pages in Audisto API 2.0

The Pages API retrieves data about URLs found during a crawl. The general path is:

/crawls/{id_crawl}/pages/

It returns a chunked list of Page objects.

A single page can be retrieved using the ID:

/crawls/{id_crawl}/pages/{page_id}

It returns a single Page object.
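As a minimal sketch, the two paths above can be built with small helpers. The function names are hypothetical, and the API's base URL and authentication are not covered in this section, so only the path portion is constructed.

```python
def pages_path(id_crawl: int) -> str:
    """Path of the chunked Page list for one crawl."""
    return f"/crawls/{id_crawl}/pages/"


def page_path(id_crawl: int, page_id: int) -> str:
    """Path of a single Page object within a crawl."""
    return f"{pages_path(id_crawl)}{page_id}"


print(pages_path(42))       # /crawls/42/pages/
print(page_path(42, 1337))  # /crawls/42/pages/1337
```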

Return Format

The JSON format returned consists of the following members:

  • id: ID of the URL as unsigned integer
  • url: The URL as a string. Note that URLs can be up to 60,000 characters long.
  • level: The URL's bot level
  • status: The URL's status (Crawled, Unprocessed, etc.) as enumeration
  • error: Error code as enumeration, usually "None"
  • download_time: Date and time the URL was last downloaded in ISO 8601 notation. May also be empty if the URL was not yet downloaded.
  • user_level: The URL's user level. This field is omitted if the page has no user level.

The JSON format is not fixed; it may be extended whenever we add new features. It will, however, always remain backward compatible.
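Because the format may grow and some members can be empty or omitted, a client is probably best off reading only the members it knows and treating the rest as optional. The sketch below illustrates that idea. The class and function names are hypothetical, the sample JSON is invented for illustration, and the integer types for level and user_level are an assumption, since this section does not state them.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Page:
    id: int
    url: str
    level: int                          # assumed to be an integer
    status: str                         # enumeration, e.g. "Crawled"
    error: str                          # enumeration, usually "None"
    download_time: Optional[datetime]   # None if not yet downloaded
    user_level: Optional[int]           # None if the member is omitted


def parse_page(raw: dict) -> Page:
    # download_time may be empty if the URL was not yet downloaded.
    stamp = raw.get("download_time")
    if stamp:
        # Older Python versions reject a trailing "Z", so normalize it.
        downloaded = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    else:
        downloaded = None
    # Only known members are read, so members added later are ignored.
    return Page(
        id=raw["id"],
        url=raw["url"],
        level=raw["level"],
        status=raw["status"],
        error=raw["error"],
        download_time=downloaded,
        user_level=raw.get("user_level"),
    )


# Invented sample, including a member this client does not know about.
sample = json.loads("""{
  "id": 7, "url": "https://example.com/", "level": 0,
  "status": "Crawled", "error": "None",
  "download_time": "2020-01-02T03:04:05Z",
  "some_future_member": true
}""")
page = parse_page(sample)
print(page.status, page.user_level)  # Crawled None
```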

Deep format

If a deep response is requested, the JSON is extended by additional members.

Properties

The member properties contains an object of extended properties and their respective values. The type of the value depends on the property.

Possible properties include:

  • http_status: HTTP status of URL if any
  • response_time: Milliseconds spent downloading the URL, if any
  • language: Language of document, if any
  • charset: Charset of document, if any
  • content_type: Content type of document, if any
  • indexable: Indexability of document
  • isolation_level: Isolation level of URL
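Since most properties are described as present only "if any", a lookup that tolerates missing keys seems prudent. The following is a minimal sketch; the helper name is hypothetical, the sample object is invented, and the integer type of http_status is an assumption.

```python
def property_of(page: dict, name: str, default=None):
    """Return one extended property, or `default` if it is absent."""
    return page.get("properties", {}).get(name, default)


# Invented sample of a deep Page object with only some properties set.
deep_page = {
    "id": 7,
    "properties": {"http_status": 200, "language": "en"},
}

print(property_of(deep_page, "http_status"))          # 200
print(property_of(deep_page, "charset", "unknown"))   # unknown
```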

Hints

The member hints contains an array of all hints triggered by this URL. Each hint is an enumeration.

Ranks

The member ranks contains an object with all the ranks calculated:

  • page_rank: The PageRank
  • page_rank_index: The PageRank index
  • chei_rank: The CheiRank
  • chei_rank_index: The CheiRank index
  • two_d_rank_index: The overall 2D-Rank index
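One possible use of the ranks member is ordering pages by PageRank on the client. This sketch assumes that page_rank is numeric, which this section does not state, and it skips pages without a ranks member. The function name and sample data are hypothetical.

```python
def top_by_page_rank(pages: list, limit: int = 10) -> list:
    """Return up to `limit` pages, highest page_rank first."""
    ranked = [p for p in pages if "page_rank" in p.get("ranks", {})]
    ranked.sort(key=lambda p: p["ranks"]["page_rank"], reverse=True)
    return ranked[:limit]


# Invented sample data; page 3 has no ranks and is ignored.
sample_pages = [
    {"id": 1, "ranks": {"page_rank": 0.2}},
    {"id": 2, "ranks": {"page_rank": 0.5}},
    {"id": 3},
]
print([p["id"] for p in top_by_page_rank(sample_pages, limit=2)])  # [2, 1]
```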

Clusters

The member clusters contains an array of all the clusters the URL is part of.

Errors

The member errors contains an array of all download attempts that failed, together with the time and the error that occurred.

Host

The member host contains an object with information about the host of the URL.

Duplicate Content

The member duplicate_content contains an array of the duplicate content groups the URL is part of. Each group has the following members:

  • id: ID of duplicate content group
  • source: Source of duplicate, like title or body. An enumeration.
  • counter: Number of elements inside this group.
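To illustrate the group structure, the sketch below collects the group IDs of one page by their source. The function name is hypothetical and the sample data is invented; in particular, the exact strings of the source enumeration ("title", "body") are assumed from the description above.

```python
from collections import defaultdict


def group_ids_by_source(page: dict) -> dict:
    """Map each duplicate source to the IDs of the groups found for it."""
    by_source = defaultdict(list)
    for group in page.get("duplicate_content", []):
        by_source[group["source"]].append(group["id"])
    return dict(by_source)


# Invented sample of a deep Page object.
sample_page = {
    "id": 7,
    "duplicate_content": [
        {"id": 11, "source": "title", "counter": 4},
        {"id": 12, "source": "body", "counter": 2},
        {"id": 13, "source": "title", "counter": 9},
    ],
}
print(group_ids_by_source(sample_page))
# {'title': [11, 13], 'body': [12]}
```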

Hreflang

The member hreflang_members contains an array of hreflang-related information about this URL.

Filtering

URLs can be filtered.

The following filters are supported:

  • The URL's core properties: status, level, url, error, user_level
  • The host by ID: host
  • All known extended properties, e.g. http_status, language
  • Containing cluster by ID: cluster
  • Hint by ID: hint
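This section does not describe how filters are written in a request, so the following sketch does not build one. Instead it mirrors the semantics of the core and extended property filters on the client, against Page objects that were already retrieved in deep format. The function name and sample data are hypothetical, and exact-match comparison is an assumption.

```python
CORE_FILTERS = ("status", "level", "url", "error", "user_level")


def matches(page: dict, **criteria) -> bool:
    """True if every criterion equals the page's value for that name."""
    props = page.get("properties", {})
    for name, wanted in criteria.items():
        # Core properties live on the Page itself, the rest in properties.
        actual = page.get(name) if name in CORE_FILTERS else props.get(name)
        if actual != wanted:
            return False
    return True


# Invented sample data.
pages = [
    {"id": 1, "status": "Crawled", "level": 0, "properties": {"http_status": 200}},
    {"id": 2, "status": "Crawled", "level": 1, "properties": {"http_status": 404}},
    {"id": 3, "status": "Unprocessed", "level": 1},
]
hits = [p["id"] for p in pages if matches(p, status="Crawled", http_status=404)]
print(hits)  # [2]
```

Filtering on the server is preferable for large crawls, since it avoids transferring pages that are then discarded.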