Pages in Audisto API 2.0
The Pages API retrieves data about URLs found during a crawl. The general path is:
/2.0/crawls/{id_crawl}/pages/
It returns a chunked list of page objects.
A single page can be retrieved using the ID:
/2.0/crawls/{id_crawl}/pages/{page_id}
It returns a single page object.
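The two paths above can be built with a small helper. This is a minimal sketch: the function name `pages_path` is hypothetical, and the host name and authentication are not covered here, so only the path portion is constructed.

```python
from typing import Optional


def pages_path(crawl_id: int, page_id: Optional[int] = None) -> str:
    """Build the Pages API path for a crawl, or for one page within it."""
    base = f"/2.0/crawls/{crawl_id}/pages/"
    # The single-page path shown in the text carries no trailing slash.
    return base if page_id is None else f"{base}{page_id}"
```

Calling `pages_path(12)` yields the list path, and `pages_path(12, 345)` yields the path of a single page.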
Return Format
The JSON format returned consists of the following members:
id
: ID of the URL as an unsigned integer
url
: The URL as a string. Attention: URLs can be up to 60,000 characters long
level
: The URL's bot level
status
: The URL's status (Crawled, Unprocessed, etc.) as an enumeration
error
: Error code as an enumeration, usually "None"
download_time
: Date and time the URL was last downloaded, in ISO 8601 notation. May be empty if the URL has not been downloaded yet.
user_level
: The URL's user level. This field is omitted if the page has no user level.
The JSON format is not fixed and may be extended at any time as new features are added. It will, however, always remain backward compatible.
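Because the format may gain members over time, a client is safest when it reads only the members it knows and ignores the rest. The following is a minimal sketch of such a reader. The names `Page` and `parse_page` are hypothetical, the type of `level` and `user_level` is left open because the text does not state it, and `download_time` is kept as the raw string rather than parsed.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Page:
    """Core members of a page object as listed above."""
    id: int
    url: str
    level: Any                      # type not stated in the text
    status: str
    error: str
    download_time: Optional[str]    # raw ISO 8601 string, or None if empty
    user_level: Optional[Any]       # None when the field is omitted


def parse_page(obj: Dict[str, Any]) -> Page:
    """Pick the documented members and ignore any members added later."""
    return Page(
        id=int(obj["id"]),
        url=obj["url"],
        level=obj["level"],
        status=obj["status"],
        error=obj["error"],
        # An empty or missing value means the URL was not downloaded yet.
        download_time=obj.get("download_time") or None,
        # user_level is omitted when the page has no user level.
        user_level=obj.get("user_level"),
    )
```

Unknown members in the input are simply not read, so a response extended by new features still parses.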
Deep format
If a deep response is requested, the JSON is extended by additional members.
Properties
The member properties contains an object of extended properties and their respective values. The type of the value depends on the property.
Possible properties include:
http_status
: HTTP status of the URL, if any
response_time
: Milliseconds spent downloading the URL, if any
language
: Language of the document, if any
charset
: Charset of the document, if any
content_type
: Content type of the document, if any
indexable
: Indexability of the document
isolation_level
: Isolation level of the URL
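Since the properties member exists only in deep responses and most properties are present only "if any", lookups should tolerate missing keys. A minimal sketch, assuming the page object has already been decoded into a Python dict; the helper name `get_property` is hypothetical.

```python
from typing import Any, Dict


def get_property(page: Dict[str, Any], name: str, default: Any = None) -> Any:
    """Return one extended property of a deep page object, or a default."""
    # "properties" is only present in deep responses, and individual
    # properties such as http_status are only present "if any".
    properties = page.get("properties") or {}
    return properties.get(name, default)
```

For example, `get_property(page, "charset", "unknown")` returns the charset when it is present and `"unknown"` otherwise.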
Hints
The member hints contains an array of all hints triggered by this URL. Each hint is an enumeration.
Ranks
The member ranks contains an object with all the ranks calculated:
page_rank
: The PageRank
page_rank_index
: The PageRank index
chei_rank
: The CheiRank
chei_rank_index
: The CheiRank index
two_d_rank_index
: The overall 2D-Rank index
Clusters
The member clusters contains an array of all clusters the URL is part of.
Errors
The member errors contains an array of all download attempts that failed, together with the time and error that occurred.
Host
The member host contains an object with information about the host of the URL.
Duplicate Content
The member duplicate_content contains an array of duplicate content groups the URL is part of. Each group has the following members:
id
: ID of the duplicate content group
source
: Source of the duplicate, like title or body. An enumeration.
counter
: Number of elements inside this group.
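As an illustration of how the duplicate content groups might be consumed, the sketch below collects group IDs per source. It assumes a deep page object decoded into a Python dict; the function name `duplicate_groups_by_source` is hypothetical, and the exact spelling of the source values is not given in the text.

```python
from collections import defaultdict
from typing import Any, Dict, List


def duplicate_groups_by_source(page: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Map each duplicate source (e.g. title, body) to the group IDs found."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    # duplicate_content is only present in deep responses.
    for group in page.get("duplicate_content") or []:
        grouped[group["source"]].append(group["id"])
    return dict(grouped)
```

A page without the duplicate_content member yields an empty dict.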
Hreflang
The member hreflang_members contains an array of hreflang-related information about this URL.
Filtering
URLs can be filtered.
The following filters are supported:
- The URL's core properties: status, level, url, error, user_level
- The host by ID: host
- All known extended properties, e.g. http_status, language
- Containing cluster by ID: cluster
- Hint by ID: hint