The Pages API retrieves data about the URLs found during a crawl. The general path is
It returns a chunked list of page objects.
A single page can be retrieved using the ID:
It returns a single page object.
The JSON format returned consists of the following members:
id: ID of the URL as an unsigned integer
url: The URL as a string. Note: URLs can be up to 60,000 characters long.
level: The URL's bot level
status: The URL's status (Crawled, Unprocessed, etc.) as an enumeration
error: Error code as enumeration, usually "None"
download_time: Date and time the URL was last downloaded, in ISO 8601 notation. Empty if the URL has not been downloaded yet.
user_level: The URL's user level. This field is omitted if the page has no user level.
The JSON format is not fixed and may be extended at any time as new features are added. Extensions will, however, always be backward compatible.
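Because new members may appear at any time, a client is safest when it reads only the members it knows and ignores the rest. The following is a minimal Python sketch of that idea, not an official client. The `Page` class, the `parse_page` helper, and all sample values are hypothetical. It assumes that enumerations arrive as strings, that an empty `download_time` arrives as an empty string or null, and that `level` can be stored as received.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Mapping, Optional


@dataclass
class Page:
    """Hypothetical container for the documented core members."""
    id: int
    url: str
    level: Any                      # type not documented, kept as received
    status: str                     # assumption: enumeration sent as a string
    error: str                      # assumption: enumeration sent as a string
    download_time: Optional[datetime]
    user_level: Optional[Any]       # omitted when the page has no user level


def parse_page(obj: Mapping[str, Any]) -> Page:
    """Read the known members and ignore any the client does not know."""
    raw_time = obj.get("download_time")
    if raw_time:
        # Older Python versions reject a trailing "Z", so normalize it.
        parsed_time = datetime.fromisoformat(raw_time.replace("Z", "+00:00"))
    else:
        parsed_time = None          # URL was not downloaded yet
    return Page(
        id=int(obj["id"]),
        url=obj["url"],
        level=obj.get("level"),
        status=obj.get("status", ""),
        error=obj.get("error", "None"),
        download_time=parsed_time,
        user_level=obj.get("user_level"),
    )


# Made-up sample object; "some_future_member" stands in for a later extension.
sample = {
    "id": 42,
    "url": "https://example.com/",
    "level": 1,
    "status": "Crawled",
    "error": "None",
    "download_time": "2024-01-15T10:30:00Z",
    "some_future_member": True,
}
page = parse_page(sample)
print(page.url, page.status, page.user_level)
```

Using `obj.get` for optional members keeps the parser working both when `user_level` is omitted and when unknown members are added later.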
If a deep response is requested, the JSON is extended by additional members.
properties contains an object of extended properties and their respective values. The type of the value depends on the property.
Possible properties include:
http_status: HTTP status of URL if any
response_time: Milliseconds spent downloading the URL, if any
language: Language of document, if any
charset: Charset of document, if any
content_type: Content type of document, if any
indexable: Indexability of document
isolation_level: Isolation level of URL
hints contains an array of all hints triggered by this URL. Each hint is an enumeration.
ranks contains an object with all the ranks calculated:
page_rank: The PageRank
page_rank_index: The PageRank index
chei_rank: The CheiRank
chei_rank_index: The CheiRank index
two_d_rank_index: The overall 2D-Rank index
clusters contains an array of all the clusters the URL is part of.
errors contains an array of all download attempts that failed, together with the time and the error that occurred.
host contains an object with information about the host of the URL.
duplicate_content contains an array of the duplicate content groups the URL is part of. Each group has the following members:
id: ID of duplicate content group
source: Source of duplicate, like title or body. An enumeration.
counter: Number of elements inside this group.
hreflang_members contains an array of hreflang-related information about this URL.
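To illustrate how the additional members of a deep response fit together, here is a small Python sketch that condenses one deep page object into a flat summary. The `summarize_deep_page` helper and the sample values, including the hint names, are hypothetical. The sketch assumes that any of the deep members may be missing and treats a missing member as empty.

```python
from typing import Any, Dict, Mapping


def summarize_deep_page(page: Mapping[str, Any]) -> Dict[str, Any]:
    """Collect a few deep-response members into one flat dictionary."""
    props = page.get("properties") or {}
    ranks = page.get("ranks") or {}
    return {
        "url": page.get("url"),
        "http_status": props.get("http_status"),
        "content_type": props.get("content_type"),
        "page_rank": ranks.get("page_rank"),
        "hint_count": len(page.get("hints") or []),
        "cluster_count": len(page.get("clusters") or []),
        # One (id, source, counter) tuple per duplicate content group.
        "duplicate_groups": [
            (g.get("id"), g.get("source"), g.get("counter"))
            for g in page.get("duplicate_content") or []
        ],
    }


# Made-up deep page object for illustration only.
sample_deep = {
    "id": 42,
    "url": "https://example.com/",
    "properties": {"http_status": 200, "content_type": "text/html"},
    "hints": ["ExampleHintA", "ExampleHintB"],
    "ranks": {"page_rank": 0.0123, "chei_rank": 0.0045},
    "clusters": [],
    "duplicate_content": [{"id": 7, "source": "title", "counter": 3}],
}
summary = summarize_deep_page(sample_deep)
print(summary)
```

Since the type of each property value depends on the property, the sketch passes values through unchanged instead of converting them.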
URLs can be filtered.
The following filters are supported:
- The URL's core properties:
- The host by ID:
- All known extended properties, e.g.
- Containing cluster by ID:
- Hint by ID: