Validate XML Sitemaps Using the Audisto Crawler

Check your sitemap implementation for inconsistencies

With our crawler, we offer an XML sitemap tester, checker and validator that can validate XML sitemap index files and XML sitemap files added to a crawl, even in large multi-domain environments with hreflang.

About XML Sitemaps

XML sitemaps allow search engines to discover pages on your site more easily. With additional metadata about each URL (e.g. when it was last updated), search engines can also recrawl pages faster when they change.

The Sitemaps XML format is widely adopted and is used by most major search engines.

Discovering XML Sitemaps

Our crawler will not discover sitemaps by itself. However, if it encounters a sitemap, it will recognize it, parse it and extract all links.

To be recognized as a sitemap, the HTTP response

  • must contain a content-type HTTP header of "application/xml" or "text/xml"
  • must contain valid sitemap markup
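The two conditions above can be sketched in Python as follows. This is an illustration only, not the crawler's actual implementation: the function name looks_like_sitemap is hypothetical, "valid sitemap markup" is approximated by checking the root element and its namespace, and ignoring header parameters such as a charset is an assumption.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ACCEPTED_TYPES = {"application/xml", "text/xml"}


def looks_like_sitemap(content_type: str, body: bytes) -> bool:
    """Rough check: accepted content type plus a sitemap root element."""
    # Assumption: parameters such as "; charset=UTF-8" are ignored.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ACCEPTED_TYPES:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML, so it cannot be a sitemap.
        return False
    return root.tag in (
        f"{{{SITEMAP_NS}}}urlset",
        f"{{{SITEMAP_NS}}}sitemapindex",
    )
```

A response served as "text/html" would be rejected by this sketch even if its body were a perfectly valid sitemap.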

Unless your sitemap is linked from one of your HTML pages, which it should not be, the only way to make our crawler discover your sitemaps is to set them as the starting point or as additional starting points.

Additional starting points can be configured on the "Essentials" settings tab when creating a crawl or configuring a project.

We recommend adding sitemaps as additional starting points, since these do not have a user level by default.

How We Treat XML Sitemaps

Our parser understands both XML sitemaps and XML sitemap index files. It does not understand RSS feeds, ATOM feeds or text sitemaps.

We require that

  • the XML of a sitemap is valid
  • the XML follows the structure that is given in the Sitemaps XML format specification
  • the XML does not contain more than 50,000 URLs (<loc> tags)
  • the file is less than 50 MiB in size

Any violation of these requirements leads to the error "XML Sitemap: Error Parsing Content". If the reason for this error is not immediately obvious, use a sitemap validator that also validates against the schema, like Webmaster World.
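To pre-check a file before adding it to a crawl, the requirements can be approximated with a short Python sketch. The function name check_sitemap and the wording of the messages are hypothetical. The sketch only tests well-formedness, the root element, the number of <loc> tags and the size; it does not validate against the full schema.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MiB


def check_sitemap(body: bytes) -> list[str]:
    """Return the violated requirements (empty list if none were found)."""
    if len(body) >= MAX_BYTES:
        return ["file is not smaller than 50 MiB"]
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return [f"XML is not well-formed: {exc}"]

    problems = []
    expected_roots = (
        f"{{{SITEMAP_NS}}}urlset",
        f"{{{SITEMAP_NS}}}sitemapindex",
    )
    if root.tag not in expected_roots:
        problems.append("unexpected root element")

    # Count every <loc> in the sitemap namespace, as the limit refers to <loc> tags.
    loc_count = sum(1 for _ in root.iter(f"{{{SITEMAP_NS}}}loc"))
    if loc_count > MAX_URLS:
        problems.append(f"{loc_count} <loc> tags exceed the limit of {MAX_URLS}")
    return problems
```

An empty result does not guarantee that the crawler accepts the file, since structural rules of the specification are not covered here.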

We tolerate that

  • the URLs given by the sitemaps are invalid - such URLs are simply ignored. Note, however, that all URLs must be absolute
  • URLs point outside the authority of the sitemap file (e.g. to other domains or directories)
  • a sitemap index file references other sitemap index files

We usually indicate potential issues through corresponding hints.
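The rule that invalid URLs are ignored and that all URLs must be absolute could look roughly like this in Python. The helper names are hypothetical, and restricting the check to http and https is an assumption made for the sketch, not a documented rule of the crawler.

```python
from urllib.parse import urlsplit


def is_absolute_url(url: str) -> bool:
    """True if the URL has an http(s) scheme and a host part."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def usable_urls(locs: list[str]) -> list[str]:
    """Keep absolute URLs and silently drop everything else."""
    return [loc.strip() for loc in locs if is_absolute_url(loc)]
```

Relative values such as "/page.html" or "example.com/page.html" are dropped by this filter, which mirrors the "simply ignored" behavior described above.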

The discovery of links from XML sitemaps can be enabled and disabled in the "Links" section under the "Advanced" tab of the crawl configuration.

Links found in a sitemap are created like regular links from HTML. They do not transfer user level (because they are for bots, not humans), but they do transfer bot level. Do not expect meaningful bot level data when using sitemaps.

In the isolation level analysis, a link from a sitemap to an index page always sets the target URL to be "Reachable".

How We Support XML Sitemap Extensions

We support the following extensions to the sitemap protocol:

We require that

  • required elements for video and image sitemaps exist
  • a video sitemap has at least one of <video:content_loc> or <video:player_loc>
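The second requirement can be illustrated with a small Python check. The function name video_has_location is hypothetical, and treating an empty element as missing is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"


def video_has_location(video: ET.Element) -> bool:
    """True if a <video:video> element carries a content or player location."""
    for name in ("content_loc", "player_loc"):
        elem = video.find(f"{{{VIDEO_NS}}}{name}")
        # Assumption: an element without text counts as missing.
        if elem is not None and (elem.text or "").strip():
            return True
    return False
```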

We tolerate that

  • the URLs given by the extensions are invalid - such URLs are simply ignored. Note, however, that all URLs must be absolute
  • URLs point outside the authority of the sitemap file (e.g. to other domains or directories)
  • links given by <xhtml:link> are not hreflang links (rel is not "alternate" or hreflang is not set). Such links are ignored, however.

Links found through an extension are created as links between the URL found in the corresponding <loc> and the URL given by the extension.

For example, the following code would create a link between https://example.com/ and the image at https://example.com/image.jpg:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/</loc>
    <image:image>
      <image:loc>https://example.com/image.jpg</image:loc>
    </image:image>
  </url>
</urlset>

If not previously known, the URLs would be created at the sitemap's bot level plus one. They will not get a user level at this point.

No link between the image and the sitemap itself is created.
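The pairing of <loc> and <image:loc> described above can be reproduced with a few lines of Python. This is a sketch under assumptions, not the crawler's code: the function name image_links is hypothetical, and only the image extension is handled.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}


def image_links(sitemap_xml: bytes) -> list[tuple[str, str]]:
    """Return (page URL, image URL) pairs, one per <image:loc>."""
    root = ET.fromstring(sitemap_xml)
    links = []
    for url in root.findall("sm:url", NS):
        # The source of each link is the <loc> of the enclosing <url>.
        source = url.findtext("sm:loc", default="", namespaces=NS).strip()
        for image_loc in url.findall("image:image/image:loc", NS):
            target = (image_loc.text or "").strip()
            if source and target:
                links.append((source, target))
    return links
```

For the sitemap shown above, the function returns a single pair linking https://example.com/ to https://example.com/image.jpg. The sitemap URL itself never appears as a source.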

Links extracted from sitemap extensions are shown in a separate table on the URL report of the sitemap and the live analysis link tab.

URL rewriting applies to both the source and the target of this kind of link.

Links extracted from <xhtml:link> are used for hreflang analysis as if they were defined on the HTML page itself.
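As an illustration of which <xhtml:link> entries count as hreflang links, the following Python sketch keeps only links with rel="alternate" and a non-empty hreflang attribute. The function name hreflang_links is hypothetical, and comparing rel as a single lower-cased token is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def hreflang_links(sitemap_xml: bytes) -> list[tuple[str, str, str]]:
    """Return (page URL, hreflang value, alternate URL) triples."""
    root = ET.fromstring(sitemap_xml)
    result = []
    for url in root.findall("sm:url", NS):
        source = url.findtext("sm:loc", default="", namespaces=NS).strip()
        for link in url.findall("xhtml:link", NS):
            rel = link.get("rel", "").strip().lower()
            hreflang = link.get("hreflang", "").strip()
            href = link.get("href", "").strip()
            # Links that are not hreflang links are skipped, not treated as errors.
            if rel == "alternate" and hreflang and href:
                result.append((source, hreflang, href))
    return result
```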