Link Crawling

State what gets crawled

The Audisto Crawler can be configured to consider or ignore several kinds of references during crawling. This is done using the Links setting when configuring a project or a crawl.

Configuration dialog for links

The crawl always follows regular anchor links (the <a> tag) and redirects. Additionally, it can include resources, follow <link> tags, and include links from XML sitemaps.
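The relationship between the fixed behavior and the three optional settings can be sketched as a small model. This is only an illustration: LinkSettings, its field names, and the reference-kind labels are hypothetical and are not part of any Audisto API.

```python
from dataclasses import dataclass

# Hypothetical labels for the reference kinds described above.
ALWAYS_FOLLOWED = frozenset({"anchor", "redirect"})


@dataclass(frozen=True)
class LinkSettings:
    """Hypothetical model of the three optional Links settings."""
    include_resources: bool = False
    include_link_elements: bool = False
    include_sitemap_links: bool = False


def followed_reference_kinds(settings):
    """Return the set of reference kinds a crawl would consider."""
    kinds = set(ALWAYS_FOLLOWED)  # anchors and redirects cannot be disabled
    if settings.include_resources:
        kinds.add("resource")
    if settings.include_link_elements:
        kinds.add("link_element")
    if settings.include_sitemap_links:
        kinds.add("sitemap")
    return kinds
```

With all options left disabled, only anchor links and redirects remain in the returned set.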

Crawling Resources

If Include resources is enabled, the crawler will download images, CSS, JavaScript, and all other kinds of files linked through:

  • <img>: both the src and the srcset attributes are evaluated
  • <source>: both the src and the srcset attributes are evaluated
  • <script>
  • <frame>
  • <iframe>
  • <video>
  • <audio>
  • <object>
  • <link>, but only with a rel="stylesheet" attribute, which defines a CSS resource

This allows for site-wide checking of images, scripts, and other resources.
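To make the list above concrete, the following is a minimal sketch of how resource URLs could be collected from a page using Python's standard library. It is not the crawler's actual implementation. The tag-to-attribute mapping (for example data for <object>) is an assumption, since the list names only the tags, and the srcset handling is a naive comma split that would mis-parse URLs containing commas.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed mapping of tag -> URL-bearing attribute (the text lists tags only).
URL_ATTRIBUTE = {
    "img": "src", "source": "src", "script": "src", "frame": "src",
    "iframe": "src", "video": "src", "audio": "src", "object": "data",
}
SRCSET_TAGS = {"img", "source"}


def parse_srcset(value):
    """Return the URLs of a srcset value (naive split on commas)."""
    urls = []
    for candidate in value.split(","):
        parts = candidate.split()
        if parts:
            urls.append(parts[0])  # drop the width/density descriptor
    return urls


class ResourceCollector(HTMLParser):
    """Collects absolute resource URLs in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def _add(self, url):
        if url:
            self.resources.append(urljoin(self.base_url, url))

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in URL_ATTRIBUTE:
            self._add(attributes.get(URL_ATTRIBUTE[tag]))
            if tag in SRCSET_TAGS:
                for url in parse_srcset(attributes.get("srcset") or ""):
                    self._add(url)
        elif tag == "link":
            # Only stylesheet links count as resources.
            rel_tokens = (attributes.get("rel") or "").lower().split()
            if "stylesheet" in rel_tokens:
                self._add(attributes.get("href"))


def collect_resources(html, base_url):
    collector = ResourceCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.resources
```

Calling collect_resources(html, "https://example.com/dir/") returns the resolved URLs in document order. Anchor links and non-stylesheet <link> tags are skipped on purpose, because they are not resources.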

If Include <link> elements is enabled, the crawler will discover documents and resources linked through a <link> tag. This includes canonical links and hreflang links. The only exception is <link> tags with a rel="stylesheet" attribute, which are treated as resources.

Links are extracted from both:

  • HTML
  • HTTP headers
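As an illustration of the two sources, the sketch below extracts link references from <link> elements in HTML and from the value of an HTTP Link header. It is a simplified example, not the crawler's implementation. The function names are made up for this sketch, and the regular expressions cover only the common form of the header (for example, no whitespace around "=").

```python
import re
from html.parser import HTMLParser

# One link-value of a Link header: <url> followed by ;name=value parameters.
LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+(?:=(?:"[^"]*"|[^",;\s]+))?)*)')
PARAM = re.compile(r';\s*([\w-]+)(?:=(?:"([^"]*)"|([^",;\s]+)))?')


def is_stylesheet(rel):
    """Stylesheet links are treated as resources, so they are skipped here."""
    return "stylesheet" in rel.lower().split()


def links_from_header(header_value):
    """Return (url, rel, hreflang) tuples from a Link header value."""
    found = []
    for match in LINK_VALUE.finditer(header_value):
        params = {name.lower(): quoted or bare
                  for name, quoted, bare in PARAM.findall(match.group(2))}
        rel = params.get("rel", "")
        if not is_stylesheet(rel):
            found.append((match.group(1).strip(), rel.lower(),
                          params.get("hreflang", "")))
    return found


class _LinkElementParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = attributes.get("rel") or ""
        if href and not is_stylesheet(rel):
            self.found.append((href, rel.lower(),
                               attributes.get("hreflang") or ""))


def links_from_html(html):
    """Return (url, rel, hreflang) tuples from <link> elements."""
    parser = _LinkElementParser()
    parser.feed(html)
    parser.close()
    return parser.found
```

Both functions return the same tuple shape, so the results from the document body and from the response headers can be merged into one list of discovered references.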

Crawling XML Sitemaps

If Include links from XML sitemaps is enabled, the crawler will discover URLs and resources from XML sitemap index files and XML sitemap files.
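The difference between the two file types can be sketched as follows: a sitemap index lists further sitemap files, while a sitemap file lists page URLs. This example uses Python's xml.etree.ElementTree and the standard sitemaps.org namespace. The function name and the returned labels are assumptions for illustration, and the sketch reads only the <loc> entries, not any sitemap extensions.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, in ElementTree notation.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, locations) for a sitemap index or a sitemap file.

    kind is "index" when the locations point to further sitemaps,
    and "urlset" when they are the URLs to crawl.
    """
    root = ET.fromstring(xml_text)
    if root.tag == SITEMAP_NS + "sitemapindex":
        kind, entry = "index", "sitemap"
    elif root.tag == SITEMAP_NS + "urlset":
        kind, entry = "urlset", "url"
    else:
        raise ValueError(f"not a sitemap document: {root.tag}")

    locations = []
    for element in root.findall(SITEMAP_NS + entry):
        loc = element.findtext(SITEMAP_NS + "loc")
        if loc and loc.strip():
            locations.append(loc.strip())
    return kind, locations
```

A caller would typically fetch each location returned for an "index" and parse it again, collecting the URLs of every "urlset" along the way.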

Links from the following sitemap extensions are extracted as well:

If you are interested in less technical information, please read the dedicated page for our XML sitemap checker and validator.