URL rewriting - modify or ignore URLs found during a crawl

When we crawl large websites, we sometimes see structural issues so dominant that it is impossible to crawl the whole site without fixing them first. In such a situation, important architectural questions can no longer be analyzed, and it often takes quite some time until the issues get fixed.

To work around this kind of blocker, we recently introduced the URL rewriting feature, which is available in the Ultimate edition.

With URL rewriting it is possible to ignore URLs, get rid of session IDs, fix broken paths (e.g. missing trailing slashes), drop parameters, and much more.

Here is an example:

[Image: URL rewriting]

With the feature enabled, you can specify rules to rewrite URLs when you set up a crawl. The feature allows you to apply rules to the:

  • URL
  • Path
  • Query
  • Parameter
  • Filename

You can check whether these:

  • Contain a string
  • Start with a string
  • Match a regular expression

You can choose to:

  • Replace the match
  • Regex replace the match
  • Drop parameter
  • Append to path
  • Ignore the page
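
To make the combination of condition and action more concrete, here is a minimal Java sketch of how such rules could behave. It is not the product's implementation: the class `UrlRewriteSketch`, the `Rule` type, the helper methods, the `example.com` URLs and the `sessionid` parameter name are all hypothetical. It also assumes that rules run in order and that each rule sees the output of the previous one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch; names and rule ordering are assumptions, not the tool's API.
final class UrlRewriteSketch {

    /** One rule: a condition on the URL plus an action. An empty result means "ignore the page". */
    static final class Rule {
        final Predicate<String> condition;
        final Function<String, Optional<String>> action;

        Rule(Predicate<String> condition, Function<String, Optional<String>> action) {
            this.condition = condition;
            this.action = action;
        }
    }

    /** Index where the path ends: the first '?' or '#', or the end of the string. */
    static int pathEnd(String url) {
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') return i;
        }
        return url.length();
    }

    /** "Drop parameter": removes every query parameter with the given name. */
    static String dropParameter(String url, String name) {
        int q = pathEnd(url);
        if (q == url.length() || url.charAt(q) != '?') return url; // no query string
        int h = url.indexOf('#', q);
        String fragment = h < 0 ? "" : url.substring(h);
        String query = url.substring(q + 1, h < 0 ? url.length() : h);
        String kept = Arrays.stream(query.split("&"))
                .filter(p -> !p.isEmpty() && !p.equals(name) && !p.startsWith(name + "="))
                .collect(Collectors.joining("&"));
        return url.substring(0, q) + (kept.isEmpty() ? "" : "?" + kept) + fragment;
    }

    /** "Append to path": adds the suffix unless the path already ends with it. */
    static String appendToPath(String url, String suffix) {
        int end = pathEnd(url);
        String head = url.substring(0, end);
        return head.endsWith(suffix) ? url : head + suffix + url.substring(end);
    }

    /** Applies the rules in order; a rule that returns empty drops the URL from the crawl. */
    static Optional<String> apply(List<Rule> rules, String url) {
        String current = url;
        for (Rule rule : rules) {
            if (!rule.condition.test(current)) continue;
            Optional<String> next = rule.action.apply(current);
            if (!next.isPresent()) return Optional.empty();
            current = next.get();
        }
        return Optional.of(current);
    }

    /** Three example rules: ignore print pages, drop a session ID, add a trailing slash. */
    static List<Rule> exampleRules() {
        return Arrays.asList(
            new Rule(u -> u.contains("/print/"), u -> Optional.empty()),
            new Rule(u -> u.contains("sessionid="), u -> Optional.of(dropParameter(u, "sessionid"))),
            new Rule(u -> u.startsWith("https://example.com/shop"), u -> Optional.of(appendToPath(u, "/"))));
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        System.out.println(apply(rules, "https://example.com/shop/list?sessionid=abc123&page=2"));
        System.out.println(apply(rules, "https://example.com/print/page"));
    }
}
```

With these example rules, https://example.com/shop/list?sessionid=abc123&page=2 becomes https://example.com/shop/list/?page=2, while https://example.com/print/page is ignored.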

Most of the options are straightforward. For regular expression syntax, see the Java documentation.
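
As a rough illustration of the regular expression options, the following sketch uses the standard java.util.regex classes. The patterns, the class name `RegexRewriteSketch` and the example URLs are made up for this example. Whether the rule editor accepts `$1`-style group references in the replacement exactly as `Matcher.replaceAll` does is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of "Match a regular expression" + "Regex replace the match".
final class RegexRewriteSketch {

    // Matches a ";jsessionid=..." path parameter up to the next '?', '#' or the end of the URL.
    static final Pattern SESSION_ID =
            Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // Captures year and month from paths such as /archive/2014-03/.
    static final Pattern ARCHIVE_PATH = Pattern.compile("/archive/(\\d{4})-(\\d{2})/");

    /** Removes the session ID by replacing every match with an empty string. */
    static String stripSessionId(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }

    /** Rewrites /archive/YYYY-MM/ to /archive/YYYY/MM/ using the captured groups. */
    static String rewriteArchivePath(String url) {
        return ARCHIVE_PATH.matcher(url).replaceAll("/archive/$1/$2/");
    }

    public static void main(String[] args) {
        System.out.println(stripSessionId("https://example.com/cart;jsessionid=A1B2C3?step=2"));
        System.out.println(rewriteArchivePath("https://example.com/archive/2014-03/post"));
    }
}
```

The first call yields https://example.com/cart?step=2 and the second yields https://example.com/archive/2014/03/post.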

If you need any assistance with setting up the rules, you can have a look at our detailed help section about URL rewriting or just contact us.