When we crawl large websites we sometimes see structural issues so dominant that it is impossible to crawl the whole site without fixing them first. In such a situation important architectural questions can no longer be analyzed, and it often takes quite some time until the issues get fixed.
To work around this kind of blocker we recently introduced the URL-Rewriting feature, which is available in the Ultimate Edition.
With URL rewriting you can ignore URLs, get rid of session IDs, fix broken paths (e.g. missing trailing slashes), drop parameters, and much more.
Here is an example:
With the feature enabled you can specify rules to rewrite URLs when you set up a crawl. The feature allows you to apply rules on:
You can check if those rules work as intended before starting the crawl.
You can choose to:
Most of the options are straightforward. For the use of regular expressions, check the Java documentation.
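To illustrate the kinds of rewrites described above, here is a minimal sketch in Java using `java.util.regex`. The rule patterns and the helper class are hypothetical examples, not the crawler's actual rule syntax:

```java
import java.util.regex.Pattern;

public class UrlRewriteSketch {
    // Hypothetical rules illustrating typical rewrites; the actual
    // rule configuration in the crawler may look different.

    // Strip a ";jsessionid=..." path segment (a common Java session ID).
    static final Pattern SESSION_ID = Pattern.compile(";jsessionid=[^?#]*");

    // Drop tracking parameters such as utm_source (naive: assumes the
    // dropped parameter is the last one in the query string).
    static final Pattern UTM_PARAM = Pattern.compile("[?&]utm_[a-z]+=[^&#]*");

    static String rewrite(String url) {
        url = SESSION_ID.matcher(url).replaceAll("");
        url = UTM_PARAM.matcher(url).replaceAll("");
        // Fix a missing trailing slash on extension-less paths.
        if (!url.endsWith("/") && !url.matches(".*/[^/]+\\.[a-z]+$") && !url.contains("?")) {
            url = url + "/";
        }
        return url;
    }

    public static void main(String[] args) {
        // "https://example.com/shop;jsessionid=ABC123?utm_source=mail"
        // is normalized to "https://example.com/shop/"
        System.out.println(rewrite("https://example.com/shop;jsessionid=ABC123?utm_source=mail"));
    }
}
```

Each rule is just a pattern plus a replacement; a crawler applies them to every discovered URL before it is queued, so duplicates caused by session IDs or tracking parameters collapse into one canonical address.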