About URL Rewriting
URL Rewriting is a powerful tool that lets you:
- Ignore URLs
- Block URLs from crawling
- Get rid of a session ID
- Fix broken paths e.g. missing trailing slashes
- Drop parameters
- ...and much more.
Creating Rewrite Rules
URL Rewriting is done through a set of rules that can be entered when creating a crawl. To create a rewrite rule, navigate to the URL Rewrites tab.
Click the Add Rewrite button. This will add a new rewrite rule, which appears as a set of input elements.
You can click Add Rewrite as many times as you want to create arbitrarily many rules. Rules are later executed one after another, from top to bottom.
To delete a rule, click the Remove button right next to it.
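The top-to-bottom execution order can be pictured with a small Python sketch. It is not the crawler's actual implementation: it assumes that each rule receives the URL as rewritten by the rule above it, and the names apply_rules, force_https and add_trailing_slash are hypothetical.

```python
from typing import Callable, List

# Hypothetical model: a rule is a function that takes a URL string and
# returns the (possibly rewritten) URL string.
Rule = Callable[[str], str]

def apply_rules(url: str, rules: List[Rule]) -> str:
    """Apply the rules one after another, from top to bottom."""
    for rule in rules:
        url = rule(url)  # assumption: each rule sees the previous rule's output
    return url

def force_https(url: str) -> str:
    # Swap the scheme only if the URL starts with "http://".
    return "https://" + url[len("http://"):] if url.startswith("http://") else url

def add_trailing_slash(url: str) -> str:
    # Append a slash only if it is missing.
    return url if url.endswith("/") else url + "/"

rewritten = apply_rules("http://example.com/docs", [force_https, add_trailing_slash])
print(rewritten)  # https://example.com/docs/
```

Under this assumption the order of rules matters whenever an earlier rule changes the part of the URL that a later rule matches against.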
Configuring Rewrite Rules
A rewrite rule consists of three parts. These parts are:
- A scope. Only URL-based scopes are supported.
- A matcher. Only text-related matchers are supported.
- An action
These are the drop-down boxes you may have already noticed. Matchers and actions take parameters, so they are accompanied by a text input field. We will cover this later.
Additionally, a rewrite rule can be given a name and a description. This is optional. These are the two text input fields at the top.
Note that the name may not contain line breaks.
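To get a feeling for what a URL-based scope refers to, the following Python sketch splits a URL into the parts that the rules on this page match against (Scheme, Host, Path, Parameter). It uses Python's standard urllib.parse module purely as an illustration; the example URL is made up, and the crawler may split URLs differently in edge cases.

```python
from urllib.parse import urlsplit, parse_qsl

# Made-up example URL, used only for illustration.
parts = urlsplit("http://upload.example.com/users/42/profile?sid=abc&page=2")

scopes = {
    "Scheme": parts.scheme,        # "http"
    "Host": parts.hostname,        # "upload.example.com"
    "Path": parts.path,            # "/users/42/profile" (note the leading slash)
    # Each query parameter rendered as "name=value".
    "Parameter": [f"{k}={v}" for k, v in parse_qsl(parts.query)],
}

for name, value in scopes.items():
    print(name, "->", value)
```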
A filled rewrite rule may look like this:
If there is a match against the scope, the defined action is triggered. The following actions exist:
- Replace Match By: Replaces the matched part with some text.
- Regex Replace Match By: Allows back-referencing parts of a regular expression. This only works if the corresponding matcher is itself a regular expression.
- Drop Parameter: Removes the given parameter from the query. The name of the parameter to drop is always required as an argument, but using an asterisk (*) as a wildcard is supported.
- Append To Path: Adds something to the path.
- Ignore URL: The URL is completely ignored during the crawl. This action does not take an argument.
- Do Not Crawl: The URL will not be crawled, but still be listed within the crawl. This action does not take an argument.
The Regex Replace Match By action supports:
- $0 for the whole match
- $1, $2, etc. for referencing capture groups defined in the expression.
For example, the following rule would switch title and id inside a path:
RULE IS: IF Path Matches Regex """^(.*?)/([a-z-]+)-(\d+)$""" THEN Regex Replace Match By "$1/$3.$2"
Matches Regex can also be used with the Replace Match By action, which replaces the full match (that is, $0) with the given text.
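The back references can be tried out with Python's re module. This is only a sketch of the idea: the helper regex_replace is hypothetical, it translates the $n notation into Python's own \g<n> syntax, and the crawler's regex engine may differ from Python's in details. The sample path is made up.

```python
import re

def regex_replace(path: str, pattern: str, template: str) -> str:
    """Replace the first match of pattern in path, expanding $0, $1, $2, ..."""
    # Translate "$n" into Python's "\g<n>" back-reference syntax.
    py_template = re.sub(r"\$(\d+)", r"\\g<\1>", template)
    return re.sub(pattern, py_template, path, count=1)

# The rule from above: switch title and id inside a path.
swapped = regex_replace(
    "/blog/my-first-post-42",
    r"^(.*?)/([a-z-]+)-(\d+)$",
    "$1/$3.$2",
)
print(swapped)  # /blog/42.my-first-post

# $0 stands for the whole match.
wrapped = regex_replace("/a", r"a", "[$0]")
print(wrapped)  # /[a]
```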
The following examples may illustrate common use cases for URL Rewriting.
Remove Session ID in Query
If your application adds a session ID as part of the query (a parameter named sid in this example) and you want to get rid of it:
RULE IS: IF Parameter Starts With "sid=" THEN Drop Parameter "sid"
Note that matching against sid= (including the equals sign) prevents a match against other query parameters whose names merely start with sid.
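The effect of this rule can be reproduced in Python as a rough sketch. The helper drop_parameter, the example URL and the sidebar parameter are hypothetical; the crawler's own handling of encoding and parameter order may differ.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_parameter(url: str, name: str) -> str:
    """Remove every query parameter called name and keep the rest in order."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Made-up URL; "sidebar" is a hypothetical, similarly named parameter.
cleaned = drop_parameter("https://example.com/list?sid=abc123&page=2&sidebar=1", "sid")
print(cleaned)  # https://example.com/list?page=2&sidebar=1

# Why the matcher uses "sid=" rather than "sid":
print("sidebar=1".startswith("sid"))   # True: would also match the wrong parameter
print("sidebar=1".startswith("sid="))  # False: only the real session ID matches
```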
Always Force "https://"
If your application generates links using the http scheme, but the same content is also reachable via https and therefore shows up as duplicate content:
RULE IS: IF Scheme Equals "http" THEN Replace Match By "https"
If some host should be completely excluded from crawling, for example links to upload space used by forum users:
RULE IS: IF Host Equals "upload.example.com" THEN Ignore URL
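The two rules above share the same shape: a scope, an Equals matcher and an action. The following Python sketch models just that much. The Rule class and apply_rule function are hypothetical and cover only the Scheme and Host scopes and the two actions used here; returning None to signal an ignored URL is a choice made for this sketch, not the crawler's behaviour.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

@dataclass
class Rule:
    scope: str          # "Scheme" or "Host" in this sketch
    equals: str         # argument of the Equals matcher
    action: str         # "Replace Match By" or "Ignore URL"
    argument: str = ""  # action argument, if the action takes one

def apply_rule(url: str, rule: Rule) -> Optional[str]:
    """Return the rewritten URL, or None if the URL is ignored."""
    parts = urlsplit(url)
    field = {"Scheme": "scheme", "Host": "netloc"}[rule.scope]
    if getattr(parts, field) != rule.equals:
        return url  # no match: the URL passes through unchanged
    if rule.action == "Ignore URL":
        return None
    if rule.action == "Replace Match By":
        return urlunsplit(parts._replace(**{field: rule.argument}))
    raise ValueError(f"unsupported action: {rule.action}")

force_https = Rule("Scheme", "http", "Replace Match By", "https")
ignore_uploads = Rule("Host", "upload.example.com", "Ignore URL")

print(apply_rule("http://example.com/a", force_https))                 # https://example.com/a
print(apply_rule("https://upload.example.com/f.png", ignore_uploads))  # None
```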
When configuring a rule, there are some mistakes that are easy to make yet difficult to spot. We will list some of them here.
We are always there to help, so just drop us a line if you are unsure about a rule.
Path Always Starts With a Slash
When matching paths against Starts With or Equals, remember that a path always starts with a slash (/).
Wrong: IF Path Starts With "users" THEN Ignore URL
Correct: IF Path Starts With "/users" THEN Ignore URL
Drop Parameter Requires a Parameter Name
If you want to drop a query parameter, you need to name it. This is often forgotten while matching against a parameter.
Wrong: IF Parameter Starts With "session=" THEN Drop Parameter
Correct: IF Parameter Starts With "session=" THEN Drop Parameter "session"
Matchers such as Starts With or Equals do not support wildcards; you must use the Is Like matcher or regular expressions for that. Also note that an asterisk (*) does not represent a wildcard within a regular expression. You may use .*? for that.
Wrong: IF Path Starts With "/users/*/profile" THEN Ignore URL
Correct: IF Path Is Like "/users/*/profile*" THEN Ignore URL
Correct: IF Path Matches Regex """^/users/.*?/profile""" THEN Ignore URL
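The difference between the three variants can be checked in Python. This is a sketch under assumptions: Python's fnmatchcase is used as a stand-in for Is Like, on the assumption that Is Like performs glob-style matching, and the sample path is made up.

```python
import re
from fnmatch import fnmatchcase

path = "/users/42/profile/edit"  # made-up sample path

# Starts With compares literally, so "*" is just an ordinary character.
literal_hit = path.startswith("/users/*/profile")

# Glob-style matching (stand-in for Is Like): "*" matches any run of characters.
glob_hit = fnmatchcase(path, "/users/*/profile*")

# Regex: ".*?" plays the wildcard role.
regex_hit = re.search(r"^/users/.*?/profile", path) is not None

# In a regex, a bare "*" is a quantifier: "/*" means "zero or more slashes".
star_hit = re.search(r"^/users/*/profile", path) is not None

print(literal_hit, glob_hit, regex_hit, star_hit)  # False True True False
```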