URL Rewriting is a powerful tool to
URL Rewriting is done through a set of rules that can be entered when creating a crawl. To create a rewrite rule, navigate to the URL Rewrite tab.
Click the Add Rewrite Rule button. This will add a new rewrite rule, which appears as a set of input elements.
You can click Add Rewrite Rule as many times as you want to create arbitrarily many rules. Rules are later executed one after another, from top to bottom.
To delete a rule, click the Remove button right next to it.
A rewrite rule consists of three parts. These parts are:
These are the drop down boxes you may have already noticed. Matches and actions take parameters, so they are accompanied by a text input field. We will cover this later.
Additionally, a rewrite rule can be given a name and a description. This is optional. These are the two text input fields at the top.
Note that the name may not contain line breaks.
A filled rewrite rule may look like this:
If there is a match against the scope, the defined action is triggered. The following actions exist:
The Regex Replace Match By action supports $0 for the whole match and $1, $2, etc. for referencing capture groups defined in the expression.
For example, the following rule switches title and id inside a path:
RULE IS: If Path Matches Regex """^(.*?)/([a-z-]+)-(\d+)$""" Then Regex Replace Match By "$1/$3.$2"
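To see what this rule does to a concrete path, the following Python sketch mimics the substitution. It is not the crawler's implementation: the helper name regex_replace_match_by and the sample path are made up for illustration, and replacing only the first match is an assumption.

```python
import re

def regex_replace_match_by(value, pattern, template):
    """Replace the first match of `pattern` in `value`, expanding $0, $1, $2, ... in `template`."""
    # Translate "$n" references into Python's "\g<n>" group syntax.
    py_template = re.sub(r"\$(\d+)", lambda m: rf"\g<{m.group(1)}>", template)
    # Assumption: only the first match is replaced.
    return re.sub(pattern, py_template, value, count=1)

path = "/articles/my-title-123"  # hypothetical path
print(regex_replace_match_by(path, r"^(.*?)/([a-z-]+)-(\d+)$", "$1/$3.$2"))
# /articles/123.my-title
```

Here $1 captures "/articles", $2 the title "my-title" and $3 the id "123", so the replacement puts the id first and the title second.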
Matches Regex can also be used with the Replace Match By action, which replaces the full match (that is
The following examples illustrate common use cases for URL Rewriting.
If your application adds a session ID as part of the query and you want to get rid of it:
RULE IS: If Parameter Starts With "sid=" Then Drop Parameter "sid"
Note that using sid= (including the equals sign) for matching prevents a match against other query parameters whose names merely begin with sid.
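The effect of the trailing equals sign can be checked with a small Python sketch. The two helper functions and the example URLs are hypothetical, and treating each key=value pair of the query as one parameter is an assumption about how the Parameter scope works.

```python
from urllib.parse import urlsplit, urlunsplit

def parameter_starts_with(url, prefix):
    """True if any raw key=value pair of the query starts with `prefix`."""
    query = urlsplit(url).query
    return any(p.startswith(prefix) for p in query.split("&") if p)

def drop_parameter(url, name):
    """Remove every query parameter whose key equals `name`."""
    parts = urlsplit(url)
    kept = [p for p in parts.query.split("&") if p and p.split("=", 1)[0] != name]
    return urlunsplit(parts._replace(query="&".join(kept)))

url = "https://example.com/list?sid=abc123&page=2"  # hypothetical URL
if parameter_starts_with(url, "sid="):
    url = drop_parameter(url, "sid")
print(url)  # https://example.com/list?page=2

# Without the equals sign, an unrelated parameter would match as well.
print(parameter_starts_with("https://example.com/?sidebar=1", "sid"))   # True
print(parameter_starts_with("https://example.com/?sidebar=1", "sid="))  # False
```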
If your application generates links using both the HTTP and the HTTPS protocol, and this duplicates content:
RULE IS: If URL Starts With "http://" Then Replace Match By "https://"
If a host should be completely excluded from crawling, for example links to an upload space used by forum users:
RULE IS: If Host Equals "upload.example.com" Then Ignore URL
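As a rough illustration of how the protocol rule and the host rule behave when executed from top to bottom, consider the following Python sketch. The function rewrite and the convention of returning None for an ignored URL are assumptions made for this example, not part of the product.

```python
from urllib.parse import urlsplit

def rewrite(url):
    """Apply two illustrative rules in order; return None when the URL is ignored."""
    # Rule 1: If URL Starts With "http://" Then Replace Match By "https://"
    if url.startswith("http://"):
        url = "https://" + url[len("http://"):]
    # Rule 2: If Host Equals "upload.example.com" Then Ignore URL
    if urlsplit(url).hostname == "upload.example.com":
        return None
    return url

print(rewrite("http://www.example.com/a"))         # https://www.example.com/a
print(rewrite("http://upload.example.com/f.png"))  # None
```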
When configuring a rule, there are some mistakes that are easy to make yet difficult to spot. We list some of them here.
We are always there to help, so just drop us a line if you are unsure about a rule.
When matching paths against Starts With or Equals, remember a path always starts with a slash.
Wrong: If Path Starts With "users" Then Ignore URL
Correct: If Path Starts With "/users" Then Ignore URL
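The leading slash is easy to verify with Python's standard URL parser. The URL is a made-up example, and this only shows how a path looks in general, not how the crawler parses URLs internally.

```python
from urllib.parse import urlsplit

# Hypothetical URL; the path component includes the leading slash.
path = urlsplit("https://www.example.com/users/42/profile").path
print(path)                       # /users/42/profile
print(path.startswith("users"))   # False
print(path.startswith("/users"))  # True
```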
If you want to drop a query parameter, you need to name it. This is often forgotten while matching against a parameter.
Wrong: If Parameter Starts With "session=" Then Drop Parameter
Correct: If Parameter Starts With "session=" Then Drop Parameter "session"
Matchers do not support wildcards; you must use the Is Like matcher or regular expressions for that. Also note that an asterisk (*) does not represent a wildcard within a regular expression. You may use .*? for that.
Wrong: If Path Starts With "/users/*/profile" Then Ignore URL
Correct: If Path Is Like "/users/*/profile*" Then Ignore URL
Correct: If Path Matches Regex """^/users/.*?/profile""" Then Ignore URL
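The difference between the three variants can be sketched in Python. The is_like helper is hypothetical: it assumes that Is Like compares the whole value and that * stands for any sequence of characters, which may differ from the actual matcher.

```python
import re

def is_like(value, pattern):
    """Assumed Is Like semantics: '*' matches any run of characters, whole value must match."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value) is not None

path = "/users/42/profile"  # hypothetical path

# Starts With treats the asterisk as a literal character, so nothing matches.
print(path.startswith("/users/*/profile"))                  # False
print(is_like(path, "/users/*/profile*"))                   # True
print(re.search(r"^/users/.*?/profile", path) is not None)  # True
```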