URL rewriting is a powerful tool to
URL rewriting is done through a set of rules that can be entered when creating a crawl. To create a rewrite rule, first expand the advanced settings and scroll down to the URL Rewrite section.
Click the Add Rule button. This will add a new rule, which appears as a line of input elements.
You can click Add Rule as many times as you want to create arbitrarily many rules. Rules are later executed one after the other, from top to bottom.
Clicking the Remove button next to a rule deletes that rule.
A rewrite rule consists of three parts: a scope, a match, and an action.
These are the drop-down boxes you may have already noticed. Matches and actions take parameters, so they are accompanied by a text input field. We will cover this later.
Optionally, a rule can be given a name and a description. These are the two text input fields forming the second line.
Note that neither name nor description may contain line breaks.
Rewrite rules work against a URL. A URL itself consists of different parts, and scopes correspond to these parts.
By setting a scope, you define which portion of the URL is considered for matching later on.
Assuming a URL like http://www.example.com/some/path?a=b&c=d#top, the parts are (from left to right):
http - the protocol
www.example.com - the host
/some/path - the path; note the leading slash
a=b&c=d - the query, without the question mark
top - the fragment, without the hash
More generally, a URL is built like this:
Some parts may be omitted.
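To make the parts concrete, the example URL can be split with Python's standard library. This is only an illustration of the URL structure; it is not how the crawler itself parses URLs, and the attribute names (scheme, netloc, and so on) are Python's, not the names used in the rule editor.

```python
from urllib.parse import urlsplit

url = "http://www.example.com/some/path?a=b&c=d#top"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /some/path  (leading slash included)
print(parts.query)     # a=b&c=d     (no question mark)
print(parts.fragment)  # top         (no hash)
```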
Scopes correspond to these parts either directly or indirectly. The following scopes are available:
/some/path - note the leading slash
a=b&c=d - note the question mark is not part of the query
/some/path?a=b&c=d - note the question mark is part of the file name
c=d - note both the question mark and the ampersand are gone
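The scope values listed above can be derived from a URL roughly as follows. This is a minimal sketch: the function scope_values and the dictionary keys are hypothetical names chosen for illustration, not identifiers from the product.

```python
from urllib.parse import urlsplit

def scope_values(url):
    """Return illustrative scope values for a URL (names are assumptions)."""
    parts = urlsplit(url)
    # Path plus query, joined by the question mark.
    file_name = parts.path + ("?" + parts.query if parts.query else "")
    # Individual parameters, without "?" or "&".
    parameters = parts.query.split("&") if parts.query else []
    return {
        "path": parts.path,
        "query": parts.query,
        "file": file_name,
        "parameters": parameters,
    }

scopes = scope_values("http://www.example.com/some/path?a=b&c=d#top")
print(scopes["path"])        # /some/path
print(scopes["query"])       # a=b&c=d
print(scopes["file"])        # /some/path?a=b&c=d
print(scopes["parameters"])  # ['a=b', 'c=d']
```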
Once the scope has been set, it can be tested. This is done by matches, which are essentially string comparisons against the scope.
The following matches are available:
Each match can also be negated by the corresponding negative match:
When defining a match, especially when using Starts With or Ends With, be sure you know what the scope contains; see the Pitfalls section later on. Also keep in mind that regular expressions are powerful but may also be confusing. Don't hesitate to contact us if you have any questions. We are always eager to help.
There are many regular expression testers on the web; just search for "java regex tester". We particularly liked the one at http://java-regex-tester.appspot.com/.
Neither Contains nor Starts With supports wildcards! Use regular expressions for that.
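Since matches are essentially string comparisons, their behaviour can be sketched in a few lines. The table and the helper check_match below are hypothetical; in particular, we assume here that Matches Regex searches within the scope rather than requiring the whole scope to match, which the text does not state.

```python
import re

# Illustrative stand-ins for the match types named in the text.
MATCHES = {
    "Starts With": lambda scope, arg: scope.startswith(arg),
    "Ends With": lambda scope, arg: scope.endswith(arg),
    "Contains": lambda scope, arg: arg in scope,
    "Equals": lambda scope, arg: scope == arg,
    # Assumption: a regex match means the pattern is found somewhere in the scope.
    "Matches Regex": lambda scope, arg: re.search(arg, scope) is not None,
}

def check_match(name, scope, arg, negate=False):
    """Evaluate a match against a scope value; negate=True gives the negative match."""
    result = MATCHES[name](scope, arg)
    return not result if negate else result

print(check_match("Starts With", "/some/path", "/some"))        # True
print(check_match("Starts With", "/some/path", "some"))         # False
print(check_match("Contains", "/some/path", "me/pa"))           # True
print(check_match("Equals", "/some/path", "/some", negate=True))  # True
```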
If there is a match against the scope, the defined action is triggered. The following actions exist:
The Regex Replace Match By action supports
$0 for the whole match
$1, $2, etc. for referencing capture groups in the regex.
For example, the following rule would switch title and id inside a path:
If Path Matches Regex "^(.*?)/([a-z-]+)-(\d+)$" Then Replace Match By $1/$3.$2
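The effect of this rule can be tried out with a short script. Two caveats: Python writes group references as \1, \2 where the rule editor uses $1, $2, and the sample path /articles/my-first-post-42 is made up for illustration.

```python
import re

PATTERN = r"^(.*?)/([a-z-]+)-(\d+)$"
# Equivalent of "$1/$3.$2" in Python's replacement syntax.
REPLACEMENT = r"\1/\3.\2"

def rewrite_path(path):
    """Swap title and id in the last path segment; other paths are returned unchanged."""
    return re.sub(PATTERN, REPLACEMENT, path)

print(rewrite_path("/articles/my-first-post-42"))  # /articles/42.my-first-post
print(rewrite_path("/about"))                      # /about
```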
Matches Regex can also be used with the Replace Match By action, which replaces the full match.
The following examples illustrate common use cases for URL rewriting.
If your application adds a session ID as part of the query and you want to get rid of it:
If Parameter Starts With "sid=" Then Drop Parameter sid
Note that using sid= for matching, including the equals sign, prevents a match against other, similarly named query parameters.
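As a rough sketch of what dropping a parameter amounts to, the helper below removes one named parameter from the query. The function drop_parameter, the sample URLs, and the parameter sidebar are all hypothetical; the crawler's own behaviour may differ in details such as how an empty query is handled.

```python
from urllib.parse import urlsplit, urlunsplit

def drop_parameter(url, name):
    """Remove every 'name=...' parameter from the query of url."""
    parts = urlsplit(url)
    prefix = name + "="  # including "=" avoids matching e.g. "sidebar"
    kept = [p for p in parts.query.split("&") if p and not p.startswith(prefix)]
    return urlunsplit(parts._replace(query="&".join(kept)))

print(drop_parameter("http://www.example.com/list?sid=abc123&page=2", "sid"))
# http://www.example.com/list?page=2
print(drop_parameter("http://www.example.com/list?sidebar=1&sid=abc123", "sid"))
# http://www.example.com/list?sidebar=1
```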
If your application generates links using the HTTPS protocol, but this duplicates content:
If URL Starts With "https://" Then Replace Match By http://
If some host should be completely excluded from crawling, for example links to upload space used by forum users:
If Host Equals "upload.example.com" Then Ignore Page
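Because rules run from top to bottom, a later rule sees the result of an earlier one. The sketch below chains the HTTPS rule and the host rule in that order. The function apply_rules is hypothetical, and representing an ignored page as None is our own convention for this example.

```python
from urllib.parse import urlsplit

def apply_rules(url):
    """Apply two of the example rules in order; return None for an ignored page."""
    # Rule 1: If URL Starts With "https://" Then Replace Match By http://
    if url.startswith("https://"):
        url = "http://" + url[len("https://"):]
    # Rule 2: If Host Equals "upload.example.com" Then Ignore Page
    if urlsplit(url).hostname == "upload.example.com":
        return None
    return url

print(apply_rules("https://www.example.com/a"))         # http://www.example.com/a
print(apply_rules("https://upload.example.com/f.png"))  # None
```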
When configuring a rule, there are some mistakes that are easy to make yet difficult to spot. We will list some of them here.
We are always there to help, so just drop us a line if you are unsure about a rule.
When matching paths against Starts With or Equals, remember a path always starts with a slash.
Wrong: If Path Starts With "users" Then Ignore Page
Correct: If Path Starts With "/users" Then Ignore Page
If you want to drop a query parameter, you need to name it. This is often forgotten while matching against a parameter.
Wrong: If Parameter Starts With "session=" Then Drop Parameter
Correct: If Parameter Starts With "session=" Then Drop Parameter "session"
Matches do not support wildcards; you must use regular expressions for that. Also note that an asterisk (*) on its own does not represent a wildcard within a regular expression.
Wrong: If Path Starts With "/users/*/profile" Then Ignore Page
Correct: If Path Matches Regex "^/users/.*/profile" Then Ignore Page
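The difference can be checked quickly in Python, whose regex syntax agrees with Java's for these simple patterns. The path /users/alice/profile is a made-up example.

```python
import re

path = "/users/alice/profile"

# Plain string match: the asterisk is compared literally, so this fails.
literal = path.startswith("/users/*/profile")

# In a regex, "/*" means "zero or more slashes", not "anything".
bare_star = re.search(r"^/users/*/profile", path) is not None

# ".*" is the regex way to say "any sequence of characters".
dot_star = re.search(r"^/users/.*/profile", path) is not None

print(literal)    # False
print(bare_star)  # False
print(dot_star)   # True
```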