URL Rewriting

Detect, modify, or remove unwanted URLs while crawling.

About URL Rewriting

URL rewriting is a powerful tool to

  • Ignore URLs
  • Get rid of a session ID
  • Fix broken paths, e.g. missing trailing slashes
  • Drop parameters
  • ...and much more.

Creating Rewrite Rules

URL rewriting is done through a set of rules that can be entered when creating a crawl. To create a rewrite rule, first expand the advanced settings and scroll down to the URL Rewrite section.

The Add Rule button

Click the Add Rule button. This will add a new rule, which appears as a line of input elements.

You can click Add Rule as often as you like to create as many rules as you need. Rules are later executed one after the other, from top to bottom.

A line of input forming a rule

To delete a rule, click the Remove button next to it.

Configuring Rewrite Rules

A rewrite rule consists of three parts. These parts are:

  • A scope
  • A match
  • An action

These are the drop-down boxes you may have already noticed. Matches and actions take parameters, so each is accompanied by a text input field. We will cover this later.

Optionally, a rule can be given a name and a description. These are the two text input fields on the second line.

Note that neither the name nor the description may contain line breaks.

Scopes

Rewrite rules work against a URL. A URL itself consists of different parts. Scopes correspond to these parts.

By setting a scope, you define which portion of the URL is considered for matching later on.

Assuming a URL like http://www.example.com/some/path?a=b&c=d#top, the parts are (from left to right):

  • Protocol: http://
  • Host: www.example.com
  • Path: /some/path - note the leading slash
  • Query: a=b&c=d - without the question mark
  • Fragment: top - without the hash

More generally, a URL is built like this: [Protocol][Host][Path]{?}[Query]{#}[Fragment], where some parts may be omitted.

Scopes correspond to these parts either directly or indirectly. The following scopes are available:

  • URL: The URL as a whole - http://www.example.com/some/path?a=b&c=d#top
  • Host: The host - www.example.com
  • Path: The path - /some/path - note the leading slash
  • Query: The query - a=b&c=d - note the question mark is not part of the query
  • File Name: Both path and query - /some/path?a=b&c=d - note the question mark is part of the file name
  • Parameter: Single elements of the query - a=b and c=d - note that both the question mark and the ampersand are gone
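
As a rough illustration of how the scopes relate to each other, the following Python sketch splits the example URL using the standard library's urllib.parse. The function name split_scopes and the dictionary keys are made up for this illustration; how the crawler computes its scopes internally is not documented here and may differ in edge cases.

```python
from urllib.parse import urlsplit


def split_scopes(url):
    """Return the rewrite scopes of a URL as a dict (illustrative only)."""
    parts = urlsplit(url)
    # File Name is the path plus the query, joined by the question mark.
    file_name = parts.path + ("?" + parts.query if parts.query else "")
    return {
        "URL": url,
        "Host": parts.netloc,
        "Path": parts.path,
        "Query": parts.query,
        "File Name": file_name,
        # Each parameter is one name=value element of the query.
        "Parameter": parts.query.split("&") if parts.query else [],
    }


scopes = split_scopes("http://www.example.com/some/path?a=b&c=d#top")
for name, value in scopes.items():
    print(f"{name}: {value}")
```

Note that the Path value keeps its leading slash, which matters for Starts With and Equals matches.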

Matches

Once the scope has been set, it can be tested with a match. Matches are essentially string comparisons against the scope.

The following matches are available:

  • Contains: Match is successful if the scope contains the desired string
  • Starts With: Match is successful if the scope starts with the desired string
  • Ends With: Match is successful if the scope ends with the desired string
  • Equals: Match is successful if the scope equals the desired string in a case sensitive manner
  • Matches Regex: Match is successful if the scope matches the given regular expression. We support Java style regular expressions

Each match can also be negated by using the corresponding negative match:

  • Does Not Contain: Match is successful if the scope does not contain the desired string
  • Does Not Start With: Match is successful if the scope does not start with the desired string
  • Does Not End With: Match is successful if the scope does not end with the desired string
  • Does Not Equal: Match is successful if the scope does not equal the desired string in a case sensitive manner
  • Does Not Match Regex: Match is successful if the scope does not match the given regular expression.

When defining a match, especially with Starts With or Ends With, make sure you know what the scope contains; see the Common Pitfalls section below. Also keep in mind that regular expressions are powerful but can be confusing. Don't hesitate to contact us if you have any questions. We are always eager to help.

There are many regular expression testers on the web; just search for "java regex tester". We particularly liked the one at http://java-regex-tester.appspot.com/.

Contains and Starts With do not support wildcards! Use regular expressions instead.
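
To make the semantics of the matches more concrete, here is a minimal Python sketch. It is not the crawler's code: the MATCHES table and the matches function are hypothetical, all string comparisons are assumed to be case sensitive, and Matches Regex is assumed to succeed when the pattern is found anywhere in the scope. Python's re module is also not identical to Java's regular expressions, although the simple patterns used here behave the same in both.

```python
import re

# Hypothetical table mapping match names to predicates (scope, argument) -> bool.
MATCHES = {
    "Contains": lambda scope, arg: arg in scope,
    "Starts With": lambda scope, arg: scope.startswith(arg),
    "Ends With": lambda scope, arg: scope.endswith(arg),
    "Equals": lambda scope, arg: scope == arg,
    # Assumption: the regex may match anywhere in the scope (search semantics).
    "Matches Regex": lambda scope, arg: re.search(arg, scope) is not None,
}

# Each negative match is the logical negation of its positive counterpart.
NEGATED = {
    "Does Not Contain": "Contains",
    "Does Not Start With": "Starts With",
    "Does Not End With": "Ends With",
    "Does Not Equal": "Equals",
    "Does Not Match Regex": "Matches Regex",
}


def matches(name, scope, arg):
    """Evaluate a positive or negated match by its display name."""
    if name in NEGATED:
        return not MATCHES[NEGATED[name]](scope, arg)
    return MATCHES[name](scope, arg)


print(matches("Starts With", "/users/42/profile", "/users"))
print(matches("Does Not Contain", "/users/42/profile", "admin"))
print(matches("Matches Regex", "/users/42/profile", r"^/users/.*/profile"))
```

All three calls print True. Swapping "/users" for "users" in the first call makes it False, because the path scope begins with a slash.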

Actions

If there is a match against the scope, the defined action is triggered. The following actions exist:

  • Replace Match By: Replaces the matched part with some string. This does nothing if the match is negative, as in Does Not Contain.
  • Regex Replace Match By: Allows back-referencing parts of a regular expression - this only works if the corresponding match is itself a regular expression
  • Drop Parameter: Removes the given parameter from the query. The name of the parameter to drop is always required as an argument.
  • Append To Path: Adds something to the path.
  • Ignore Page: The URL is completely ignored during the crawl. This action does not take an argument.

The Regex Replace Match By action supports $0 for the whole match and $1, $2, etc. for referencing groups in the regular expression. For example, the following rule switches title and ID inside a path, turning /shop/shoes/black-sneaker-1543 into /shop/shoes/1543.black-sneaker:

If Path Matches Regex "^(.*?)/([a-z-]+)-(\d+)$" Then Regex Replace Match By $1/$3.$2

Matches Regex can also be used with the Replace Match By action, which replaces the full match (that is $0).
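
The back-reference rule above can be tried out with a small Python sketch. This only mimics the replacement step and is not how the crawler is implemented; the function name regex_replace is made up. Note that Python writes back-references as \1, \2, \3, while the rule syntax (Java style) uses $1, $2, $3.

```python
import re

# Pattern from the rule above:
# group 1 = leading path, group 2 = title, group 3 = numeric id.
PATTERN = re.compile(r"^(.*?)/([a-z-]+)-(\d+)$")


def regex_replace(path):
    # Equivalent of "$1/$3.$2" in the rule syntax.
    return PATTERN.sub(r"\1/\3.\2", path)


rewritten = regex_replace("/shop/shoes/black-sneaker-1543")
untouched = regex_replace("/shop/shoes")
print(rewritten)  # /shop/shoes/1543.black-sneaker
print(untouched)  # /shop/shoes (no match, so nothing is replaced)
```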

Examples

The following examples illustrate common use cases for URL rewriting.

Remove Session ID in Query

If your application adds a session ID as part of the query, like ?sid=SOME_HASH, and you want to get rid of it:

If Parameter Starts With "sid=" Then Drop Parameter sid

Note that matching on sid= (including the equals sign) prevents a match against similarly named query parameters such as sidir.
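
The effect of this rule on a query string can be sketched in Python as follows. The helper names drop_parameter and remove_session_id are hypothetical and the crawler's actual behavior may differ in edge cases; the point is only to show why matching on sid= leaves a parameter such as sidir alone.

```python
def drop_parameter(query, name):
    """Remove every name=value element whose name is exactly `name`."""
    kept = [p for p in query.split("&") if p.split("=", 1)[0] != name]
    return "&".join(kept)


def remove_session_id(query):
    # Rule: If Parameter Starts With "sid=" Then Drop Parameter sid
    if any(p.startswith("sid=") for p in query.split("&")):
        return drop_parameter(query, "sid")
    return query


print(remove_session_id("sid=abc123&page=2"))  # page=2
print(remove_session_id("sidir=up&page=2"))    # sidir=up&page=2
```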

Always Force "http://"

If your application generates links using the HTTPS protocol, but this results in duplicate content:

If URL Starts With "https://" Then Replace Match By http://

Ignore Host

If some host should be completely excluded from crawling, for example links to upload space used by forum users:

If Host Equals "upload.example.com" Then Ignore Page
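
Because rules are executed from top to bottom, the three examples can be thought of as a small pipeline. The following Python sketch models that under stated assumptions: the function names are invented, None is used to represent Ignore Page, and it is assumed that no further rules run once a page is ignored. It is an illustration of the idea, not the crawler's implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_sid(url):
    # If Parameter Starts With "sid=" Then Drop Parameter sid
    parts = urlsplit(url)
    params = [p for p in parts.query.split("&") if p and not p.startswith("sid=")]
    return urlunsplit(parts._replace(query="&".join(params)))


def force_http(url):
    # If URL Starts With "https://" Then Replace Match By http://
    if url.startswith("https://"):
        return "http://" + url[len("https://"):]
    return url


def ignore_upload_host(url):
    # If Host Equals "upload.example.com" Then Ignore Page
    return None if urlsplit(url).netloc == "upload.example.com" else url


RULES = [drop_sid, force_http, ignore_upload_host]  # executed top to bottom


def rewrite(url):
    """Apply all rules in order; None means the URL is ignored."""
    for rule in RULES:
        url = rule(url)
        if url is None:
            return None
    return url


print(rewrite("https://www.example.com/shop?sid=abc&page=2"))
print(rewrite("https://upload.example.com/files/1.png"))
```

The first call prints http://www.example.com/shop?page=2, the second prints None.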

Common Pitfalls

When configuring a rule, some mistakes are easy to make yet difficult to spot. We list some of them here.

We are always there to help, so just drop us a line if you are unsure about a rule.

Paths Always Start With a /

When matching a path with Starts With or Equals, remember that a path always starts with a slash.

Wrong:   If Path Starts With "users" Then Ignore Page
Correct: If Path Starts With "/users" Then Ignore Page

Drop Parameter Needs a Parameter Name

If you want to drop a query parameter, you need to name it. This is often forgotten while matching against a parameter.

Wrong:   If Parameter Starts With "session=" Then Drop Parameter
Correct: If Parameter Starts With "session=" Then Drop Parameter "session"

No Wildcards

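Matches do not support wildcards; you must use regular expressions for that. Also note that an asterisk (*) on its own is not a wildcard within a regular expression: it is a quantifier that repeats the preceding element, so the usual wildcard equivalent is .* (any character, repeated).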

Wrong:   If Path Starts With "/users/*/profile" Then Ignore Page
Correct: If Path Matches Regex "^/users/.*/profile" Then Ignore Page
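
The difference between a literal asterisk, a regex quantifier, and the .* idiom can be checked with a few lines of Python. This is only a sketch using Python's re module, which behaves like Java regular expressions for these simple patterns; the variable names are made up for the illustration.

```python
import re

path = "/users/42/profile"

# Starts With compares literally, so "*" would only match an actual asterisk.
literal_wildcard = path.startswith("/users/*/profile")

# In a regex, ".*" means "any character, repeated zero or more times".
regex_wildcard = re.search(r"^/users/.*/profile", path) is not None

# "*" alone is a quantifier for the preceding element:
# "/users*" means "/user" followed by zero or more "s".
quantifier_short = re.search(r"^/users*$", "/user") is not None
quantifier_long = re.search(r"^/users*$", "/users/42") is not None

print(literal_wildcard)   # False
print(regex_wildcard)     # True
print(quantifier_short)   # True
print(quantifier_long)    # False
```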