URL Rewriting

Detect, Modify or Remove Unwanted URLs During Crawling

About URL Rewriting

URL Rewriting is a powerful tool that lets you:

  • Ignore URLs
  • Get rid of a session ID
  • Fix broken paths, e.g. missing trailing slashes
  • Drop parameters
  • ...and much more.

Creating Rewrite Rules

URL Rewriting is done through a set of rules that can be entered when creating a crawl. To create a rewrite rule, navigate to the URL Rewrite tab.

The Add Rule button

Click the Add Rewrite Rule button. This will add a new rewrite rule, which appears as a set of input elements.

You can click Add Rewrite Rule as many times as you want to create arbitrarily many rules. Rules are later executed one after another, from top to bottom.

A set of inputs forming a rule

To delete a rule, click the Remove button right next to it.

Configuring Rewrite Rules

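A rewrite rule consists of three parts:

  • Scope: the part of the URL the rule inspects, e.g. URL, Host, Path or Parameter.
  • Match: how the scope is compared against a given value, e.g. Starts With, Equals or Matches Regex.
  • Action: what happens when the match succeeds, e.g. Drop Parameter or Ignore URL.

These are the drop-down boxes you may have already noticed. Matches and actions take parameters, so they are accompanied by a text input field. We will cover this later.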

Optionally, a rewrite rule can be given a name and a description. These are the two text input fields at the top.

Note that the name may not contain line breaks.

A filled rewrite rule may look like this:

A set of inputs forming a rule, filled in

Actions

If there is a match against the scope, the defined action is triggered. The following actions exist:

  • Replace Match By: Replaces the matched part with some text.
  • Regex Replace Match By: Allows back-referencing parts of a regular expression. This only works if the corresponding matcher is itself a regular expression.
  • Drop Parameter: Removes the given parameter from the query. The name of the parameter to drop is always required as an argument.
  • Append To Path: Adds something to the path.
  • Ignore URL: The URL is completely ignored during the crawl. This action does not take an argument.

The Regex Replace Match By action supports $0 for the whole match and $1, $2, etc. for referencing capture groups defined in the expression. For example, the following rule switches title and id inside a path, turning /shop/shoes/black-sneaker-1543 into /shop/shoes/1543.black-sneaker:

RULE IS:
  If Path Matches Regex "^(.*?)/([a-z-]+)-(\d+)$" Then
  Regex Replace Match By "$1/$3.$2"

Matches Regex can also be used with the Replace Match By action, which replaces the full match (that is $0).
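To try out an expression before entering it as a rule, the substitution can be reproduced outside the crawler. The following is a minimal Python sketch, not the crawler's own implementation: the helper regex_replace_match is a hypothetical name, it translates the $n references into Python's \g<n> syntax, and it assumes that the pattern behaves the same in Python's regex dialect and that only the first match is replaced.

```python
import re

def regex_replace_match(value: str, pattern: str, replacement: str) -> str:
    """Hypothetical helper: emulate 'Regex Replace Match By' on one URL part."""
    # Translate $0, $1, ... into Python's \g<0>, \g<1>, ... group references.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    # Assumption: a rule rewrites the first match only.
    return re.sub(pattern, py_replacement, value, count=1)

path = "/shop/shoes/black-sneaker-1543"
print(regex_replace_match(path, r"^(.*?)/([a-z-]+)-(\d+)$", "$1/$3.$2"))
# prints /shop/shoes/1543.black-sneaker
```

Here $1 captures /shop/shoes, $2 captures black-sneaker and $3 captures 1543, which is why the replacement "$1/$3.$2" swaps title and id.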

Examples

The following examples illustrate common use cases for URL Rewriting.

Remove Session ID in Query

If your application adds a session ID as part of the query, like ?sid=SOME_HASH, and you want to get rid of it:

RULE IS:
  If Parameter Starts With "sid=" Then
  Drop Parameter "sid"

Note that matching against sid= (including the equals sign) prevents a match against similarly named query parameters such as sidir.
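The effect of this rule, and the reason for including the equals sign, can be sketched in Python. This is an illustration under assumptions rather than the crawler's actual code: drop_parameter is a hypothetical helper, and the example URL is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def drop_parameter(url: str, name: str) -> str:
    """Hypothetical helper: remove every query parameter called `name`."""
    parts = urlsplit(url)
    # Compare complete parameter names, so "sid" never touches "sidir".
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# A prefix without "=" is too broad: it also matches the sidir parameter.
print("sidir=asc".startswith("sid"))   # prints True
print("sidir=asc".startswith("sid="))  # prints False

print(drop_parameter("https://example.com/list?sid=3f9a&sidir=asc", "sid"))
# prints https://example.com/list?sidir=asc
```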

Always Force "https://"

If your application generates links using the HTTP protocol, and this duplicates content that is also reachable via HTTPS:

RULE IS:
  If URL Starts With "http://" Then
  Replace Match By "https://"
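In plain Python terms, the rule behaves roughly like the following sketch. The function name force_https is hypothetical and only mirrors the combination of URL Starts With and Replace Match By shown above.

```python
def force_https(url: str) -> str:
    """Hypothetical helper: emulate 'URL Starts With' + 'Replace Match By'."""
    prefix = "http://"
    if url.startswith(prefix):
        # Only the matched prefix is replaced; the rest of the URL is kept.
        return "https://" + url[len(prefix):]
    return url  # no match, so the rule does not fire

print(force_https("http://example.com/shop"))   # prints https://example.com/shop
print(force_https("https://example.com/shop"))  # prints https://example.com/shop
```

Both variants end up as the same URL, so the page is no longer seen twice.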

Ignore Host

If some host should be completely excluded from crawling, for example links to upload space used by forum users:

RULE IS:
  If Host Equals "upload.example.com" Then
  Ignore URL
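As a rough illustration of what this rule does to a list of discovered links, here is a short Python sketch. The names IGNORED_HOSTS and is_ignored are hypothetical, the example URLs are made up, and it assumes that Host refers to the host name without a port.

```python
from urllib.parse import urlsplit

IGNORED_HOSTS = {"upload.example.com"}  # value taken from the rule above

def is_ignored(url: str) -> bool:
    """Hypothetical helper: emulate 'Host Equals' + 'Ignore URL'."""
    # urlsplit().hostname is lower-cased and excludes the port.
    return urlsplit(url).hostname in IGNORED_HOSTS

urls = [
    "https://example.com/forum/thread/1",
    "https://upload.example.com/files/avatar.png",
]
to_crawl = [u for u in urls if not is_ignored(u)]
print(to_crawl)  # prints ['https://example.com/forum/thread/1']
```

Because Equals compares the complete host, a different host such as www.example.com is not affected.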

Common Pitfalls

When configuring a rule, there are some mistakes that are easy to make yet difficult to spot. We list some of them here.

We are always there to help, so just drop us a line if you are unsure about a rule.

Path Always Starts With /

When matching paths with Starts With or Equals, remember that a path always starts with a slash.

Wrong: If Path Starts With "users" Then Ignore URL
Correct: If Path Starts With "/users" Then Ignore URL
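The leading slash is easy to verify with any URL parser. The following short Python check uses a made-up example URL and the standard library only:

```python
from urllib.parse import urlsplit

path = urlsplit("https://example.com/users/42").path
print(path)                       # prints /users/42
print(path.startswith("users"))   # prints False
print(path.startswith("/users"))  # prints True
```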

Drop Parameter Requires a Parameter Name

If you want to drop a query parameter, you need to name it. This is often forgotten while matching against a parameter.

Wrong: If Parameter Starts With "session=" Then Drop Parameter
Correct: If Parameter Starts With "session=" Then Drop Parameter "session"

No Wildcards

Matchers do not support wildcards; you must use regular expressions for that. Also note that an asterisk (*) on its own is not a wildcard within a regular expression: it is a quantifier that repeats the preceding element. Use .*? to match an arbitrary run of characters.

Wrong: If Path Starts With "/users/*/profile" Then Ignore URL
Correct: If Path Matches Regex "^/users/.*?/profile" Then Ignore URL
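The difference between the two rules can be reproduced with a small Python check. This is only a sketch: it assumes that Starts With is a literal text comparison and that the regular expression behaves as it does in Python, and the example path is made up.

```python
import re

path = "/users/42/profile"

# Starts With compares literal text, so "*" only matches an actual asterisk.
print(path.startswith("/users/*/profile"))            # prints False

# The regular expression uses .*? to stand in for the variable segment.
print(bool(re.search(r"^/users/.*?/profile", path)))  # prints True
```

Keep in mind that .*? can also span several path segments, so /users/42/settings/profile matches as well. If only a single segment should match, [^/]+ is a stricter alternative.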