How to write a good robots.txt

A robots.txt file acts as a gatekeeper for your website, letting some bots and web crawlers in and keeping others out. A poorly written robots.txt can make the site inaccessible to crawlers and may lead to a drop in traffic.

In this guide we point out some common issues:

Basic Setup

Writing a robots.txt can be very easy if you don't forbid crawling and treat all robots the same way. The following allows all robots to crawl the site without restrictions:

User-agent: *
Disallow:

Problems start when things get more complex: for example, when you address more than one robot, add comments, or use extensions like crawl-delay or wildcards. Not all robots understand everything, and this is where things get messy quickly.

Blank Lines

The draft describes the file format like this:

The format logically consists of a non-empty set or records, separated by blank lines. The records consist of a set of lines of the form: <Field> ":" <value>

This means records are divided by blank lines and you are not allowed to have blank lines within a record.

User-agent: *

Disallow: /

If you strictly apply the draft, this would be interpreted as two records. Both records are incomplete: the first one has no rules, the second one has no user-agent line. Both could be ignored, which would result in an effectively empty robots.txt, the same as:

User-agent: *
Disallow:

This is the complete opposite of what was intended. However, some robots like Googlebot take a different approach to parsing robots.txt files: they strip the empty lines and interpret the file the way the webmaster probably meant it:

User-agent: *
Disallow: /

To be safe, we highly recommend not using blank lines within a record. That way, more bots will interpret the robots.txt as it was meant.

On the other hand, it is also bad to omit the blank lines that separate records:

User-agent: a
Disallow: /path1/
User-agent: b
Disallow: /path2/

This is ambiguous. It could be interpreted as:

User-agent: a
Disallow: /path1/
Disallow: /path2/

or

User-agent: a
User-agent: b
Disallow: /path1/
Disallow: /path2/

The webmaster probably meant:

User-agent: a
Disallow: /path1/

User-agent: b
Disallow: /path2/

To be safe, we highly recommend splitting the records in this case.

Incomplete set of records

The draft describes a record like this:

The record starts with one or more User-agent lines, specifying which robots the record applies to, followed by "Disallow" and "Allow" instructions to that robot.

This means that a record consists of at least one User-agent line and at least one directive. Even if you want to allow crawling of your whole site, you should add an instruction.

Instead of using:

User-agent: *

you should use:

User-agent: *
Disallow:

Comments

A lot of poorly programmed robots have problems with comments. The draft allows comments at the end of a line and on separate lines.

# robots.txt version 1
User-agent: * # handle all bots
Disallow: /

There are several robots that totally mess up parsing this and handle it as:

User-agent: *
Disallow:

We highly recommend not having any comments in a robots.txt, so that it is handled correctly by a larger number of robots.
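For the example above, a comment-free version that keeps the same rules would simply be:

User-agent: *
Disallow: /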

Records with more than one user-agent

The draft allows you to address multiple user-agents in one record and apply one set of rules to all of them:

User-agent: bot1
User-agent: bot2
Disallow: /

For several bots this syntax is too complex, and we have seen it interpreted as:

User-agent: bot1
Disallow:

User-agent: bot2
Disallow: /

or, even worse, as:

User-agent: bot1
Disallow:

User-agent: bot2
Disallow:

We highly recommend addressing each bot with a separate record to increase compatibility with poorly programmed robots. If you address more than one robot with separate records, remember to use blank lines to separate the records, as shown below.
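For the example above, the safer equivalent with one record per bot looks like this:

User-agent: bot1
Disallow: /

User-agent: bot2
Disallow: /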

Redirects to another host or protocol

The draft is very specific about redirects:

On server response indicating Redirection (HTTP Status Code 3XX) a robot should follow the redirects until a resource can be found.

However, several robots have not implemented this at all. We highly recommend not redirecting the robots.txt to another location. You should have a separate robots.txt file on each host and for every protocol/port that is directly accessible, because that is its range of validity.
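For example, assuming a site that is reachable over several origins (the hostnames and port below are only placeholders), each of these locations needs its own file:

http://www.example.com/robots.txt
https://www.example.com/robots.txt
http://shop.example.com/robots.txt
http://www.example.com:8080/robots.txt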

Enhancements to the draft made by major search engines

In addition to the draft, the major search engines agreed on several enhancements, for example wildcards, crawl-delay and the possibility to refer to sitemaps.

The problem, of course, is that several robots do not support these enhancements and fail to handle them correctly:

User-agent: *
Disallow: /*/secret.html
Crawl-delay: 5

To prevent parsing errors, we highly recommend using enhancements only in records for bots that can handle them.
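One way to do this (a sketch, not a complete recipe) is to keep the extended syntax in a record for a bot that documents support for it, such as Googlebot, and give all other bots a record that only uses plain draft syntax:

User-agent: Googlebot
Disallow: /*/secret.html

User-agent: *
Disallow:

Keep in mind that the catch-all record above no longer blocks those URLs for other bots, so check the documentation of each bot you care about before deciding where to put which directive.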

Wildcard enhancements

As mentioned above, wildcards are supported by the major search engines. Whenever you want to block access to URLs with a specific suffix, there is no way around them. We often see rulesets like this one:

User-agent: *
Disallow: .doc

This does not block access to .doc files, because it is the same as:

User-agent: *
Disallow: /.doc

and this only blocks access to URLs starting with /.doc. To block access to all files ending with .doc you need to use:

User-agent: *
Disallow: /*.doc

When using wildcards to block parameters you should use something like this:

User-agent: *
Disallow: /*?parameter1
Disallow: /*&parameter1

otherwise the rules do not match URLs where the parameter is not the first one, such as:

http://www.example.com/?parameter2&parameter1

Order of records

The draft is very specific about the order in which records are applied:

These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive. If no such record exists, it should obey the first record with a User-agent line with a "*" value, if present. If no record satisfied either condition, or no records are present at all, access is unlimited.

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

However, there are also a lot of bots that do not handle this correctly. We highly recommend pre-sorting the records, as well as the Allow and Disallow lines within them, to reduce problems during parsing.
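A pre-sorted file could look like the following sketch (the bot name and paths are placeholders): records for specific bots come before the catch-all record, and within each record the more specific Allow and Disallow lines come before the more general ones, so that the first match is the intended one:

User-agent: somebot
Allow: /public/
Disallow: /

User-agent: *
Disallow: /private/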

Records never add up

Some webmasters seem to expect that records add up. They do not, and assuming they do can have fatal consequences:

User-Agent: *
Disallow: /url/1

User-Agent: somebot
Disallow: /url/2

User-Agent: somebot
Crawl-delay: 5

does not result in "somebot" interpreting this as

User-Agent: somebot
Disallow: /url/1
Disallow: /url/2
Crawl-delay: 5

but would usually be interpreted as

User-Agent: somebot
Disallow: /url/2

If you want a bot to apply all of the rules, you need to duplicate them in its record.
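For the example above, the version that achieves what was probably intended duplicates the shared rule in the record for "somebot":

User-Agent: somebot
Disallow: /url/1
Disallow: /url/2
Crawl-delay: 5

User-Agent: *
Disallow: /url/1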

Encoding of non US-ASCII characters

A lot of people seem to miss the fact that the path in a rule line has a limited set of allowed characters. Characters that are not within the US-ASCII character set need to be percent-encoded. Instead of using

        User-agent: *
        Disallow: /ä

you should use the encoded version. The encoding differs depending on the character set of your URLs. If your URLs use UTF-8, you would use:

        User-agent: *
        Disallow: /%C3%A4

If you use ISO-8859-1 you would use:

        User-agent: *
        Disallow: /%E4

Because it is likely that users copy your URLs into other character sets, you need to handle both URL variants in your application and use both encoded versions to properly block them:

        User-agent: *
        Disallow: /%C3%A4
        Disallow: /%E4

File format, US-ASCII characters only

The draft is very clear about the file format. The file has to be plain text and there is a detailed BNF-like description. However we often see that people miss this part.

In a robots.txt file only US-ASCII characters are allowed. You are not even allowed to use non US-ASCII characters in the comments.

We highly recommend using only US-ASCII characters to avoid parsing problems.

Unicode Encoding and Byte Order Mark (BOM)

The draft does not specify a content encoding, and we see a lot of different encodings used for robots.txt files. This can result in parsing problems, especially when the robots.txt file contains non-US-ASCII characters.

To avoid problems, it is highly recommended to serve the robots.txt as plain text encoded in UTF-8. This is also the format Google expects.

Sometimes people also put a Byte Order Mark (BOM) at the beginning of the file, and this can be a problem for robots.txt parsers. The Byte Order Mark is an optional, invisible Unicode character used to signal the byte order of a text file or stream.

To increase compatibility, we recommend not using a Byte Order Mark.

Filesize

Even the big search engines don't handle arbitrarily large robots.txt files. Google, for example, has a limit of 500 KB; others might have smaller limits. Keep this in mind and keep the file size to a minimum.

Other problems

Some problems occur because certain robots show specific behaviour.

When you deliver the robots.txt with a 503 status code, the robots of some major search engines will stop crawling the website. Even during a scheduled downtime it is a good idea to keep serving the robots.txt with a 200 status code and deliver 503 status codes only for all the other URLs. Changing the robots.txt to

        User-agent: *
        Disallow: /

is the worst thing you could do during a downtime and usually causes problems that last for quite some time.

Blocking URLs via robots.txt just to keep them out of the index is also not a good idea. Search engines tend to keep listing known pages even after they have been blocked from crawling. You should use noindex as an HTTP header or meta tag to get the pages removed from the index first.
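The noindex instruction can be delivered either as a meta tag in the HTML of the page or as an HTTP response header, for example:

        <meta name="robots" content="noindex">

        X-Robots-Tag: noindex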

Some people also seem to think that you can have a different robots.txt for every directory. This is not the case!

The instructions must be accessible via HTTP [2] from the site that the instructions are to be applied to, as a resource of Internet Media Type [3] "text/plain" under a standard relative path on the server: "/robots.txt".

The robots.txt file needs to be placed in the root directory of the host.
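For example (again using a placeholder hostname), only the first of these two locations is valid; a robots.txt file placed in a subdirectory is simply ignored:

        http://www.example.com/robots.txt
        http://www.example.com/subdir/robots.txt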

 

Author