What is a robots.txt File and How to Use it

This article covers general information about the robots.txt file, the fundamentals of robots.txt syntax, and examples of its usage. It also looks at robots.txt and SEO: removing exclusions of images, adding a reference to your sitemap.xml file, and some miscellaneous remarks. Finally, it discusses robots.txt for WordPress websites, blocking the main WordPress directories, and duplicate content issues in WordPress web hosting.

General Information About robots.txt

Robots.txt is a text file located in the site’s root directory that tells search engine crawlers and spiders which pages and files of your website they may crawl. Website owners generally work hard to be noticed by search engines, but sometimes this is not desirable, for example if you store confidential data or want to save bandwidth by excluding heavy, image-filled pages from indexation.

When a crawler visits a website, it first looks for a file named '/robots.txt'. If the file is found, the crawler reads it to determine whether, and which parts of, the website it is allowed to index.

NOTE: There can be no more than one robots.txt file per website. A robots.txt file for an addon domain must be placed in the corresponding document root.

Google also publishes official documentation describing how its crawlers handle the robots.txt file.

A robots.txt file consists of records, each containing two fields: one line with a user-agent name and one or more lines starting with the Disallow: directive. The file must be created in the UNIX text format.

Basics of robots.txt syntax

Usually, a robots.txt file contains something like the following:

   User-agent: *

   Disallow: /cgi-bin/

   Disallow: /tmp/

   Disallow: /~different/

In this example, the three directories '/cgi-bin/', '/tmp/' and '/~different/' are excluded from indexation by all crawlers.

NOTE: Each directory is written on a separate line. You cannot write 'Disallow: /cgi-bin/ /tmp/' on one line, nor can you split a single Disallow or User-agent directive across several lines. Use a new line to separate directives from each other.
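For example, using the directories from the snippet above, the incorrect and the correct form look like this (lines starting with '#' are comments and are ignored by crawlers):

   # Incorrect: several paths on a single Disallow line
   Disallow: /cgi-bin/ /tmp/

   # Correct: one path per Disallow line
   Disallow: /cgi-bin/
   Disallow: /tmp/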

Some common mistakes are typos: misspelled directories, misspelled user-agent names, missing colons after User-agent and Disallow, and so on. The more complicated your robots.txt file becomes, the easier it is for an error to slip in. Validation tools such as http://tool.motoricerca.info/robots-checker.phtml come in handy.

Examples of usage

Here are some typical examples of robots.txt usage:

Prevent the whole website from being indexed by all web crawlers:

   User-agent: *

   Disallow: /

Allow all web crawlers to index the entire website:

   User-agent: *

   Disallow:

Prevent only specific directories from being indexed:

   User-agent: *

   Disallow: /cgi-bin/

Prevent the site’s indexation by a specific web crawler:

   User-agent: Bot1

   Disallow: /

Lists of all user-agent names, divided into categories, can be found online.

Allow indexation to one specific web crawler and prevent it for all others:

   User-agent: Googlebot

   Disallow:

   User-agent: *

   Disallow: /

Prevent all files from being indexed except a single one.

This is somewhat awkward, as the original robots.txt specification does not define an 'Allow' directive. The usual workaround is to move all the files into a certain subdirectory and prevent that subdirectory from being indexed, keeping the one file that should be indexed outside of it:

  User-agent: *

  Disallow: /docs/
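That said, major crawlers such as Googlebot and Bingbot do understand an 'Allow' directive (it is part of the current robots.txt specification, RFC 9309), so an alternative sketch is to disallow the directory and explicitly re-allow the single file; the file name below is only a placeholder:

  User-agent: *
  Disallow: /docs/
  Allow: /docs/allowed-file.html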

You may also use an online robots.txt file generator.

Robots.txt and SEO

Removing exclusion of images

The default robots.txt file in some CMS versions is set up to exclude your images folder. This problem does not occur in the latest CMS versions, but older versions should be checked.
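For example, if your robots.txt contains a line like the following (the folder name here is only an illustration; the actual path depends on your CMS), remove it so that your images can be crawled and shown in image search results:

   Disallow: /images/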

Adding reference to your sitemap.xml file

If you have a sitemap.xml file (and you should, as it helps your SEO), it is good practice to add the following line to your robots.txt file:

   Sitemap: http://www.domain.com/sitemap.xml

Miscellaneous comments

  • You are advised not to block CSS, JavaScript or other resource files by default. Blocking them prevents Googlebot from properly rendering the page and recognizing that your website is mobile-friendly (see the sketch after this list).
  • You can also use the file to prevent specific pages, such as login or 404 pages, from being indexed, although this is better done with the robots meta tag.
  • As a rule of thumb, the robots.txt file should not be used to deal with duplicate content. There are better ways, such as the rel=canonical tag, which is placed in the HTML head of a web page.
  • Always remember that robots.txt is not a cure-all. Other tools at your disposal can often do a better job, such as the parameter handling tools within Google and Bing Webmaster Tools, the X-Robots-Tag header, and the meta robots tag.
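For example, if a directory such as /assets/ (a hypothetical path used only for illustration) is disallowed but you want its stylesheets and scripts to stay crawlable, crawlers that honour the Allow directive and wildcard patterns, such as Googlebot and Bingbot, will follow the longer, more specific Allow rules in this sketch:

   User-agent: *
   Disallow: /assets/
   Allow: /assets/*.css$
   Allow: /assets/*.js$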

Robots.txt for WordPress

WordPress creates a virtual robots.txt file when you publish your first post. However, if you have already created a real robots.txt file on your server, WordPress will not add a virtual one.

A virtual robots.txt file does not physically exist on the server; you can only access it via a URL such as: http://www.yoursite.com/robots.txt

By default, it has Google’s Mediabot allowed, a number of known spambots disallowed, and some standard WordPress folders and files disallowed.
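For reference, the exact contents of the virtual file depend on your WordPress version and settings; recent versions generate something much simpler, roughly along these lines:

   User-agent: *
   Disallow: /wp-admin/
   Allow: /wp-admin/admin-ajax.php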

If you have not created a real robots.txt file yet, craft one with any text editor and upload it to the root directory of your server via FTP.

Blocking Main WordPress Directories

There are three standard directories in every WordPress installation: wp-content, wp-admin and wp-includes, and they do not need to be indexed.

Do not disallow the whole wp-content folder, though, as it contains an 'uploads' subfolder with your site’s media files, which you do not want blocked. Instead, add the following lines under your 'User-agent: *' line:

   Disallow: /wp-admin/

   Disallow: /wp-includes/

   Disallow: /wp-content/plugins/

   Disallow: /wp-content/themes/
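Putting these rules together with the sitemap reference mentioned earlier (the domain below is only a placeholder), a minimal sketch of a complete WordPress robots.txt might look like this:

   User-agent: *
   Disallow: /wp-admin/
   Disallow: /wp-includes/
   Disallow: /wp-content/plugins/
   Disallow: /wp-content/themes/

   Sitemap: http://www.domain.com/sitemap.xml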

Blocking on the basis of the structure of your website

Every blog can be structured in various ways:

a) On the basis of categories

b) On the basis of tags

c) On the basis of both, or of neither

d) On the basis of date-based archives

a) If your website is category-structured, you do not need to have the Tag archives indexed. Find your tag base in the Permalinks options page under the Settings menu. If the field is left blank, the tag base is simply 'tag':

   Disallow: /tag/

b) If your website is tag-structured, you need to block the category archives instead. Find your category base and use the following directive:

   Disallow: /category/

c) If you use both categories and tags, you do not need either of these directives. If you use neither of them, you need to block both:

   Disallow: /tag/

   Disallow: /category/

d) If your website is structured on the basis of date-based archives, you can block them as follows:

   Disallow: /2010/

   Disallow: /2011/

   Disallow: /2012/

   Disallow: /2013/

NOTE: You cannot use 'Disallow: /20*/' here, as such a wildcard pattern would also block every single blog post or page whose URL starts with the number '20'.

Duplicate content problems in WordPress

By default, WordPress generates duplicate content (for example, the same posts appearing on archive, tag and category pages), which does not work in favor of your SEO rankings. To deal with it, we suggest that you do not use robots.txt but instead take a subtler approach: the rel=canonical tag, which you place in the HTML head section of a page to point to its single correct canonical URL. That way, web crawlers will only index the canonical version of a page.

If you need any further help, you may contact the Sky Host support team to get your issue resolved.