Try to control which bots and spiders crawl your site

Published: 3 years, 4 months ago

An article tagged as: apache, web server


Overview

A robots.txt file is an important tool when optimising a website. It tells web robots and spiders which areas of a site they may access.

What are web robots?

Web robots, also known as crawlers or spiders, are programs that index information on the internet. Some are sent by search engines like Google, and some are sent by devious programmers to harvest information such as email addresses.

Create a robots.txt file

Create a text file and name it robots.txt. Enter the following into your robots.txt file:

User-agent: *
Disallow:

User-agent specifies which web robot the rules that follow apply to. The * wildcard character matches all web robots.

Disallow specifies a path on the site that robots should not crawl. An empty value means nothing is disallowed.

So the above robots.txt file means all robots are allowed everywhere.
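If you want to sanity-check the rules before deploying them, Python's standard-library urllib.robotparser module can parse them directly (the robot name and URL below are just illustrative examples):

```python
from urllib import robotparser

# Parse the allow-all rules shown above
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
])

# With an empty Disallow, any robot may fetch any path
print(rp.can_fetch("AnyRobot", "https://example.com/any/page.html"))
```

This prints True, confirming that an empty Disallow value permits access everywhere.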

Restrict access

Apply rules to all robots

User-agent: *
Disallow: /

The above means all robots are not allowed anywhere.

User-agent: *
Disallow: /dev/
Disallow: /uat/

The above means all robots are allowed everywhere except in /dev/ and /uat/.

Apply rules to specific robots

User-agent: BadRobot
Disallow: /

The above means BadRobot is not allowed anywhere.
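A quick check with urllib.robotparser shows that a per-robot rule only affects the named robot; any other user agent (GoodRobot here is a made-up name) remains unrestricted:

```python
from urllib import robotparser

# Rule applies only to the robot named BadRobot
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: BadRobot",
    "Disallow: /",
])

# BadRobot is banned everywhere; other robots are unaffected
print(rp.can_fetch("BadRobot", "https://example.com/index.html"))   # blocked
print(rp.can_fetch("GoodRobot", "https://example.com/index.html"))  # allowed
```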

Please note that there is no guarantee that a web robot will adhere to the rules within your robots.txt file. Well-behaved crawlers such as search engine robots respect it, but malicious robots typically ignore it.

Upload robots.txt

Once complete, upload the robots.txt file to the root directory of your website. This is normally the directory that your home page is in.

References

This article was possible thanks to robotstxt.org