What Is the Robots Exclusion Standard?


Because crawlers have the potential to wreak havoc on a web site, there need to be some guidelines to keep them in line. Those guidelines are called the Robots Exclusion Standard, the Robots Exclusion Protocol, or simply robots.txt.

 

The file robots.txt is the actual element that you’ll work with. It’s a text-based document that should be included in the root of your domain, and it essentially contains instructions telling any crawler that comes to your site what it is and is not allowed to index.

 

To communicate with the crawler, you need a specific syntax that it can understand. In its most basic form, the text might look something like this:

 

User-agent: *

Disallow: /

 

These two parts of the text are essential. The first part, User-agent:, tells a crawler what user agent, or crawler, you’re commanding. The asterisk (*) indicates that all crawlers are covered, but you can specify a single crawler or even multiple crawlers.

 

The second part, Disallow:, tells the crawler what it is not allowed to access. The slash (/) indicates “all directories.” So in the preceding code example, the robots.txt file is essentially saying that “all crawlers are to ignore all directories.”

 

When you’re writing robots.txt, remember to include the colon (:) after the User-agent indicator and after the Disallow indicator. The colon indicates that important information follows to which the crawler should pay attention.
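
For example, instead of the asterisk you can put a single crawler’s name on the User-agent line. The short example below assumes you want to address only Google’s crawler, Googlebot (one of the crawler names listed later in this section), while leaving every other crawler unrestricted:

User-agent: Googlebot

Disallow: /

This tells Googlebot alone to stay out of all directories; any crawler that isn’t named in the file (and isn’t covered by an asterisk entry) is left unrestricted.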

 

You won’t usually want to tell all crawlers to ignore all directories. Instead, you can tell all crawlers to ignore your temporary directories by writing the text like this:

 

User-agent: *

Disallow: /tmp/

 

Or you can take it one step further and tell all crawlers to ignore multiple directories:

 

User-agent: *

Disallow: /tmp/

Disallow: /private/

Disallow: /links/listing.html

 

That piece of text tells the crawler to ignore the temporary directory, the private directory, and the web page (titled Listing) that contains links; the crawler won’t be able to follow those links.

 

One thing to keep in mind about crawlers is that they read the robots.txt file from top to bottom and as soon as they find a guideline that applies to them, they stop reading and begin crawling your site. So if you’re commanding multiple crawlers with your robots.txt file, you want to be careful how you write it.

 

This is the wrong way:

User-agent: *

Disallow: /tmp/

 

User-agent: CrawlerName

Disallow: /tmp/

Disallow: /links/listing.html

 

This bit of text first tells all crawlers to ignore the temporary directories, so every crawler reading that file will automatically ignore them. But you’ve also told a specific crawler (indicated by CrawlerName) to disallow both the temporary directories and the links on the Listing page. The problem is, the specified crawler will never get that message, because it has already read that all crawlers should ignore the temporary directories.

 

 

If you want to command multiple crawlers, begin by naming the specific crawlers you want to control. Only after they’ve been named should you leave your instructions for all crawlers. Written properly, the text from the preceding code should look like this:

 

 

User-agent: CrawlerName

Disallow: /tmp/

Disallow: /links/listing.html

User-agent: *

Disallow: /tmp/

NOTE

If you have certain pages or links that you want the crawler to ignore, you can accomplish this without causing the crawler to ignore a whole site or a whole directory or having to put a specific meta tag on each page.
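
As a quick sketch of what this NOTE describes (using a hypothetical page name, oldpricing.html), the entry below blocks a single page for every crawler while leaving the rest of the site crawlable, with no meta tags required:

User-agent: *

Disallow: /archive/oldpricing.html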

 

 

Each search engine crawler goes by a different name, and if you look at your web server log, you’ll probably see that name. Here’s a quick list of some of the crawler names that you’re likely to see in that web server log:

 

 

■ Google: Googlebot

■ MSN: MSNbot

■ Yahoo! Web Search: Yahoo SLURP or just SLURP

■ Ask: Teoma

■ AltaVista: Scooter

■ LookSmart: MantraAgent

■ WebCrawler: WebCrawler

■ SearchHippo: Fluffy the Spider

 

 

These are just a few of the search engine crawlers that might crawl across your site. You can find a complete list, along with the text of the Robots Exclusion Standard document, on the Web Robots Pages (www.robotstxt.org). Take the time to read the Robots Exclusion Standard document. It’s not terribly long, and reading it will help you understand how search crawlers interact with your web site. That understanding can also help you learn how to control crawlers better when they come to visit.

 

It pays to know which crawler belongs to which search engine, because there are some spambots and other malicious crawlers out there that are interested in crawling your site for less-than-ethical reasons. If you know the names of these crawlers, you can keep them off of your site and keep your users’ information safe. Spambots in particular are troublesome, because they crawl along the Web searching out and collecting anything that appears to be an e-mail address. These addresses are then collected and sold to marketers or even to people who are not interested in legitimate business opportunities. Most spambots will ignore your robots.txt file.
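
If you do want to shut a crawler out by name, the entry looks like any other user-agent block. The sketch below uses a sample spambot name, EmailSiphon, and bars it from the entire site; just remember the caveat above that a truly malicious crawler may simply ignore the instruction:

User-agent: EmailSiphon

Disallow: /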

 

 

TIP

 

You can view the robots.txt file for any web site that has one by adding the robots.txt extension to the base URL of the site. For example, www.sampleaddress.com/robots.txt will display a page that shows you the text file guiding robots for that site. If you use that extension on a URL and it doesn’t pull up the robots.txt file, then the web site does not have one.

 

 

If you don’t have a robots.txt file, you can create one in any text editor. And keep in mind that not everyone wants or needs to use the robots.txt file. If you don’t care who is crawling your site, then don’t even create the file. Whatever you do, though, don’t use a blank robots.txt file. Crawlers automatically assume an empty file means you don’t want your site to be crawled, so using a blank file is a good way to keep yourself out of search engine results.
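
If you want a robots.txt file that exists but places no restrictions at all, a minimal sketch is to include a Disallow line with nothing after it, which means “disallow nothing”:

User-agent: *

Disallow:

That way the file isn’t blank, yet every crawler is still permitted to index the whole site.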
