What Are Robots, Spiders, And Crawlers?
You should already have a general understanding that a robot, spider, or crawler is a piece of software programmed to “crawl” from one web page to another by following the links on those pages. As this crawler makes its way around the Internet, it collects content (such as text and links) from web sites and saves it in a database that is indexed and ranked according to the search engine’s algorithm.
When a crawler is first released on the Web, it’s usually seeded with a few web sites, and it begins on one of them. The first thing it does on that site is take note of the links on the page. Then it “reads” the text and begins to follow the links it collected. This network of links is called the crawl frontier; it’s the territory that the crawler explores in a very systematic way.
The links in a crawl frontier will sometimes take the crawler to other pages on the same web site, and sometimes they will take it away from the site completely. The crawler will follow the links until it hits a dead end and then backtrack and begin the process again until every link on a page has been followed. Figure 16-1 illustrates the path that a crawler might take.
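The frontier-driven traversal described above can be sketched as a breadth-first search. This is a toy illustration that uses an in-memory link graph in place of live HTTP fetches; the page names and the `get_links` callback are made up for the example:

```python
from collections import deque

def crawl(seed, get_links):
    """Explore a crawl frontier breadth-first, starting from a seed page.

    get_links(page) returns the links found on that page; in a real
    crawler it would fetch and parse the page over HTTP.
    """
    frontier = deque([seed])  # pages discovered but not yet visited
    visited = set()
    order = []
    while frontier:
        page = frontier.popleft()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Every link on the page joins the frontier; a dead end simply
        # contributes nothing, and the crawler moves on.
        for link in get_links(page):
            if link not in visited:
                frontier.append(link)
    return order

# A tiny in-memory "web": each page mapped to the links it contains.
site = {
    "home": ["about", "blog"],
    "about": ["home"],          # link back to a page already visited
    "blog": ["post1", "post2"],
    "post1": [],                # dead end
    "post2": ["external"],      # link leading away from the site
    "external": [],
}

print(crawl("home", lambda p: site.get(p, [])))
# → ['home', 'about', 'blog', 'post1', 'post2', 'external']
```

Real crawlers add politeness delays, depth limits, and per-site request caps on top of this basic loop, but the frontier bookkeeping is the core of it.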
As to what actually happens when a crawler reviews a site, it’s a little more complicated than simply saying that it “reads” the site. The crawler sends a request to the web server where the web site resides, asking for pages to be delivered in the same manner that your web browser requests the pages you view. The difference between what your browser sees and what the crawler sees is that the crawler views the pages in a completely text-based form. No graphics or other types of media files are rendered; it’s all text, encoded in HTML, so to you it might look like gibberish.
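To make that text-only view concrete, here is a rough sketch using Python’s standard-library HTML parser of what a crawler pulls out of raw HTML: the visible text and the links, with images and styling ignored. The sample page is invented for the example:

```python
from html.parser import HTMLParser

class CrawlerView(HTMLParser):
    """Collect what a crawler cares about: text content and links."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every <a href="..."> link.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

    def handle_data(self, data):
        # Keep the readable text; skip whitespace between tags.
        if data.strip():
            self.text.append(data.strip())

page = """<html><body>
<h1>Welcome</h1>
<img src="logo.png" alt="">
<p>Read our <a href="/blog">blog</a>.</p>
</body></html>"""

viewer = CrawlerView()
viewer.feed(page)
print(viewer.text)   # ['Welcome', 'Read our', 'blog', '.']
print(viewer.links)  # ['/blog']
```

Notice that the `<img>` tag contributes nothing: to the crawler the logo simply does not exist, which is why text content matters so much for ranking.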
The crawler can request as many or as few pages as it’s programmed to request at any given time. This can sometimes cause problems for web sites that aren’t prepared to serve up dozens of pages of content at a time. The requests can overload the site and cause it to crash, they can slow traffic to the site considerably, or they can be fulfilled so slowly that the crawler gives up and goes away.
If the crawler does go away, it will eventually return to try again, and it might try several times before giving up entirely. But if the site never manages to cooperate with the crawler, the site is penalized for the failures and its search engine ranking will fall.
In addition, there are a few reasons you may not want a crawler indexing a page on your site:
■ Your page is under construction. If you can avoid it, you don’t want a crawler to index your page while this is happening. If you can’t avoid it, however, be sure that any pages being changed or worked on are excluded from the crawler’s territory. Later, when the page is ready, you can allow it to be indexed again.
■ Pages of links. Having links leading to and away from your site is an essential way to ensure that crawlers find you. However, having pages of links seems suspicious to a search crawler, and it may classify your site as a spam site. Instead of having pages that are all links, break links up with descriptions and text. If that’s not possible, block the link pages from being indexed by crawlers.
■ Pages of old content. Old content, like blog archives, doesn’t necessarily harm your search engine rankings, but it doesn’t help them much either. One worrisome issue with archives, however, is the number of times that archived content appears on your site. With a blog, for example, a post may appear on the page where it was originally displayed, in the archives, and possibly linked from some other area of your site. Although this is all legitimate, crawlers might mistake multiple instances of the same content for spam. Instead of risking it, place your archives off limits to crawlers.
■ Private information. It really makes better sense not to have private (or proprietary) information on a web site at all. But if there is some reason you must have it on your site, then definitely block crawlers from accessing it. Better yet, password-protect the information so that no one can stumble on it accidentally.
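For any single page that falls into one of these categories, a robots meta tag in the page’s head asks crawlers not to index the page or follow its links. This is a standard convention honored by the major search engines, though it relies on the crawler’s cooperation:

```html
<!-- Placed in the <head> of a page you want kept out of the index -->
<meta name="robots" content="noindex, nofollow">
```

Use `noindex` alone if you still want the crawler to follow the page’s links while leaving the page itself out of the index.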
There’s a whole host of reasons you may not want to allow a crawler to visit some of your web pages. It’s just like allowing visitors into your home: you don’t mind if they see the living room, dining room, den, and maybe the kitchen, but you don’t want them in your bedroom without good reason. Crawlers are the guests in your Internet house. Be sure they understand the guidelines under which they are welcome.
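The standard way to state those guidelines site-wide is a robots.txt file placed at the root of your site. The sketch below mirrors the cases listed above; the paths are hypothetical and would be replaced with your site’s real directories:

```
# robots.txt — served from the site root (e.g., example.com/robots.txt)
User-agent: *
Disallow: /under-construction/
Disallow: /links.html
Disallow: /blog/archives/
Disallow: /private/
```

Keep in mind that robots.txt is a convention, not an enforcement mechanism: well-behaved crawlers honor it, but it will not keep out a crawler (or a person) determined to ignore it, which is why truly sensitive content belongs behind a password.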