Robots, Agents and Spiders - Identifying Search Engine Crawlers
If you've been surfing search engine optimization web sites, you've no doubt come across the above being mentioned on many occasions.
Crawlers, Agents, Bots, Robots and Spiders
Five terms all describing basically the same thing, but in this article they'll be referred to collectively as spiders or "agents". A search engine spider is an automated software program used to locate and collect data from web pages for inclusion in a search engine's database and to follow links to find new pages on the World Wide Web. The term "agent" is more commonly applied to web browsers and mirroring software.
If you've ever examined your server logs or web site traffic reports, you've probably come across some weird and wonderful names for search engine spiders, including "Fluffy the Spider" and Slurp. Depending upon the type of web traffic reports you receive, you may find spiders listed in the "Agents" section of your statistics.
Not all spiders are good
Who actually owns these spiders? It's good to know the beneficial from the bad. Some agents are generated by software such as Teleport Pro, an application that allows people to download a full "mirror" of your site onto their hard drives for viewing later on, or sometimes for more insidious purposes such as plagiarism. If you have a large or image heavy site, the practice of web site stripping could also have a serious impact on your bandwidth usage each month.
Banning spiders and agents
If you notice entries like Teleport Pro and WebStripper in your traffic reports, someone's been busy attempting to download your web site. You don't have to just sit back and let this happen. If you are commercially hosted, you'll be able to add a couple of lines to your robots.txt file to prevent repeat offenders from stripping your site.
The robots.txt file gives search engine spiders and agents direction by informing them what directories and files they are allowed to examine and retrieve. These rules are called The Robots Exclusion Standard.
To prevent certain agents and spiders from accessing any part of your web site, simply enter the following lines into the robots.txt file:
User-agent: NameOfAgent
Disallow: /
Ensure that you enter the name of the agent exactly as it appeared in your reports/logs e.g. Teleport Pro/1.29 and that there is a separate entry for each agent. Skip a line between entries. You could do the same to exclude search engine spiders, but somehow I don't think you'll really want to do this :0). The "/" in the above example means disallow access to any directory. You can also disallow access by spiders and agents to certain directories e.g.
User-agent: *
Disallow: /cgi-bin/
In this example the asterisk (wildcard) indicates "all". Don't use the asterisk in the Disallow statement to indicate "all", use the forward slash instead.
If you don't have a robots.txt file, create one in notepad and upload it to the docs directory (or the root of whichever directory your web pages are stored in). Never use a blank robots.txt file as some search engines may see this as an indication that you don't want your site spidered at all! Have at least one entry in the file.
Unfortunately, defining web stripper agents and spiders in your robots.txt file won't work in all cases as some mirroring software applications have the ability to mimic web browser identifiers; but at least it's some protection that may save you some valuable bandwidth.
If you're not able to create a robots.txt file, which is usually the case if you are hosted by a free hosting service, this tool may be useful:
Search engine spider identification
The following is a basic listing of search engine spider names and their "owners". This is by no means complete, as there are many thousands of search engines on the Internet, but it covers the more common beneficial spiders. Look for these in your traffic reports or search for the names through your server logs to discover which pages they have been spidering. You'll find that many of the entries will also have accompanying numbers or letters e.g Googlebot/2.1 or Slurp.so/1.0
Spider name | Spider owner |
| Googlebot | Google.com |
| TeomaAgent | Teoma.com |
| Zyborg | Wisenut.com |
| Gulliver | NorthernLight.com |
| Architext spider | Excite.com |
| FAST-WebCrawler | FAST (AllTheWeb.com) |
| Slurp | Inktomi.com |
| Ask Jeeves | AskJeeves.com |
| ia_archiver | Alexa.com |
| Scooter | AltaVista.com |
| Mercator | AltaVista.com |
| crawler@fast | FAST (AllTheWeb.com) |
| Crawler | Crawler.de |
| InfoSeek sidewinder | InfoSeek.com |
| Lycos_Spider_(T-Rex) | Lycos.com |
| Fluffy the Spider | SearchHippo.com |
| Ultraseek | InfoSeek.com |
| MantraAgent | LookSmart.com |
| Moget | Goo.jp |
| T-H-U-N-D-E-R-S-T-O-N-E | Thunderstone.com |
| MuscatFerret | Euroferret.com |
| VoilaBot | Voila.fr |
| Sleek Spider | Search-info.com |
| KIT_Fireball | FireBall.de |
| WebCrawler | Webcrawler.com |
If you have spotted any significant activity from these spiders in your reports or logs, there's a good chance that you'll be listed on that particular search engine. But you'll need to be patient; some Search Engines take up to 6 months to refresh their databases!
Copyright information....
This article is free for reproduction but must be reproduced in its entirety & this copyright statement must be included. Visit http://www.tamingthebeast.net to view great articles, tutorials and tools for site owners, web developers and Internet marketers! Subscribe for free to our popular ecommerce/web design ezine!
| DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware. |
More Search Engine Tricks Articles
More By Developer Shed
developerWorks - FREE Tools! |
CakePHP is a stable production-ready, rapid-development aid for building Web sites in PHP. This "Cook up Web sites fast with CakePHP" series shows you how to build an online product catalog using CakePHP. FREE! Go There Now!
|
|
|
|
Download a free trial version of IBM Rational Developer for System z, software that can help you deliver core development capabilities; the power of Java Platform, Enterprise Edition (Java EE); and rapid application development support to diverse enterprise application development teams. With comprehensive development tools to help create, deploy and maintain traditional enterprise and composite applications, Rational Developer for System z enables developers with different technical backgrounds to easily participate in important technology projects. FREE! Go There Now!
|
|
|
|
Manage, govern, and share services across your organization by using WebSphere Service Registry and Repository. Follow the hands-on exercises to learn how to navigate the Web interface to publish, find, reuse, and update services. FREE! Go There Now!
|
|
|
|
Learn how to do more with your reusable assets with the free Rational Asset Manager eKit. The eKit includes demos on how Rational Asset Manager tracks and audits your assets in order to utilize them for reuse. Plus you’ll find white papers and a Webcast that discuss the challenges of a Service Oriented Architecture and how Rational Asset Manager can provide quick and effective solutions. FREE! Go There Now!
|
|
|
|
Join this Rational Talks to You teleconference on December 4 at 1:00 pm ET to discuss how Rational Method Composer can help meet your compliance objectives. Get your questions answered! FREE! Go There Now!
|
|
|
|
Join this webcast to discover the key requirements for successful change and release management. Learn how to extend your .NET environment to improve productivity and collaboration, and address core problems afflicting team development. In this webcast, we’ll review typical challenges faced by customers and how to resolve them with the IBM Rational Change and Release Management solution, including Rational ClearCase, Rational ClearQuest and Rational Build Forge. Replay is available for 9 months. FREE! Go There Now!
|
|
|
|
As businesses grow increasingly dependent upon Web applications, these complex entities grow more difficult to secure. Most companies equip their Web sites with firewalls, Secure Sockets Layer (SSL), and network and host security, but the majority of attacks are on applications themselves – and these technologies cannot prevent them. This paper explains what you can do to help protect your organization, and it discusses an approach for improving your organization’s Web application security. FREE! Go There Now!
|
|
|
|
Whether you are creating new applications or modifying existing ones, managing integration of new components with traditional z/OS elements is a critical part of building and deploying modern applications. Listen to this webcast to see how IBM can help you optimize your development process using an IDE like Rational Developer for System z that integrates with management tools, such as ClearCase to manage your application development on mainframes. FREE! Go There Now!
|
|
|
|
In this webcast, IBM Rational will discuss the importance of Web application security and will share techniques and best practices to introduce application security testing into current QA processes including: understanding common security vulnerabilities and techniques to integrate security testing with defect tracking and remediation systems in an effort to safeguard sensitive online information. FREE! Go There Now!
|
|
|
|
User communities play an important role in communication and collaboration around products, solutions and other areas of special interest to members. Successful communities are able to provide the right mix of content and services to deliver a value proposition that resonates with each audience. Join Tom Inman, VP of Marketing for Information and Platform Solutions as he introduces the new LeverageINFORMATION community. During this webcast, learn about the value provided by the community and how customers and partners derive value from the community in addressing their own technical and business challenges. FREE! Go There Now!
|
|
|
|
All FREE IBM® developerWorks Tools! |