Robots, Agents and Spiders - Identifying Search Engine Crawlers
By: Developer Shed
    2003-08-09


    If you've been surfing search engine optimization web sites, you've no doubt seen robots, agents and spiders mentioned on many occasions.

    Crawlers, Agents, Bots, Robots and Spiders
    Five terms all describing basically the same thing, but in this article they'll be referred to collectively as spiders or "agents". A search engine spider is an automated software program used to locate and collect data from web pages for inclusion in a search engine's database and to follow links to find new pages on the World Wide Web. The term "agent" is more commonly applied to web browsers and mirroring software.

    If you've ever examined your server logs or web site traffic reports, you've probably come across some weird and wonderful names for search engine spiders, including "Fluffy the Spider" and Slurp. Depending upon the type of web traffic reports you receive, you may find spiders listed in the "Agents" section of your statistics.
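    If your traffic reports don't include an "Agents" section, the names can be pulled from the raw server logs. The Python sketch below tallies the user agent recorded for each request. It assumes the Apache "combined" log format, where the user agent is the last double-quoted field on each line; your server's format may differ. The sample lines and the function name count_agents are made up for the example.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the User-Agent is the
# last double-quoted field on each line.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')

def count_agents(lines):
    """Return a Counter mapping each User-Agent string to its hit count."""
    counts = Counter()
    for line in lines:
        match = LAST_QUOTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample log lines, for illustration only.
sample = [
    '1.2.3.4 - - [09/Aug/2003:10:00:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '1.2.3.4 - - [09/Aug/2003:10:00:05 +0000] "GET /a.html HTTP/1.0" 200 734 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [09/Aug/2003:10:01:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Teleport Pro/1.29"',
]

# Print agents from most to least frequent.
for agent, hits in count_agents(sample).most_common():
    print(hits, agent)
```

    To run it against a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.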

    Not all spiders are good
    Who actually owns these spiders? It pays to be able to tell the beneficial from the bad. Some agents are generated by software such as Teleport Pro, an application that allows people to download a full "mirror" of your site onto their hard drives for viewing later on, or sometimes for more insidious purposes such as plagiarism. If you have a large or image-heavy site, the practice of web site stripping could also have a serious impact on your bandwidth usage each month.

    Banning spiders and agents
    If you notice entries like Teleport Pro and WebStripper in your traffic reports, someone's been busy attempting to download your web site. You don't have to just sit back and let this happen. If you are commercially hosted, you'll be able to add a couple of lines to your robots.txt file to prevent repeat offenders from stripping your site.

    The robots.txt file gives search engine spiders and agents direction by informing them which directories and files they are allowed to examine and retrieve. These rules are known as the Robots Exclusion Standard.

    To prevent certain agents and spiders from accessing any part of your web site, simply enter the following lines into the robots.txt file:

    User-agent: NameOfAgent
    Disallow: /

    Enter the name of the agent as it appears in your reports or logs, and give each agent its own entry, with a blank line between entries. Robots generally match on the name alone rather than the version, so an agent logged as Teleport Pro/1.29 can usually be listed simply as Teleport Pro. You could do the same to exclude search engine spiders, but somehow I don't think you'll really want to do this :0). The "/" in the above example means disallow access to the entire site. You can also disallow access by spiders and agents to certain directories, e.g.

    User-agent: *
    Disallow: /cgi-bin/

    In this example the asterisk (wildcard) in the User-agent line indicates "all" spiders and agents. Don't use the asterisk in the Disallow statement to indicate "all"; use the forward slash instead.
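    One way to check how a set of rules will be read before uploading it is to run it through a parser. The sketch below uses Python's standard urllib.robotparser module on the two kinds of entry shown above. "Teleport Pro" stands in for NameOfAgent, and the helper name allowed and the test paths are invented for the example. Real spiders may interpret rules slightly differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Example rules combining the two entries shown above.
# "Teleport Pro" stands in for NameOfAgent; a blank line separates entries.
RULES = """\
User-agent: Teleport Pro
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(agent, path):
    """Report whether these rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)

print(allowed("Teleport Pro/1.29", "/index.html"))   # False: banned everywhere
print(allowed("Googlebot/2.1", "/index.html"))       # True: ordinary page
print(allowed("Googlebot/2.1", "/cgi-bin/form.pl"))  # False: disallowed directory
```

    Python's parser ignores the version after the slash when matching an agent name, which is why "Teleport Pro/1.29" is caught by the "Teleport Pro" entry.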

    If you don't have a robots.txt file, create one in Notepad and upload it to the docs directory (or the root of whichever directory your web pages are stored in), so that it is served from the top level of your site. Avoid leaving robots.txt blank: the standard treats an empty file as placing no restrictions, but some search engines may see it as an indication that you don't want your site spidered at all! Have at least one entry in the file.

    Unfortunately, defining web stripper agents and spiders in your robots.txt file won't work in all cases as some mirroring software applications have the ability to mimic web browser identifiers; but at least it's some protection that may save you some valuable bandwidth.

    If you're not able to create a robots.txt file, which is usually the case if you are hosted by a free hosting service, this tool may be useful:

    Search engine spider identification
    The following is a basic listing of search engine spider names and their "owners". This is by no means complete, as there are many thousands of search engines on the Internet, but it covers the more common beneficial spiders. Look for these in your traffic reports or search for the names through your server logs to discover which pages they have been spidering. You'll find that many of the entries will also have accompanying numbers or letters, e.g. Googlebot/2.1 or Slurp.so/1.0.

    Spider name               Spider owner
    ------------------------  --------------------
    Googlebot                 Google.com
    TeomaAgent                Teoma.com
    Zyborg                    Wisenut.com
    Gulliver                  NorthernLight.com
    Architext spider          Excite.com
    FAST-WebCrawler           FAST (AllTheWeb.com)
    Slurp                     Inktomi.com
    Ask Jeeves                AskJeeves.com
    ia_archiver               Alexa.com
    Scooter                   AltaVista.com
    Mercator                  AltaVista.com
    crawler@fast              FAST (AllTheWeb.com)
    Crawler                   Crawler.de
    InfoSeek sidewinder       InfoSeek.com
    Lycos_Spider_(T-Rex)      Lycos.com
    Fluffy the Spider         SearchHippo.com
    Ultraseek                 InfoSeek.com
    MantraAgent               LookSmart.com
    Moget                     Goo.jp
    T-H-U-N-D-E-R-S-T-O-N-E   Thunderstone.com
    MuscatFerret              Euroferret.com
    VoilaBot                  Voila.fr
    Sleek Spider              Search-info.com
    KIT_Fireball              FireBall.de
    WebCrawler                Webcrawler.com
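
    Because logged names often carry version numbers or other suffixes, a simple substring check is a practical way to match them against the listing. The Python sketch below does this for a handful of entries copied from the table above. The dictionary KNOWN_SPIDERS and the function identify_spider are illustrative names, and a real lookup would need the full list.

```python
# A few spider names and owners copied from the table above.
KNOWN_SPIDERS = {
    "Googlebot": "Google.com",
    "Slurp": "Inktomi.com",
    "Scooter": "AltaVista.com",
    "ia_archiver": "Alexa.com",
    "FAST-WebCrawler": "FAST (AllTheWeb.com)",
}

def identify_spider(user_agent):
    """Return (name, owner) for the first known spider found, else None."""
    lowered = user_agent.lower()
    for name, owner in KNOWN_SPIDERS.items():
        # Case-insensitive substring match copes with suffixes like "/2.1".
        if name.lower() in lowered:
            return name, owner
    return None

print(identify_spider("Googlebot/2.1"))
print(identify_spider("Slurp.so/1.0"))
print(identify_spider("Mozilla/4.0 (compatible; MSIE 6.0)"))
```

    Bear in mind that a substring match can be fooled: as with mirroring software that mimics browser identifiers, any program can claim a spider's name in its user agent.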

     

    If you have spotted any significant activity from these spiders in your reports or logs, there's a good chance that you'll be listed on that particular search engine. But you'll need to be patient; some search engines take up to six months to refresh their databases!

    Copyright information....
    This article is free for reproduction but must be reproduced in its entirety & this copyright statement must be included. Visit http://www.tamingthebeast.net to view great articles, tutorials and tools for site owners, web developers and Internet marketers! Subscribe for free to our popular ecommerce/web design ezine!



    © 2003-2009 by Developer Shed. All rights reserved.