How to Control Search Engine Robots - You can deter...
You can deter crawlers from indexing the 'duplicate' directory by adding the two lines shown below to your robots.txt file. If you would rather have the robots.txt file generated for you, visit www.rietta.com/robogen. To validate your robots.txt file and make sure it works properly, visit www.searchengineworld.com/cgi-bin/robotcheck.cgi.
User-agent: *
Disallow: /duplicate/
The * after User-agent means the rule applies to all crawlers, and /duplicate/ after Disallow tells those crawlers to skip that directory and not search it. Each record (a User-agent line together with its Disallow lines) must be separated from the next record by a blank line in order to work correctly. This is how you would combine two such records in one robots.txt file:
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
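As a rough way to check how a well-behaved crawler would read the file above, here is a minimal sketch using Python's standard urllib.robotparser module. The host www.example.com, the crawler name Googlebot, and the helper build_parser are placeholders chosen for illustration; they do not come from the article, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The same two records as in the article, separated by a blank line.
ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

# Placeholder host, assumed for illustration only.
BASE = "http://www.example.com"


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "Googlebot" is just a sample crawler name that falls under the * record.
checks = [
    ("ia_archiver", "/index.html"),
    ("Googlebot", "/duplicate/page.html"),
    ("Googlebot", "/index.html"),
]

for agent, path in checks:
    allowed = parser.can_fetch(agent, BASE + path)
    print(f"{agent:12} {path:22} {'allowed' if allowed else 'blocked'}")
```

With these rules, the script should report ia_archiver as blocked everywhere, and any other crawler as blocked only inside /duplicate/.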
One very important thing to note: anyone can read a site's robots.txt file. So if there is a directory that you don't want anyone to know about, don't list it in the robots.txt file. If that directory is not linked to from anywhere on your web site, crawlers are unlikely to find and index it anyway.
Another way to keep a page out of the index is to put a robots meta tag into the page itself. It looks like this: <meta name="robots" content="noindex,nofollow">
Put this tag inside the head section of your web page. It tells crawlers not to index the page and not to follow any of the hyperlinks on it. As another example, <meta name="robots" content="noindex,follow"> tells crawlers not to index the page, but to follow the hyperlinks on it.
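To confirm which directives a page actually carries, the following sketch reads the robots meta tag with Python's standard html.parser module. The class RobotsMetaParser, the function robots_directives, and the sample page are hypothetical names invented for this example; it only inspects the tag and makes no claim about how any particular search engine acts on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values may be None, so fall back to an empty string.
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Sample page, assumed for illustration only.
PAGE = """<html><head>
<title>Duplicate page</title>
<meta name="robots" content="noindex,follow">
</head><body><a href="/other.html">Other page</a></body></html>"""

found = robots_directives(PAGE)
print("index this page: ", "noindex" not in found)
print("follow its links:", "nofollow" not in found)
```

A page with no robots meta tag yields an empty set, which this sketch treats as "index and follow".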
More By Jase Dow