Robots & Spider Crawls
Search engines like Google develop software programs that are
designed to “crawl” the millions of websites and web pages that
comprise the internet. These programs are known as “spiders” or
“crawlers.” Since we often refer to the internet as the web, the
term spider seemed a good fit to describe programs that crawl
through its millions of pages.
How Spiders Crawl:
When a new website is created, the webmaster submits the website
address to Google and other search engines like Yahoo! and MSN. The
website is added to a list of new sites that crawlers or robots will
visit. It typically takes several weeks for the spider to pay its
initial visit to a newly created website. When the spider reaches
the website, it automatically navigates through the site, looking
for keywords and tagging such as meta tags, and following the inbound
and outbound links and the various components of the site. As a spider
or robot visits and “crawls” through the website, the software is
essentially forming a snapshot of the website and all its individual
web pages at that particular point in time. That snapshot or
“memory” of the website and its individual web pages is cached or
“filed.”
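The crawl described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real search engine's spider works: the class name PageSnapshot, the sample HTML, and the example.com addresses are all made up for this sketch. It reads one page, records its meta tags, and collects the links a spider would follow next.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageSnapshot(HTMLParser):
    """Collects the meta tags and links a crawler might record for one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.meta = {}   # meta name -> content
        self.links = []  # absolute URLs found in <a href="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        href = attrs.get("href")
        if tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""
        elif tag == "a" and href:
            # Resolve relative links against the page address.
            self.links.append(urljoin(self.base_url, href))


# Hypothetical page content standing in for a page the spider has fetched.
SAMPLE_HTML = """
<html><head>
<meta name="keywords" content="spiders, crawlers">
<meta name="description" content="How crawlers work">
</head><body>
<a href="/about.html">About</a>
<a href="http://other.example/">Elsewhere</a>
</body></html>
"""

snapshot = PageSnapshot("http://www.example.com/")
snapshot.feed(SAMPLE_HTML)
print(snapshot.meta)
print(snapshot.links)
```

A real spider would repeat this for every link it collects and store each result, which is the "snapshot" of the site described above.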
Create a Robots.txt File:
Creating a robots.txt file will not improve your search engine
positioning, but it does provide robots with information concerning
which files you will not allow to be crawled and indexed in the
search engines. When a robot crawls your site, it looks for the
robots.txt file. If it doesn't find one, it automatically assumes
that it may crawl and index the entire site. Not having a robots.txt
file can also create unnecessary 404 errors in your server logs,
because each robot's request for the missing file is logged as "not
found," making it more difficult to track "real" 404 errors. Assuming
you want your entire site indexed and only want to stop the unnecessary
404 errors from occurring, you should upload a simple robots.txt file
to the root directory of your domain. The simplest robots.txt file
uses two rules:
- User-agent: (which is the robot the following rule applies to)
- Disallow: (the URL path you want to block; leave it empty to block nothing)
This code allows all robots to crawl all files:
User-agent: *
Disallow:
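If you want to check how a robot would read these rules before uploading them, Python's standard urllib.robotparser module can help. The sketch below is one possible way to do it: the helper name can_crawl, the agent name "ExampleBot", the /private/ folder, and the example.com addresses are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The allow-all rules shown above.
ALLOW_ALL = ["User-agent: *", "Disallow:"]

# A hypothetical variation that blocks one folder for every robot.
BLOCK_PRIVATE = ["User-agent: *", "Disallow: /private/"]


def can_crawl(rules, url, agent="ExampleBot"):
    """Return True if the given robots.txt lines let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse() accepts the file's lines directly
    return parser.can_fetch(agent, url)


print(can_crawl(ALLOW_ALL, "http://www.example.com/any/page.html"))
print(can_crawl(BLOCK_PRIVATE, "http://www.example.com/private/notes.html"))
print(can_crawl(BLOCK_PRIVATE, "http://www.example.com/index.html"))
```

With the empty Disallow line nothing is blocked, while the second rule set keeps robots out of the one folder and leaves the rest of the site open.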
Add a Robots.txt File:
Simply create a text document and save the new document as "robots.txt". Do not use an HTML editor to create the file unless it has the ability to create a plain text document (ASCII). Most Windows computers will allow you to create a text document using Notepad.
Open Notepad and insert the instructions to robots (see code
above). Click on "Save As", then save the document as robots.txt.
Then simply upload the file to your website. Since I use FrontPage Extensions, after I have made the file I click on "Save As" and save the robots.txt file directly into my website folder. The next time I publish, the file is published online along with the rest of the folder.
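If you would rather not create the file by hand, a short script can write it for you. The sketch below is just one way to do it: the function name write_robots_txt is made up, and a temporary folder stands in for your own website folder. It saves the file as plain ASCII text, which is what the robots expect.

```python
import tempfile
from pathlib import Path


def write_robots_txt(site_folder, rules):
    """Write the given rule lines to robots.txt in site_folder as plain ASCII."""
    target = Path(site_folder) / "robots.txt"
    text = "\n".join(rules) + "\n"
    # encoding="ascii" raises an error if a non-ASCII character slips in.
    target.write_text(text, encoding="ascii")
    return target


# A temporary folder stands in for the local website folder here.
with tempfile.TemporaryDirectory() as folder:
    path = write_robots_txt(folder, ["User-agent: *", "Disallow:"])
    saved_name = path.name
    saved_text = path.read_text(encoding="ascii")
    print(saved_name)
    print(saved_text)
```

The script only creates the file locally. You still need to upload or publish it so that it sits in the root directory of your domain.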