Welcome to 1st Webmaster Free Resources
- About WWW robots
- Indexing robots
- For Server Administrators
- Robots exclusion standard
About Web Robots
A robot is a program that automatically traverses the Web's hypertext
structure by retrieving a document, and recursively retrieving all
documents that are referenced.
Note that "recursive" here doesn't limit the definition to any specific
traversal algorithm; even if a robot applies some heuristic to the
selection and order of documents to visit and spaces out requests
over a long space of time, it is still a robot.
Normal Web browsers are not robots, because they are operated by a human,
and don't automatically retrieve referenced documents (other than
inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers,
or Spiders. These names are a bit misleading as they give the impression
the software itself moves between sites like a virus; this is not the case:
a robot simply visits sites by requesting documents from them.
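As a rough illustration of the definition above, here is a minimal Python sketch of a robot: it retrieves a document, extracts the links, and retrieves each referenced document in turn, visiting every document once. The PAGES dictionary, the fetch function, and the example.test URLs are made up for this sketch; a real robot would issue HTTP requests instead and space them out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical documents standing in for a web server (not real URLs).
PAGES = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Stand-in for an HTTP GET; returns None for unknown documents."""
    return PAGES.get(url)


def traverse(start_url):
    """Visit every document reachable from start_url exactly once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(body)
        parser.close()
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

The queue makes this a breadth-first traversal, but as noted above the order is a design choice: any selection heuristic still makes it a robot.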
The word "agent" is used for lots of meanings in computing these days.
Specifically:
- Autonomous agents are programs that do travel between sites, deciding
  themselves when to move and what to do.
  These can only travel between special servers and are currently not
  widespread in the Internet.
- Intelligent agents are programs that help users with things, such as
  choosing a product, or guiding a user through form filling, or even
  helping users find things. These have generally little to do with
  networking.
- User-agent is a technical name for programs that perform networking
  tasks for a user, such as Web User-agents like Netscape Navigator and
  Microsoft Internet Explorer, and
  Email User-agents like Qualcomm Eudora.
A search engine is a program that searches through some dataset. In the
context of the Web, the word "search engine" is most often used for search
forms that search through databases of HTML documents gathered by a robot.
Robots can be used for a number of purposes:
- Indexing
- HTML validation
- Link validation
- "What's New" monitoring
See the list of active robots
to see what robot does what.
Don't ask me -- all I know is what's on the list...
They're all names for the same sort of thing, with slightly different
connotations:
- Robots - the generic name, see above.
- Spiders - same as robots, but sounds cooler in the press.
- Worms - same as robots, although technically a worm is a replicating program,
  unlike a robot.
- Web crawlers - same as robots, but note WebCrawler
  is a specific robot.
- WebAnts - distributed cooperating robots.
There are a few reasons people believe robots are bad for the Web:
- Certain robot implementations can overload (and in the past have
  overloaded) networks and servers. This happens especially with people
  who are just starting to write a robot; these days there is sufficient
  information on robots to prevent some of these mistakes.
- Robots are operated by humans, who make mistakes in configuration,
  or simply don't consider the implications of their actions.
  This means people need to be careful, and robot authors need to make
  it difficult for people to make mistakes with bad effects.
- Web-wide indexing robots build a central database of documents,
  which doesn't scale too well to millions of documents on millions
  of sites.
But at the same time the majority of robots are well designed,
professionally operated, cause no problems, and provide a valuable service
in the absence of widely deployed better solutions.
So no, robots aren't inherently bad, nor inherently brilliant,
and need careful attention.
A few others can be found in the
Software Agents Mailing List FAQ.
Internet Agents: Spiders, Wanderers, Brokers, and Bots
by Fah-Chun Cheong.
- This book covers Web robots, commerce transaction agents,
Mud agents, and a few others. It includes source code for a
simple Web robot based on top of libwww-perl4.
Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be
a "how to write a web robot" book, but it provides useful background
reading and a good overview of the state-of-the-art, especially if you
haven't got the time to find all the info yourself on the Web.
Published by New Riders,
Bots and Other Internet Beasties
by Joseph Williams
I haven't seen this myself, but someone said:
The William's book 'Bots and other Internet Beasties' was quite
disappointing. It claims to be a 'how to' book on writing robots,
but my impression is that it is nothing more than a collection
of chapters, written by various people involved in this area and
subsequently bound together.
Published by Sams, ISBN: 1-57521-016-9
Web Client Programming with Perl by Clinton Wong
This O'Reilly book is planned for Fall 1996; check the
O'Reilly Web Site for
the current status. It promises to be a practical book,
but I haven't seen it yet.
There is a Web robots home page on:
Of course the latest version of this FAQ is there.
You'll also find details and an archive of the robots mailing
list, which is intended for technical discussions about robots.
This depends on the robot; each one uses different strategies.
In general they start from a historical list of URLs, especially
of documents with many links elsewhere, such as server lists,
"What's New" pages, and the most popular sites on the Web.
Most indexing services also allow you to submit URLs manually,
which will then be queued and visited by the robot.
Sometimes other sources for URLs are used, such as scanners
through USENET postings, published mailing list archives, etc.
Given those starting points a robot can select URLs to visit
and index, and to parse and use as a source for new URLs.
If an indexing robot knows about a document, it may decide to
parse it, and insert it into its database. How this is done
depends on the robot: Some robots index the HTML
Titles, or the first few paragraphs, or parse the entire
HTML and index all words, with weightings depending on HTML
constructs, etc. Some parse the META tag, or other special hidden tags.
We hope that as the Web evolves more facilities become available
to efficiently associate meta data such as indexing information
with a document. This is being worked on...
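To make the indexing strategies above a little more concrete, here is a minimal Python sketch that scores the words of an HTML document, weighting words from the title and from META keywords/description more heavily than body text. The weights and the choice of META names are assumptions for this example; each real robot makes its own choices.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights per HTML construct; real robots choose their own.
WEIGHTS = {"title": 5, "meta": 3, "body": 1}


class IndexingParser(HTMLParser):
    """Accumulates a weighted score for every word in a document."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self._in_title = False

    def _add(self, text, kind):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.scores[word] += WEIGHTS[kind]

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self._add(attributes.get("content") or "", "meta")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        self._add(data, "title" if self._in_title else "body")


def index_document(html):
    """Return {word: weighted score} for one HTML document."""
    parser = IndexingParser()
    parser.feed(html)
    parser.close()
    return dict(parser.scores)
```

A robot that indexes only titles would simply set the "meta" and "body" weights to zero.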
You guessed it, it depends on the service :-) Many services have
a link to a URL submission form on their search page, or have more
information in their help pages. For example, Google has
Information for Webmasters.
For Server Administrators
You can check your server logs for sites that retrieve many
documents, especially in a short time.
If your server supports User-agent logging you can check for
retrievals with unusual User-agent header values.
Finally, if you notice a site repeatedly checking for the file
'/robots.txt' chances are that is a robot too.
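The three hints above can be checked with a short script. The following Python sketch assumes your server writes the common "combined" log format; the threshold of 100 requests and the idea that ordinary browsers send a User-agent starting with "Mozilla/" are assumptions you should adjust for your own site.

```python
import re
from collections import defaultdict

# Combined Log Format; the referer and User-agent fields are optional here.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def likely_robots(lines, min_requests=100, browser_prefixes=("Mozilla/",)):
    """Return {host: [reasons]} for hosts that look like robots."""
    requests = defaultdict(int)
    agents = defaultdict(set)
    asked_robots_txt = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        host = match.group("host")
        requests[host] += 1
        if match.group("path") == "/robots.txt":
            asked_robots_txt.add(host)
        agent = match.group("agent")
        if agent and agent != "-":
            agents[host].add(agent)

    suspects = {}
    for host, count in requests.items():
        reasons = []
        if count >= min_requests:
            reasons.append(f"{count} requests")
        if host in asked_robots_txt:
            reasons.append("requested /robots.txt")
        unusual = sorted(
            a for a in agents[host] if not a.startswith(browser_prefixes)
        )
        if unusual:
            reasons.append("unusual User-agent: " + ", ".join(unusual))
        if reasons:
            suspects[host] = reasons
    return suspects
```

Pass it the lines of your access log, for example `likely_robots(open("access_log"))`, where "access_log" is whatever your server calls its log file.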
Well, nothing :-) The whole idea is they are automatic; you don't
need to do anything.
If you think you have discovered a new robot (i.e. one that is not on
the list of active robots), and it does more than sporadic visits,
drop me a line so I can make a note of it for future reference.
But please don't tell me about every robot that happens to drop by!
This is called "rapid-fire", and people usually notice it if they're
monitoring or analysing an access log file.
First of all check if it is a problem by checking the load of your server,
and monitoring your server's error log, and concurrent connections if
you can. If you have a medium or high performance server, it is quite
likely to be able to cope with a high load of even several requests per second,
especially if the visits are quick.
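If you want a number to go with a suspected rapid-fire visit, the following Python sketch computes, per host, the largest number of requests seen within a short time window. It assumes you have already extracted (host, timestamp) pairs from your log; the parse_clf_time helper assumes Common Log Format timestamps, and the one-second default window is only an illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def parse_clf_time(text):
    """Parse a Common Log Format timestamp such as 10/Oct/1996:13:55:36 -0700."""
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


def peak_request_rates(events, window_seconds=1):
    """Return {host: most requests seen within any window of that length}."""
    window = timedelta(seconds=window_seconds)
    recent = defaultdict(deque)  # per host: timestamps inside the window
    peaks = defaultdict(int)
    for host, when in sorted(events, key=lambda event: event[1]):
        times = recent[host]
        times.append(when)
        # Drop requests that are a full window (or more) older than this one.
        while when - times[0] >= window:
            times.popleft()
        peaks[host] = max(peaks[host], len(times))
    return dict(peaks)
```

A host whose peak is several requests per second for a sustained period is a candidate for the checks described here.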
However you may have problems if you have a low performance site, such as
your own desktop PC or Mac you're working on, or you run low performance
server software, or if you have many long retrievals (such as CGI scripts
or large documents). These problems manifest themselves in refused
connections, a high load, performance slowdowns, or in extreme cases a
system crash.
If this happens, there are a few things you should do. Most importantly,
start logging information: when did you notice, what happened, what do
your logs say, what are you doing in response, etc.; this helps when investigating
the problem later. Secondly, try and find out where the robot came from,
what IP addresses or DNS domains, and see if they are mentioned in the
list of active robots. If you can identify a site this way, you can
email the person responsible, and ask them what's up. If this doesn't help,
try their own site for telephone numbers, or mail postmaster at their
domain.
If the robot is not on the list, mail me with all the information you
have collected, including actions on your part. If I can't help, at least
I can make a note of it for others.
Read the next section...
Robots exclusion standard
They are probably from robots trying to see if you have specified
any rules for them using the Standard
for Robot Exclusion, see also below.
If you don't care about robots and want to prevent the messages
in your error logs, simply create an empty file called robots.txt
in the root level of your server.
Don't put any HTML or English language "Who the hell are you?"
text in it -- it will probably never get read by anyone :-)
The quick way to prevent robots visiting your site is to put these
two lines into the /robots.txt file on your server:
    User-agent: *
    Disallow: /
but it's easy to be more selective than that.
You can read the whole standard specification
but the basic concept is simple: by writing a structured text
file you can indicate to robots that certain parts of your
server are off-limits to some or all robots. It is best explained
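with an example:
    # /robots.txt file for http://webcrawler.com/
    # mail email@example.com for constructive criticism

    User-agent: webcrawler
    Disallow:

    User-agent: lycra
    Disallow: /

    User-agent: *
    Disallow: /tmp
    Disallow: /log
The first two lines, starting with '#', specify a comment.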
The first paragraph specifies that the robot called 'webcrawler'
has nothing disallowed: it may go anywhere.
The second paragraph indicates that the robot called 'lycra'
has all relative URLs starting with '/' disallowed.
Because all relative URLs on a server start with '/',
this means the entire site is closed off.
The third paragraph indicates that all other robots should not
visit URLs starting with /tmp or /log. Note the '*' is a special
token, meaning "any other User-agent"; you cannot use wildcard
patterns or regular expressions in either User-agent or Disallow
lines.
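If you want to check how a set of rules will be interpreted, Python's standard urllib.robotparser module can evaluate them. The sketch below writes out the three records described above (using /tmp and /log as in the text) and asks which robot may fetch which URL; the robot name "otherbot" and the sample paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The three records described in the text, written out as a /robots.txt file.
EXAMPLE = """\
# /robots.txt file for http://webcrawler.com/

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /log
"""


def build_rules(text):
    """Parse /robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_rules(EXAMPLE)

# 'webcrawler' has nothing disallowed.
print(rules.can_fetch("webcrawler", "/tmp/file.html"))
# 'lycra' is shut out of the entire site.
print(rules.can_fetch("lycra", "/index.html"))
# Any other robot is kept out of URLs starting with /tmp or /log only.
print(rules.can_fetch("otherbot", "/index.html"))
print(rules.can_fetch("otherbot", "/tmp/file.html"))
```

Note that Disallow values are matched as plain prefixes, so "/log" also covers a URL such as /logs/access.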
Two common errors:
- Wildcards are _not_ supported: instead of
  'Disallow: /tmp/*' just say 'Disallow: /tmp/'.
- You shouldn't put more than one path on a Disallow line (this may
  change in a future version of the spec).
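Both errors are easy to spot mechanically. Here is a small Python sketch of a checker that reports them; the function name and the messages are invented for this example, and it looks only at Disallow lines.

```python
def check_disallow_lines(text):
    """Return (line_number, message) pairs for the two common errors."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        value = value.strip()
        if "*" in value:
            problems.append((number, "wildcards are not supported"))
        if len(value.split()) > 1:
            problems.append((number, "more than one path on a Disallow line"))
    return problems
```

Run it over the contents of your /robots.txt before you install the file; an empty list means neither error was found.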
Probably... there are some ideas floating around. They haven't
made it into a coherent proposal because of time constraints,
and because there is little pressure. Mail suggestions to the
robots mailing list, and check the robots home page for work
in progress.
Sometimes you cannot make a /robots.txt file, because you don't
administer the entire server. All is not lost: there is a new standard
for using HTML META tags to keep robots out of your documents.
The basic idea is that if you include a tag like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
in your HTML document, that document won't be indexed.
If you do:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be parsed by the robot.
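From the robot's side, honouring these tags might look like the following Python sketch, which reads the ROBOTS META tag and reports whether the document may be indexed and whether its links may be followed. The class and function names are invented for the example, and accepting several comma-separated values in one CONTENT attribute is an assumption that goes beyond the two tags shown here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Looks for <META NAME="ROBOTS" ...> and records what it permits."""

    def __init__(self):
        super().__init__()
        self.may_index = True
        self.may_follow = True

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        directives = {part.strip().lower() for part in content.split(",")}
        if "noindex" in directives:
            self.may_index = False
        if "nofollow" in directives:
            self.may_follow = False


def robots_meta(html):
    """Return (may_index, may_follow) for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.may_index, parser.may_follow
```

A document without the tag yields (True, True), which matches the default of indexing the document and following its links.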
Some people are concerned that listing pages or directories in the
/robots.txt file may invite unintended access. There are two answers to
this.
The first answer is a workaround: You could put all the files you
don't want robots to visit in a separate sub directory, make that
directory un-listable on the web (by configuring your server), then
place your files in there, and list only the directory name in
the /robots.txt. Now an ill-willed robot can't traverse that directory
unless you or someone else puts a direct link on the web to one of
your files, and then it's not /robots.txt's fault.
For example, rather than:
    User-agent: *
    Disallow: /foo.html
    Disallow: /bar.html
do:
    User-agent: *
    Disallow: /norobots/
and make a "norobots" directory, put foo.html and bar.html into it,
and configure your server to not generate a directory listing for that
directory. Now all an attacker would learn is that you have a
"norobots" directory, but he won't be able to list the files in there;
he'd need to guess their names.
However, in practice this is a bad idea -- it's too fragile. Someone
may publish a link to your files on their site. Or it may turn up in a
publicly accessible log file, say of your user's proxy server, or maybe
it will show up in someone's web server log as a Referer. Or someone
may misconfigure your server at some future date, "fixing" it to show
a directory listing. Which leads me to the real answer:
The real answer is that /robots.txt is not intended for access
control, so don't try to use it as such. Think of it as a "No Entry"
sign, not a locked door. If you have files on your web site that you
don't want unauthorized people to access, then configure your server
to do authentication, and configure appropriate authorization. Basic
Authentication has been around since the early days of the web (and in,
e.g., Apache on UNIX it is trivial to configure), and if you're really
serious, SSL is commonplace in web servers.
If you mean a search service, check out the various directory pages
on the Web, such as
Exploring the Net
or try one of the Meta search services such as
Well, you can have a look at the list of robots; I'm starting to
indicate their public availability slowly.
In the meantime, two indexing robots that you should be able to
get hold of are Harvest (free), and Verity's.
See above -- some may be willing to give out source code.
Alternatively, check out the libwww-perl5 package, which has a simple example.
Lots. First read through all the stuff on
the robot page
then read the proceedings of past WWW Conferences, and the
complete HTTP and HTML spec. Yes; it's a lot of work :-)
Simply fill in a form you can find on
The Web Robots Database
1stWebmaster and 1stWebmasters are SM trademarks