Search Engines: costs vs. benefits
For a non-commercial information provider, the cost of search engines is the bandwidth their robots consume on a site, which
has to be paid for. The benefit is the user page reads that result. Search bots have become so numerous in recent years
that they are a major drain on site resources while most provide almost no benefit in return. Identified bots accounted for 45%
of my site hit and bandwidth costs by July 2007. Yahoo is by far the worst bot abuser.
A solution available to all users is to use a robots.txt file to exclude all bots but Google. Bot hits decrease to a small fraction
of those to an unprotected site, and visitor count decreases only slightly.
Here were some of my search engine numbers when I started this project in July 2007:
Active html pages: 293
Total page hits: 42870
User page views/mo (known bots removed): 23387
search engine | bot hits/mo | user hits/mo
42% of my user views came from Google, and you can't beat their benefit:cost ratio. But all the other engines above except
Gigablast were a net loss to me.
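As a rough check on these figures, the bot share and a benefit:cost ratio can be worked out directly. This is only a sketch: it assumes the total page hits figure is monthly and includes bot hits, and it defines benefit:cost as user views divided by bot hits, which is my reading of the text rather than a stated formula. The Gigablast totals are the ones quoted later on this page.

```python
def bot_share_pct(total_hits: int, user_views: int) -> float:
    """Percentage of page hits left over once user views are subtracted."""
    return (total_hits - user_views) / total_hits * 100


def benefit_cost(user_views: int, bot_hits: int) -> float:
    """User views returned per bot hit (assumed definition of benefit:cost)."""
    return user_views / bot_hits


if __name__ == "__main__":
    # July 2007 figures from the list above
    print(round(bot_share_pct(42870, 23387), 1))
    # Gigablast totals over the study period, as quoted in its entry
    print(round(benefit_cost(2067, 2353), 2))
```

The first figure comes out close to the 45% bot share quoted at the top of the page.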
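How do you get rid of bots that make excessive requests? The simplest way is to set up a
robots.txt file in your root server directory containing a User-agent line that names each unwanted agent, followed by the
line "Disallow: /". Naming Yahoo's main agents this way gets rid of them. Repeat for other bots as appropriate. If, after
reading this page, you want to get rid of the whole lot except Google, as I have, disallow everything for "User-agent: *" and
give Googlebot its own record with an empty Disallow line.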
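A minimal sketch of a Google-only policy, checked with Python's standard urllib.robotparser module. The robots.txt text here is an assumption built from standard robots.txt syntax, not a copy of the file used on this site, and the helper name allowed is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed policy: Googlebot may fetch everything, every other robot nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str = "/index.html") -> bool:
    """Return True if the policy above lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Slurp", "msnbot"):
        print(agent, allowed(agent))
```

This only confirms what a well-behaved robot should conclude from the file. Bots that ignore robots.txt are unaffected.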
Data collection method
I've been archiving page logs since 1 March 2005. For each GET *.html entry in my logs, I search first for bots (case
insensitive) with strings similar to these:
Ask: teoma OR minefield
MSN: msnbot OR msrbot OR msiecrawler OR lanshanbot
Naver: naverbot OR yetibot
Yahoo: slurp OR yahooseeker OR Yahoo-MMCrawler
miscellaneous bots: bot.htm OR bot" OR crawl OR spider
Once these entries are removed, I search for a user hit from the engine:
Google: .google. OR earthlink OR aolsearch
Ichiro: .goo. OR .nttr.
MSN: msn.co OR search.live
Yahoo: yahoo OR alltheweb OR altavista.com
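The two passes above can be sketched as follows. The search strings are copied from the lists above; everything else is an assumption made for illustration: the Apache "combined" log format, the names LOG_RE and classify, and the choice to test bot strings against the quoted user-agent field.

```python
import re
from typing import Optional, Tuple

# Search strings from the lists above, lower-cased for case-insensitive matching.
BOT_PATTERNS = {
    "Ask": ["teoma", "minefield"],
    "MSN": ["msnbot", "msrbot", "msiecrawler", "lanshanbot"],
    "Naver": ["naverbot", "yetibot"],
    "Yahoo": ["slurp", "yahooseeker", "yahoo-mmcrawler"],
    "miscellaneous": ["bot.htm", 'bot"', "crawl", "spider"],
}
USER_PATTERNS = {
    "Google": [".google.", "earthlink", "aolsearch"],
    "Ichiro": [".goo.", ".nttr."],
    "MSN": ["msn.co", "search.live"],
    "Yahoo": ["yahoo", "alltheweb", "altavista.com"],
}

# Assumed log layout (Apache "combined"): ... "GET path proto" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'"GET (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def classify(line: str) -> Optional[Tuple[str, str]]:
    """Return ("bot", engine), ("user", engine or "other"), or None if not a GET *.html hit."""
    m = LOG_RE.search(line)
    if m is None or not m.group("path").split("?")[0].lower().endswith(".html"):
        return None
    # Pass 1: bots. The closing quote is re-attached so the 'bot"' string can match.
    agent = m.group("agent").lower() + '"'
    for engine, needles in BOT_PATTERNS.items():
        if any(n in agent for n in needles):
            return ("bot", engine)
    # Pass 2: user hits, identified by the referer field.
    referer = m.group("referer").lower()
    for engine, needles in USER_PATTERNS.items():
        if any(n in referer for n in needles):
            return ("user", engine)
    return ("user", "other")
```

Checking bots first matters: a crawler entry that also mentions yahoo in its agent string would otherwise risk being counted as a Yahoo user hit.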
This method relies on the referer field of the user log. It has limitations: people can turn referers off in some browsers, proxies or
content filters can interfere with the referer field, and page caching reduces recorded user views but not bot hits. Another problem
with pinning down searchbot results is metasearchers, which check several databases to come up with their results and give
preference to pages that are in many of them. A final problem is that some bots, open source ones in particular, are run
independently by many users; identification of user views is often not possible with these.
So, results of this method have to be investigated for consistency. To start, here are the referers that had an average of
10 bot+user hits/mo or more over the study period, in order of their total activity on my site:
- Google is so well known that its name has become a verb. Hitting
each of my pages an average of twice per month since Mar05, it delivered a consistent 45% of my user views. In addition, many
sites have a "powered by Google" search box. The Best Buy among robots!
- Yahoo is by far the worst bot abuser there is, in many
months hitting all of my pages every day while delivering only 3% of my users. (Lanshanbot is their Chinese language bot.)
Turned off Aug07.
- Microsoft's search engine is proving far less popular than its operating systems, delivering only 1% of my users while its
bots hit me 7 times that often. Turned off Sep07; turned on again mid-Feb08 to separate its effects from Yahoo, turned off for
- Wikipedia delivered 11% of my page viewers. It doesn't use bots, but relies
on page writers to manually locate and enter external links related to their subject.
- Ask/Jeeves bot hits me ten times as often as its
users do, but contributes to several metasearchers.
- Voila has been hopefully checking 55% pages/mo for French content since Mar05 and
came up with only 10 isolated user hits the whole time, presumably from bilingual searchers. Turned off Jan08.
- Seekport is a German site that has checked 50% pages/mo since Mar05, looking for pages in 7 European
languages, with precisely one user hit resulting over the entire period. It now seems to have vanished. It didn't obey robots.txt.
- IRL-crawler is a Texas A&M research project studying the topology of
the Internet. It hit 37% pages/mo since Mar05 and delivers no users.
- Majestic is a distributed searcher designed for
users with broadband connections. It's been hitting 33% pages/mo since Mar05. User hits cannot be identified.
- Ichiro is a Japanese language searcher. It has hit 27% pages/mo since Mar05 and
has returned 10 hits over the entire period.
- Picsearch is an image searcher that hit 29% pages/mo since Apr05 to
identify them. Turned off Mar08.
- Lucene provides the open source searcher Nutch. The users who leave it in
default configuration have hit a total of 30% pages/mo since Mar05.
- Gigablast is a general search engine that hit 28%
pages/mo since Mar05. Its 2353 bot hits over the period don't quite match its 2067 user views.
- Twiceler was an experimental bot that has hit 43% pages/mo since
May05. It delivers no users, so was turned off Mar08. It seems to be gone now.
- Heritrix is the Internet Archive's web crawler; it has hit 20% pages/mo since Mar05.
- Discovery is a California company that provides no information about
itself. Its bot hit me 1400 times since Sep07; it seems to be gone now.
- Accoona was a general search engine that hit 19% pages/mo since Sep05
and returned 7 user views over the entire period.
- Naver is a Korean language searcher. It has hit 12% pages/mo since Mar05 and has
slowly increased its user views to the current 28/mo.
- TurnItIn provides anti-plagiarism services to educational
institutions. It has hit 16% pages/mo since Mar05 to collect data to support this. I support it too.
- Cazoodle appeared Nov06; it collects "information for
next-generation Web search and integration solutions". It hits 26% pages/mo and delivers no users.
- OmniExplorer has hit 15% pages/mo since Mar05. If you are selling things they list, they might be
useful; otherwise they aren't. They stopped hitting me after I wrote for information.
- Shopwiki appeared in Jul06 and is building a shopper's index - if you
are selling things you want them, otherwise they are no use. They hit 18% pages/mo until I turned them off.
- Larbin is Linux freeware that anyone can
run. Fortunately, most people don't have the resources to do much damage with it; total hits since
Mar05 are 11% pages/mo. Identification of user hits is not possible.
- Convera is a specialist searcher provider for professionals. It appeared May05 and
hits 9% pages/mo.
- BruinBot is a project of the Computer Science department of UCLA. It hit 75%
- Krugle has searched for open source code for developers since Apr06, hitting 14%
pages/mo. If you have code to distribute, it's useful, otherwise it's no use.
- Sensis is an Australian searcher that has hit 8% pages/mo since Apr05, but
provided no user hits.
- FAST provided custom search solutions software. In default configuration its
users hit 10% pages/mo since Sep05. Default users are not identifiable.
- SevenTwentyFour provided link checking services to customers. It has hit
7% pages/mo since Mar05 and is now gone.
- Local.com specialises in searches for products and services in local areas in the
USA and UK. Since Sep05 it has hit 9% pages/mo. If you supply products or services in covered areas that's great, but since I
don't, it's useless.
- Factbites presents KWIC-type results. It has hit 10% pages/mo
since Sep06, but only delivered 15 hits over the whole period.
- Sproose is a searcher that has hit 9% pages/mo since Mar06. It features a user
voting system, and zero user views to my site.
- IBM's research crawler hit 10% pages/mo Mar05-Jan07.
- Ilial was a startup company that didn't supply any users. It hit 8% pages/mo since Sep06 and now seems defunct.
- NetSeer appeared in Nov07 by hitting all of my pages twice, and said it was building
a marketing index, apparently to focus future ads. They have removed the info page on their crawler (12Dec07)
- Shim is a University of Tokyo research project. It hit 15%
- CounterStrike is an online computer game that hit 6% pages/mo
Jun05-Sep05; all are in effect user hits. It doesn't obey robots.txt.
- UBI is a project of the Nagaoka University of Technology, Japan, to survey the
use of languages on the internet. It hit all my pages Mar06 and Dec06.
- NetResearchServer is a customisable search engine sold to users. It hit 5%/mo of my pages Mar05-Sep07 and now seems
defunct. Identification of user hits is not possible.
- Scirus searches scientific journals online. It hit 75% of my pages Oct05 and
Feb06 and didn't find any.
- MSR-ISRCCrawler is a Microsoft bot that looks through music files for people who are looking for copyright violations. It
ignores robots.txt and has hit 20%/mo of pages since
There are hundreds of people who have set up open source crawlers under hopeful names, who have rapidly discovered the
huge resources needed for even the most focussed database, and who have vanished. Looking for crawl, bot.htm, bot" and
spider after all other checks gets almost all of them. Dumped into a file, a refer field sort enables new ones to be spotted and
Yahoo was disallowed Aug07 (the graph below shows why!), MSN Oct07.
The first result: site hit and bandwidth costs due to bots went from 45% in Jul07 to 27% over Nov07-Jan08. Here are the
expectations for Nov07-Jan08 from a line fit Mar05-Aug07 and the actual:
quantity | expected | actual | change (%)
total user hits | 33910±1980 | 28300 | -16.5
The expected loss in viewer hits from blocking Yahoo and MSN is 3.7%. The actual loss seems to be about 16% from the
total hit activity, 11% based on Google and 2% on Wikipedia. The mean of these three is 9.7%. The loss in user views seems
about 3x that expected.
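The arithmetic behind these figures can be reproduced as below. This is a sketch using only numbers quoted in this section; the function names are made up for illustration, and the three loss estimates are taken at their rounded values from the text.

```python
def percent_change(expected: float, actual: float) -> float:
    """Change of the actual value relative to the expected one, in percent."""
    return (actual - expected) / expected * 100


def mean(values: list) -> float:
    return sum(values) / len(values)


if __name__ == "__main__":
    # Expected (line fit Mar05-Aug07) and actual total user hits for Nov07-Jan08
    change = percent_change(33910, 28300)
    # Loss estimates quoted above: total hit activity, Google-based, Wikipedia-based
    mean_loss = mean([16, 11, 2])
    # Ratio of the mean observed loss to the 3.7% expected from blocking Yahoo and MSN
    ratio = mean_loss / 3.7
    print(round(change, 1), round(mean_loss, 1), round(ratio, 1))
```

The ratio works out to roughly 2.6, which the text rounds to "about 3x".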
Metasearch engines (ones that check many databases) require almost no investment to set up, so come and go. Dogpile is
the only significant one I have been able to identify on my site (0.2% of viewers). Only three of the 27 I have located hit more
than 100 pages each over the 34 month period; most had fewer than 10. They are minor players.
18% of user hits came from my own pages, so 84% of user hits have been traced to 5 referring sources. Possible modifications
of Yahoo+MSN refer fields can only account for ±0.6% of any change, so most people don't do it.
MSN was re-enabled mid-Feb08 to see if the effects of MSN and Yahoo could be separated. The result: all the unexpected loss
in user hits was due to Yahoo. It appears that only a third of user views due to Yahoo actually came from a Yahoo site or from
a metasearcher that says it uses Yahoo. So Yahoo's actual benefit:cost is about 1:4.
As a final experiment, I allowed Google, Turnitin and archive.org and disallowed everything else mid Nov08 as described
above. The result - a gratifying loss of useless bot hits and almost no loss of users. I recommend it.
Unfortunately, this project was blown out of the water by a dishonest registrar who stole my domain name early November
2009. I hope the results to this point give you some useful information on how many bots there are out there that aren't worth
their keep, and how to deal with them.
user-agents.org are useful resources for
identifying searchbots, but haven't gone into search results.
has useful discussions on them.
Colossus has large lists of search
SES has an out-of-date note on the limited
coverage of individual engines.