Yahoo! Search Tips for Webmasters: Saving Bandwidth
If you run a public webserver, you have likely seen our webcrawler, named Slurp, in your logs. Its job is to find, fetch, and archive all of the page content that is fed into the Yahoo! Search engine. We continuously improve our crawler to pick up new pages and changes to your sites, but the flip side is that our crawler will use up some of your bandwidth as we navigate your site. Here are a few features that Yahoo!'s crawler supports that you can use to help save bandwidth while ensuring that we get the latest content from your site:
Gzipped Files: Our crawler supports gzipped files to reduce bandwidth requirements. On average, you will get a 75% savings when you enable compression for your site. Many webservers provide mechanisms for sending out HTML content in a compressed format (for example, mod_gzip for Apache). How much of your site's total bandwidth you can save will depend on how much of your content is compressed and how well it compresses. In general, static pages are good candidates for compression. Any user agent, whether it is a browser or a search engine spider, will let the webserver know it can process compressed content by adding "Accept-Encoding: gzip, x-gzip" to the header of its HTTP request. All major browsers support gzip-compressed content. Also, you should be happy to know that if our crawler has any trouble with a compressed page, it will re-fetch the uncompressed version. In practice, it does encounter a small percentage of decompression failures.
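As a rough illustration of the negotiation described above, here is a minimal Python sketch. The helper names accepts_gzip and compression_savings and the sample page are made up for this example; in practice a server module such as mod_gzip does this work for you, and the savings on your own pages will vary.

```python
import gzip


def accepts_gzip(accept_encoding: str) -> bool:
    """Return True if an Accept-Encoding header value lists gzip or x-gzip.

    Simplification (assumption): quality values such as "gzip;q=0" are
    ignored, so any mention of gzip counts as support.
    """
    codings = [part.split(";")[0].strip().lower()
               for part in accept_encoding.split(",")]
    return "gzip" in codings or "x-gzip" in codings


def compression_savings(body: bytes) -> float:
    """Return the fraction of bytes saved by gzip-compressing body."""
    if not body:
        return 0.0
    return 1 - len(gzip.compress(body)) / len(body)


# Hypothetical static page; repetitive HTML like this compresses very well.
page = b"<html><body>" + b"<p>Hello, crawler!</p>" * 200 + b"</body></html>"

if accepts_gzip("gzip, x-gzip"):
    print(f"gzip would save about {compression_savings(page):.0%} of this page")
```

Running compression_savings over a sample of your real pages gives a quick estimate of how close your site comes to the average figure quoted above.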
Smart Caching: Our crawler acts very much like a web cache. Once we grab your content, we hold onto it and keep a history of how it changes over time. We do this for a variety of reasons. One of them is so that we can use HTTP mechanisms designed to help reduce network usage when a client (that's us) repeatedly fetches a web file that has not changed. In particular, our crawler often sends the HTTP If-Modified-Since header (see section 14.25 of RFC 2616) when making repeat requests. If your webserver is set up to recognize this header, it will respond with a 304 HTTP status code instead of a 200 if the content is unchanged. The advantage of this is that a 304 doesn't include your page content, so it uses up less bandwidth than a full 200 response. Again, I'd like to emphasize that our crawler is conservative when it comes to ensuring it has the latest content; it won't use an If-Modified-Since request if it needs to re-fetch your content for any reason.
Most webservers will automatically handle If-Modified-Since requests for static content out of the box. Proper cache control of dynamic content (such as PHP pages and CGI scripts) can be tricky and is an advanced topic. In most cases, servers will play it safe by ignoring If-Modified-Since requests for dynamic content. There are several sites on the web that let you test the cacheability of your web pages. For the purposes of our crawler, pay attention to what they say about the Last-Modified value in your response header.
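For dynamic pages, the decision a server has to make can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name conditional_status is made up, last_modified is assumed to be a timezone-aware datetime your application already tracks, and a real handler would also need to send a Last-Modified header on its 200 responses.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Return 304 if the page is unchanged since the client's copy, else 200."""
    if if_modified_since is None:
        return 200  # plain request: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: play it safe and send the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # unchanged: no body needs to be sent
    return 200


page_changed = datetime(2005, 2, 10, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_status(page_changed, "Sat, 12 Feb 2005 08:00:00 GMT"))  # 304
print(conditional_status(page_changed, "Tue, 08 Feb 2005 08:00:00 GMT"))  # 200
```

Falling back to 200 whenever the header is missing or malformed mirrors the "play it safe" behavior described above: the worst case is some wasted bandwidth, never stale content.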
Crawl-Delay: There's one last trick you can use to help reduce the bandwidth requirements of your site. You can use a special robots.txt directive, crawl-delay, to reduce the speed at which our crawlers make requests to your site. This allows webmasters to manage their bandwidth without restricting content on their site from crawlers and is being used effectively by sites like Slashdot. A safe value for this would be a delay that would allow us to fetch every page on your site in about five days. So a five-second delay (crawl-delay: 5.0) would be fine for a site with 2,000 pages, but not for a site with 100,000 or more.
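The five-day rule of thumb is simple arithmetic, sketched below in Python. The function names are made up for this example, and the calculation assumes one request at a time with the fetch time itself ignored, so treat the results as estimates.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def full_crawl_days(pages: int, crawl_delay: float) -> float:
    """Days needed to fetch every page once at one request per crawl_delay seconds."""
    return pages * crawl_delay / SECONDS_PER_DAY


def max_safe_delay(pages: int, target_days: float = 5.0) -> float:
    """Largest crawl-delay (in seconds) that still covers the site in target_days."""
    return target_days * SECONDS_PER_DAY / pages


print(f"{full_crawl_days(2_000, 5.0):.2f} days")    # 0.12 days: well under five
print(f"{full_crawl_days(100_000, 5.0):.2f} days")  # 5.79 days: over the target
print(f"{max_safe_delay(100_000):.2f} seconds")     # 4.32 seconds at most
```

So a 100,000-page site would need a delay a little under five seconds to stay within the five-day window, and larger sites need proportionally smaller values.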
We hope you find these tips for safely saving hosting bandwidth useful, and we'd appreciate any feedback, questions or new ideas to further help improve how our crawler interacts with your web sites.
Dave Simpson
Yahoo! Search Engineering


Comments
Dave, thanks for posting this, it's very helpful.
Question about Slurp: I manage a bunch of web sites and I noticed that Slurp was getting a huge number of request errors. Upon closer inspection, I noticed that Slurp appears to be ignoring the HTML base tag. I hypothesize this because I see thousands of HTTP GET requests by Slurp that are erroneous but consistent with ignoring the base tag. (It appears Slurp is resolving relative URLs against the current page address rather than the address specified in the base tag.)
Can you confirm or deny this hypothesis? If this is the case, any word on when Slurp will honor base tags?
-- fas
Posted by: F. Andy Seidl | February 11, 2005 09:23 AM
Hi Andy,
We do honor base tags. We also make a habit of tracking down reports like this, in case we have made an error. So I've asked our tech support folks to contact you for specific examples.
Posted by: yahoo search engineering | February 12, 2005 04:28 PM
Craig,
I wouldn't worry about the results from the cacheability test. This just means each time we crawl your home page your server will send us the full content.
The hints in this entry were about how to save bandwidth... whether or not a particular page is cache friendly or whether or not you gzip compress your files has no bearing, ultimately, on how you are crawled or included in our index. They are just mechanisms you can use to help reduce the bandwidth we use to crawl your site.
Posted by: Dave | February 13, 2005 10:17 PM
hi Dave,
Thanks for clearing that up for me.
Posted by: Craig | February 14, 2005 07:00 AM
Hi Dave,
Our own site has over 100,000 pages. The Slurp bot visits every day but only hits us with about 150 hits. Meanwhile MSN takes at least 2,000 and Google over 4,000 a day. Both MSN and Google have most of our pages cached, but Yahoo has only a small number.
Is there anything we can do to help the bot hit us harder? We have the capacity, as we are on Rackspace hosted servers, so bandwidth is no problem at all.
Any advice you can give on this?
Posted by: Richard Clarke | February 14, 2005 03:41 PM
Dave,
Y! crawls our site impressively: about 8,000,000 hits in January and 3,500,000 hits this month (keep in mind we're halfway through the month).
However, site:mydomain shows only 15 pages in the index. And it's been the case for about 5 months now.
Any idea what is happening and why pages aren't getting into the index?
I hope you can answer.
Posted by: Aviv | February 16, 2005 12:02 PM
Richard and Aviv,
Thanks for your comments. Someone from our support team will get in touch with you to discuss the specifics of your individual cases.
-Dave
Posted by: Dave | February 16, 2005 11:26 PM
Thanks Dave,
I look forward to hearing from them.
Kind Regards
Richard
richc@redgoldfish.co.uk
www.redgoldfish.co.uk
Posted by: Richard Clarke | February 17, 2005 04:45 PM
Does this mean Slurp ignores ETag?
Posted by: Thomas Scholz | February 21, 2005 09:15 AM
ok, 1 simple question, how do I tell Slurp to stay off my site? they are nailing me to no end, I just want it to go away and robots.txt has not helped. I do not need Slurp, I want out, how do I opt out of this? thanks
Posted by: bot slave | February 23, 2005 06:32 PM
Thomas:
No, our crawler doesn't use ETags at this time. It's something we might consider adding, but most server implementations use ETags in addition to the Last-Modified header.
"bot slave":
You should send feedback to our support staff via the link in the blog posting ("feedback, questions or new ideas") if a robots.txt solution isn't working for you.
However, robots.txt is the way to opt out... you can disallow our bot altogether, though if the issue is rate of requests, use the crawl-delay tag instead. You may want to re-verify that the bot that is causing you problems is actually Yahoo Slurp as there are other Yahoo crawlers that have different agent names. Otherwise make sure you validate your robots.txt file (try a search for "robots.txt validator").
Posted by: Dave | February 25, 2005 11:50 AM
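One way to sanity-check a robots.txt file before relying on it is Python's standard urllib.robotparser module. The sketch below is only an illustration: the two sample files and the parse_robots helper are made up, and this parser is not Slurp's, so it can catch syntax mistakes but cannot confirm how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Option 1: opt out of Slurp altogether.
OPT_OUT = """\
User-agent: Slurp
Disallow: /
"""

# Option 2: keep being crawled, but more slowly.
# Note: Python's parser only accepts whole numbers here, so "5" rather than "5.0".
SLOW_DOWN = """\
User-agent: Slurp
Crawl-delay: 5
"""

opt_out = parse_robots(OPT_OUT)
slow_down = parse_robots(SLOW_DOWN)

print(opt_out.can_fetch("Slurp", "/index.html"))    # False
print(slow_down.can_fetch("Slurp", "/index.html"))  # True
print(slow_down.crawl_delay("Slurp"))               # 5
```

If the parser reports something other than what you intended, the file most likely has a syntax problem, such as a misspelled directive or a rule placed before any User-agent line.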
Hi Dave,
Not heard a thing yet from anyone in your sales support. Your bot is still producing a mere 150-odd hits a day at our site, with just 723 pages out of 100,000 taken in over a year!
So far this month: c. 3,000 Slurp bot hits vs. 140,000 Googlebot hits and 40,000 MSN bot hits. Even Ask Jeeves hits us harder than your bot, with 26,000 hits!
I just don't get it. We feature high in MSN and Ask and have various high positions in Google, yet we feature nowhere in Yahoo, while poor sites full of spam and doorway pages dominate the top of the sections where we should feature.
It certainly would be a start if the volume could somehow be turned up a bit when your bot visits us.
Any advice you could give us would be appreciated.
Kind Regards
Richard Clarke
richc@redgoldfish.co.uk
www.redgoldfish.co.uk
Posted by: Richard Clarke | February 26, 2005 09:59 AM
Hi Dave,
Sorry, I'm sounding like a right pain, but as yet we still don't see any action.
Bot action today: Google 23,000 hits, MSN 8,748, Ask 5,490, Inktomi Slurp 91 hits.
Even Alexa is hitting us harder than your Slurp bot.
If you can advise what we should try to do to get your Slurp bot to hit us harder, it would be much appreciated. We have everything on the pages (links, site map, etc.), yet currently only a small number of pages have been cached by your Slurp bot.
Thanks in advance
Richard Clarke
richc@redgoldfish.co.uk
www.redgoldfish.co.uk
Posted by: Richard Clarke | March 1, 2005 05:08 PM
Yahoo is generating various 404s on my site.
There are a lot of 404s for URLs like:
www.waltercruz.com/ler/graca
but the correct URL is www.waltercruz.com/morningstar/ler/graca
For some reason, Yahoo is not reading the morningstar part :(
Posted by: Walter | March 2, 2005 08:56 AM
Hello,
My site seems to be getting flooded with requests from "Yahoo-Newscrawler/3.8-RSS" and "Yahoo-Newscrawler/3.9 RSS". I was wondering if it responds to 'Crawl-delay' in the robots.txt file? I've tried and it doesn't seem to.
It also seems to ignore my disallow. This is what I have in my robot.txt file:
User-agent: Yahoo-NewsCrawler*
Crawl-delay: 5
Disallow: /servlets/
Thanks,
Ed
Posted by: Ed | March 7, 2005 01:50 PM
The Yahoo-Newscrawler does not yet support the crawl-delay option in robots.txt. It is a different crawler than the Slurp web crawler and is only used to crawl selected news sites.
Kind Regards,
Paul Loberg
Yahoo! News Search engineer
Posted by: Paul Loberg | March 8, 2005 10:53 AM
My sites are getting *lots* of malformed URL requests from Slurp--or an agent claiming to be Slurp; perhaps it is a spoof.
I set up some mod_rewrite rules to detect several classes of bad requests that would result from failure to recognize a base tag. For the past several weeks, I've been logging such requests. Over that time, the vast majority of such requests (over 95%) claim to be from Slurp. I have yet to see a request in this log from Google or MSN.
So, I suspect that either Slurp is sometimes missing base tags or my sites are being visited by an agent spoofing as Slurp.
I'd be happy to share details with a Yahoo! engineer to help track down the root cause. If there's a Slurp bug, it would be good to identify that. If it's a spoofing agent, at least we could start identifying IP addresses to ignore.
Posted by: F. Andy Seidl | April 2, 2005 01:56 PM
Hi,
I wonder about the order of robots.txt directives for slurp and other robots.
For about a week I have had a crawl-delay directive of 60 seconds in place, with Slurp coming in very regularly at 2 hits per minute. This is a lot better than the 30 requests per minute I had to deal with before, so obviously Slurp obeys the directive at least in part.
Now I wonder about my disallow directives:
If I have a special Slurp section in my robots.txt like this:
-----------------snip----------
User-agent: Slurp
Crawl-delay: 60
---------------snap-------------
then do I have to repeat all allow/disallow commands in this section or will slurp take them from the global "User-agent: *" section?
Also I wonder whether the order of
User-agent: * and User-agent: Slurp sections matters. Given that many spiders have size limits when parsing robots.txt, I think that the global section must be the first, with specific User-agent sections following.
Regards,
Ulrich
Posted by: Ulrich Babiak | April 11, 2005 12:34 AM
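Whether Slurp itself merges the two sections is not answered in this thread. Under the original robots.txt convention, the "User-agent: *" record is only a default for robots that match no other record, and Python's standard urllib.robotparser follows that reading. The sketch below shows how that one parser treats the situation described above; the /private/ path and the SomeOtherBot name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Global rules first, then a Slurp section that only sets a delay.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Slurp
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# In this parser, the Slurp section replaces the "*" section rather than
# adding to it, so the global Disallow no longer applies to Slurp.
print(parser.can_fetch("Slurp", "/private/page.html"))         # True
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))  # False
print(parser.crawl_delay("Slurp"))                             # 60

# Repeating the rule inside the Slurp section removes the ambiguity.
REPEATED = ROBOTS_TXT + "Disallow: /private/\n"
safe = RobotFileParser()
safe.parse(REPEATED.splitlines())
print(safe.can_fetch("Slurp", "/private/page.html"))           # False
```

Until the crawler's behavior is confirmed, repeating the allow/disallow lines in the Slurp section is the cautious choice, since it gives the same result under either interpretation.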
good article!!!
is there anything in robots.txt that can speed up the indexing process, as opposed to slowing it down (i.e. crawl-delay)?
Posted by: kris | April 14, 2005 05:36 AM