
April 14, 2008

Yahoo! Slurp 3.0

Over the past few weeks, we've been preparing for the latest version of the Yahoo! Search crawler with some infrastructure updates, which recently caused a variance in our crawl behavior.

With everything now in place, the rollout has officially begun. The new Yahoo! Slurp 3.0 recognizes the same user-agent and all robots.txt directives for 'Yahoo! Slurp,' though it'll identify itself as Slurp 3.0 in your web logs.

As the new software undergoes a phased rollout to our production crawlers over the next several weeks, you'll see the following changes:

    a) The crawlers will start crawling from a different and much smaller set of IP addresses, but they will still come from the crawl.yahoo.net domain. Any reverse DNS checks to identify our crawler will continue to work. Please note that if you're using IP-based recognition of our crawlers, you might see a drop in crawl/coverage from Yahoo! To avoid this problem, we strongly recommend that you move to reverse DNS-based identification of Yahoo! Slurp if you're using any other method. The current set of IPs will disappear from your web logs in the next several weeks.

    b) The crawlers will also publish a new user-agent, 'Yahoo! Slurp/3.0'. Existing robots.txt directives for 'Slurp' or 'Yahoo! Slurp' will continue to work, but if you have directives specific to 'Slurp/2.0', they won't be recognized by the new crawler (usage of the 'Slurp/2.0' user-agent is very rare on the web, so you are unlikely to be affected). We recommend specifying the shorter form, User-agent: Slurp. Check out "How do I prevent my site or certain subdirectories from being crawled?" on our Help page for more details.
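
The reverse DNS identification recommended in item (a) might look like the following minimal Python sketch. Only the crawl.yahoo.net domain comes from this announcement; the function names and the extra forward-confirmation step (resolving the returned hostname back to the same IP, a common precaution against spoofed reverse records) are assumptions made for illustration.

```python
import socket


def is_yahoo_crawler_host(hostname: str) -> bool:
    """Return True if hostname is a subdomain of crawl.yahoo.net."""
    # Normalize case and drop a trailing dot before comparing suffixes.
    return hostname.rstrip(".").lower().endswith(".crawl.yahoo.net")


def verify_slurp_ip(ip: str,
                    reverse_lookup=socket.gethostbyaddr,
                    forward_lookup=socket.gethostbyname_ex) -> bool:
    """Check a client IP with a reverse lookup plus forward confirmation.

    The lookup functions are parameters (hypothetical design choice) so the
    check can be exercised without network access.
    """
    try:
        hostname, _aliases, _addrs = reverse_lookup(ip)
    except OSError:
        # No reverse record, or the resolver failed.
        return False
    if not is_yahoo_crawler_host(hostname):
        return False
    try:
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:
        return False
    # Assumed extra step: the hostname must resolve back to the same IP.
    return ip in addresses
```

A server-side filter could call verify_slurp_ip(remote_address) once per new client IP and cache the result, since each check costs two DNS lookups.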

These changes will affect the main Yahoo! Web Search crawlers. Crawlers that similarly respect the Yahoo! Slurp directive but identify themselves more specifically, such as Yahoo! Slurp China and others, will not be impacted.
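
The user-agent behavior described above can be illustrated with a small Python sketch. It assumes case-insensitive substring matching between a robots.txt User-agent value and the crawler's name, which is a common convention and consistent with the cases listed in this post, but it is not a documented description of how Slurp actually matches directives. The function name directive_applies is hypothetical.

```python
def directive_applies(directive_agent: str, crawler_agent: str) -> bool:
    """Guess whether a robots.txt User-agent value covers a crawler.

    Assumption: a directive applies when its value is '*' or appears,
    ignoring case, anywhere inside the crawler's user-agent name.
    """
    token = directive_agent.strip().lower()
    if token == "*":
        return True
    # An empty directive value matches nothing.
    return bool(token) and token in crawler_agent.lower()


new_crawler = "Yahoo! Slurp/3.0"
for directive in ("Slurp", "Yahoo! Slurp", "Slurp/2.0"):
    print(directive, "->", directive_applies(directive, new_crawler))
```

Under this assumption, 'Slurp' and 'Yahoo! Slurp' both cover the new crawler while 'Slurp/2.0' does not, and a plain 'Slurp' directive also covers a more specifically named crawler such as Yahoo! Slurp China.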

Let us know if you have any questions or observe anything unusual.


Sharad Verma & Yoram Arnon
Yahoo! Search

Comments

I don't know if it's part of the migration, but I'm getting this:
User Agent: Wget/1.10.2 (Red Hat modified)<-------
Get String: www.yummyfood.net/index.php
Forwarded For: 67.195.4.142
Client IP: none
Remote Address: 67.195.50.114
Remote Port: 40084
Request Method: GET

I do not allow wget on my sites, so the bot is being blocked all the time.
What's up?

I'm seeing the same thing as handyszene in my logs. I've verified the requests, and they come from crawl.yahoo.net addresses using wget as the user agent.

HTTP_VIA:1.0 llf330007.crawl.yahoo.net:4080 (squid/2.6.STABLE1)
HTTP_ACCEPT:*/*
HTTP_USER_AGENT:Wget/1.10.2 (Red Hat modified)
HTTP_X_FORWARDED_FOR:67.195.4.140
&ALL_RAW=Cache-Control: max-age=31536000
Via: 1.0 llf330007.crawl.yahoo.net:4080 (squid/2.6.STABLE)

The crawler is indexing a site with robots.txt that disallows crawling!!!!

Now I'm blocking requests using the IP range. Please fix this bug!

I'm facing the same problem as Tarry. The same crawler indexes a robots.txt file and everything gets stuck.

Has Yahoo had an update over the past several days?

It appears that social sites are even more dominant on the SERPs than they were during the previous month.

Unfortunately, this bias towards social sites means excessive amounts of spammy sites getting high rankings - even pages that are redirects or no longer valid.


Look at this example, found while just shopping for Mother's Day and birthday gifts:
http://search.yahoo.com/search?p=replica+handbags


The concept of focusing on the social web has tremendous potential, but it must be tweaked and optimized so as to protect search quality.

Thank You for your time :-)

What about the Latin America market?
Should we be expecting some "radical" changes on the SERPs?

I appreciate the update. I'll change the directives that are specific to 'Slurp/2.0,' now that they won't be recognized by the new crawler.

Nice work, after some tweaking all seems to work fine, thanks!

I hope these changes will have a great impact on my SERPs.

Great job. Thanx a lot.

It's really great to read your blog; I've learned lots of tips. Thanks.

I've seen it in my logs already. Slurp is working!

thanks for that

Slurp is really great and it has really good functionality. Go on, Yahoo!

Thanks for the update.

This is my first visit to this blog, and it was great.

I have a site that has a depth of about 6 million pages to be indexed. I have uploaded sitemaps and I see the Slurp bot on the server every day, but it crawls like a snail. What could possibly be the reason? That bot has plenty of food on this site, so it should be running like an SOB. I have another site that has a depth of about 800k links, but Yahoo has never indexed more than 4k in the last 4 years. Any help is appreciated. Thanks, Paulie Walnuts.

Really great info. That explains why your crawlers looked a little bit "lazy" recently.

That is a real update!

Using Yahoo slurp is amazingly easy and convenient. A real fun tool....

Really easy and useful.

Thanks for the update.

Good work. After some tweaking, all seems to work fine. Thanks!
