Yahoo! Search Crawler, Slurp, has a new Address and Signature Card
- Posted June 5th, 2007 at 7:49 am by Yahoo! Search
- Categories: Search
As we mentioned a few weeks ago, we’ve been migrating our crawler, Yahoo! Slurp, over to the new domain at crawl.yahoo.net. As of today, the transition is complete and all machines crawling as Slurp are now in crawl.yahoo.net. You can see this change in your web server logs, where page accesses from inktomisearch.com are being fully replaced by accesses from crawl.yahoo.net. Note that this does not cover other Yahoo! crawlers, such as Yahoo! China, or other verticals, like Yahoo! Shopping, Yahoo! Travel, etc., which have their own user-agents.
Don’t fret though; there is no need to change your robots.txt file because the crawler user-agent is still Yahoo! Slurp. If you use IP-based filtering, there is no need to change that either, since the IP addresses from which we crawl remain the same. However, please ensure that your network or firewall setup does not keep crawl.yahoo.net out; otherwise we won’t be able to include your content in our results.
With this transition complete, we also encourage you to set up reverse DNS-based authentication of our crawler to ensure that no rogue bots masquerading as ‘Slurp’ visit your site. Here is how it works:
- 1. For each page view request, check the user-agent and IP address. All requests from Yahoo! Search utilize a user-agent starting with ‘Yahoo! Slurp.’
- 2. For each request with a ‘Yahoo! Slurp’ user-agent, start with the IP address (e.g. 18.104.22.168) and use a reverse DNS lookup to find the registered name of the machine.
- 3. Once you have the host name (in this case, lj612134.crawl.yahoo.net), you can then check if it really is coming from Yahoo! Search. The name of all Yahoo! Search crawlers will end with ‘crawl.yahoo.net,’ so if the name doesn’t end with this, you know it’s not really our crawler.
- 4. Finally, you need to verify the name is accurate. In order to do this, you can use Forward DNS to see the IP address associated with the host name. This should match the IP address you used in Step 2. If it doesn’t, it means the name was fake.
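The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not an official Yahoo! tool: the function name `is_yahoo_slurp` and the injectable `reverse_lookup` and `forward_lookup` parameters are assumptions made for the example, and the standard-library calls used here resolve IPv4 addresses only. The sketch matches ‘Yahoo! Slurp’ anywhere in the user-agent, which is looser than the "starting with" wording in Step 1, and it requires a leading dot before crawl.yahoo.net, which is slightly stricter than Step 3.

```python
import socket

# Suffix every genuine Slurp host name should end with (Step 3).
# The leading dot is an assumption of this sketch; it rejects names
# such as "evilcrawl.yahoo.net".
CRAWLER_SUFFIX = ".crawl.yahoo.net"


def is_yahoo_slurp(user_agent, ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if the request passes all four checks."""
    # Default resolvers use the system DNS; they can be swapped out for tests.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda name: socket.gethostbyname_ex(name)[2]

    # Step 1: only requests claiming to be Slurp are candidates.
    if "Yahoo! Slurp" not in user_agent:
        return False

    # Step 2: reverse DNS lookup, IP address -> host name.
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False

    # Step 3: the host name must end with the crawler domain.
    if not host.lower().rstrip(".").endswith(CRAWLER_SUFFIX):
        return False

    # Step 4: forward DNS lookup, host name -> IP addresses; the
    # original address must be among them, or the name was faked.
    try:
        addresses = forward_lookup(host)
    except OSError:
        return False
    return ip in addresses
```

Calling `is_yahoo_slurp(user_agent, remote_ip)` with the defaults performs live DNS queries, so in practice you would likely cache the result per IP address rather than resolve on every request.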
If you find a request with a false DNS signature, so that you know it is not ‘Yahoo! Slurp’ calling, you can manage access to your content appropriately. By simply returning an HTTP error, you can block the impostor from seeing your content.
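As a rough illustration of that blocking decision, the sketch below picks a response status for a request. The function name `status_for_request` and the choice of 403 Forbidden are assumptions for the example; the post only says "an HTTP Error" and does not name a status code. The `slurp_verified` flag stands for the outcome of whatever reverse and forward DNS check you have in place.

```python
def status_for_request(user_agent, slurp_verified):
    """Pick an HTTP status for a request (hypothetical helper).

    slurp_verified is the result of your own DNS verification
    for the request's IP address.
    """
    claims_slurp = "Yahoo! Slurp" in user_agent
    # A request that claims to be Slurp but fails DNS verification
    # is an impostor: refuse it. 403 is one reasonable choice.
    if claims_slurp and not slurp_verified:
        return 403
    # Everything else is served normally.
    return 200
```

Requests from ordinary browsers and other crawlers are unaffected; only traffic that claims the Slurp user-agent without passing verification is refused.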
We highly recommend that you use this mechanism to manage access for our crawler, instead of using IP address-based access control. This keeps your setup robust against network and data center changes.
If you have any other feedback, please let us know.
UPDATE: While we’ve confirmed that all our production crawlers that are crawling under the ‘Slurp’ user-agent have now been migrated over to crawl.yahoo.net, we wanted to clarify that some test crawl machines (that are used to test various ongoing improvements to our algorithms) may continue to crawl from inktomisearch.com. The few stragglers may still leave ‘inktomisearch.com’ in your web server logs, but rest assured, we intend to move everyone over to the new domain.