
December 06, 2004

Tim Converse Interview, part 2

As promised, this is the second half of the interview with Tim Converse.

JQ: There's been this sort of continuum evolving since the Alta Vista days, where we started with primarily static content and then the next generation was catalogue type shopping like Amazon. Now it's this whole micro-publishing independent thing. What do you consider the biggest challenge right now in terms of classifying content?

A: One of the big challenges for us is just understanding what's out there. It's almost like astronomy where you're just trying to catalogue all the different things out there and track how they're growing, to some extent. So it's important for us to know just how many sites and kinds of documents there are so that we can catch trends.

JQ: You talked about comprehensiveness. There's this perception that there's the web that most of us see and then this dark web: the stuff that the crawlers don't reach. How do we try to get that data into the index? Are there barriers that webmasters put up that they should avoid to help us better index the content?

A: At its simplest, webmasters aren't aware of robots.txt and its uses. Redirection can also be problematic if people create content by creating lots of domains or hosts, so we encourage people to organize their sites into many documents before they get a new host.

And of course, there's also the issue of crawler traps. Some people create them intentionally, but much more often they've created them unintentionally....

JQ: ...and a crawler trap is...

A: A crawler trap is something where you crawl a page and it has a link, usually within the same site, that's dynamically created. You follow that link and it has another analogous link that's dynamically created. Often, just because people make mistakes, each link tacks on another directory that doesn't exist, which takes you back to an automatically generated error page that has the same link. So you can fall into traps where there are an infinite number of pages that don't have any content.
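To make the pattern concrete, here is a minimal sketch of how a crawler might flag URLs that look like the trap described above. It is only an illustration: the function name and both cutoffs are assumptions, and nothing here describes how Yahoo!'s crawler actually works.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_DEPTH = 12    # assumed cutoff: paths deeper than this look suspicious
MAX_REPEATS = 3   # assumed cutoff: the same directory name this many times

def looks_like_trap(url: str) -> bool:
    """Heuristic only: flag very deep paths or paths that repeat a segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    counts = Counter(segments)
    return any(n >= MAX_REPEATS for n in counts.values())
```

A crawler using a check like this would stop following links once the path keeps growing by the same nonexistent directory, instead of fetching the same error page forever.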

Another thing people can do to help us, and this is sort of geeky, is don't make "page not found" pages that return a status 200.

JQ: I was just about to ask that. 404 pages, back in the day, were these ugly grey things with block text that all looked the same, and now they're done up to look like regular pages to be more appealing to users.

A: We do actually have ways of detecting that but it's a lot easier for us if a web server just says, "this page doesn't exist" as opposed to creating a nice page for the user that to a crawler looks like any other page. In general, if the server tells us 404, then we discard it.
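A friendly error page and a correct status code are not mutually exclusive. The sketch below, using Python's standard `http.server`, serves a styled "not found" page while still answering 404. The page table, the HTML and the names `response_for` and `Handler` are hypothetical.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical site content, keyed by request path.
KNOWN_PAGES = {"/": b"<h1>Home</h1>"}
FRIENDLY_404 = b"<h1>Sorry, we couldn't find that page.</h1>"

def response_for(path: str):
    """Return (status, body): users get a nice page, crawlers get a real 404."""
    body = KNOWN_PAGES.get(path)
    if body is None:
        return 404, FRIENDLY_404
    return 200, body

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = response_for(self.path)
        self.send_response(status)  # the status line carries the 404, not the HTML
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, one could pass `Handler` to `http.server.HTTPServer(("127.0.0.1", 8000), Handler)` and call `serve_forever()`.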

YQ: I worked for a company that used CIDs instead of cookies to follow users through the site and it turned out to be a disaster. We went from having pretty much every page indexed to hardly any. So what about CIDs and how they affect the crawlers?

A: If you have differences in the URL that don't actually make a difference in the site, that can be hard for us to untangle. We're getting better at it. One of the scenarios you're talking about there would just create a lot of duplicates for us. So it's nicer for us if we have one URL per actual piece of content, but we understand that you're not designing this just for us. And we obviously do a lot of duplicate detection--actually, we do duplicate detection in a couple of different ways: finding out if documents are the same, and finding out if sites are mirrors of each other.
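As a rough illustration of both ideas, the sketch below strips ID-style parameters from a URL so that every visitor's copy maps to one address, and fingerprints page text so identical documents can be spotted. The parameter names in `SESSION_PARAMS` and both function names are assumptions, and exact hashing is a deliberately simple stand-in, not a description of Yahoo!'s duplicate detection.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names for per-visitor tracking parameters.
SESSION_PARAMS = {"cid", "sid", "sessionid", "phpsessid"}

def canonical_url(url: str) -> str:
    """Drop tracking parameters and the fragment; sort what remains."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in SESSION_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )

def content_fingerprint(html: str) -> str:
    """Hash the page after collapsing whitespace and case (exact-match only)."""
    normalized = " ".join(html.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With this, `http://example.com/item?id=7&cid=abc123` and `http://example.com/item?id=7&cid=zzz999` both reduce to `http://example.com/item?id=7`.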

JQ: This question came up today on a mailing list that I'm on. The concern for this particular company is that they want to move their site to a new domain but they don't want to become invisible for the next six months or year or however long it'll take for people to point to their new website. What can we tell people like that?

A: We can tell them that in the future, if you actually want to move your site, you want to use a 301 redirect, and we will do as much of the right thing with it as we can.
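For readers who have not set one up, a 301 is simply a response whose status line says "Moved Permanently" and whose `Location` header names the new address. Here is a minimal sketch as a WSGI application meant to run on the old host; `NEW_ORIGIN` and `moved_app` are hypothetical names, and a real site would more likely configure this in its web server.

```python
NEW_ORIGIN = "http://www.new-example.com"  # hypothetical new domain

def moved_app(environ, start_response):
    """Answer every request on the old host with a 301 to the same path on the new one."""
    path = environ.get("PATH_INFO") or "/"
    query = environ.get("QUERY_STRING", "")
    location = NEW_ORIGIN + path + ("?" + query if query else "")
    start_response(
        "301 Moved Permanently",
        [("Location", location), ("Content-Length", "0")],
    )
    return [b""]
```

Redirecting each old path to its matching new path, rather than sending everything to the new home page, keeps the one-to-one mapping between old and new documents.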

YQ: What actually happens there? I've heard of companies who have used 301 redirects and yet their old pages continued to show up in the search engines anyway. Why is that?

A: The underlying problem is that people out there haven't changed their links and search engines do pay attention to links.

I can't give you a date, but we're changing how we deal with redirects. The thing about redirects is that everyone thinks it's obvious how a search engine should treat them, and the obvious answer is not really that helpful. Any policy you develop with redirects is going to make someone unhappy, but with what we're about to roll out we will pay better attention to 301 redirects, and the exact problem you're talking about should occur less.

[In the time since we met with Tim, the team has rolled out a fix for 301/302 redirects. Documents will be handled by the new redirect policy as they are re-crawled and re-indexed and webmasters will start to see many of the sites change in the next couple of weeks. The index should be fully propagated within a month. See Tim Mayer's Webmaster World presentation for details.]

YQ: You mentioned earlier that you'd just bought a piano. I read on your website that when you were eight, you ran away from home to escape piano lessons. Is that true and did you just hate piano back then?

A: Yeah, that's true but I never hated piano, just lessons. Now that I've bought the piano, I'm practicing again. Right now I'm learning a classical piece by Bach. I'm a slower learner now though--I should have stuck with it when I was eight. But when it comes to the kind of music I actually listen to, I like rock, hip hop and classical. I'm not too into jazz.

JQ: Which do you listen to when you're programming?

A: (laughs) I don't actually. I don't deal with headsets well.

JQ: In terms of freshness, there's a lot of talk about how quickly an RSS-watching engine will pick up new content as opposed to getting stuff into Yahoo! Search. The question they ask is why can't we just ping Yahoo! and get the crawler over here?

A: Well, that's not the only source of latency or possible delay. We build very large databases and it's kind of a large industrial process involving lots and lots of machines. There's some delay between the last document we heard about and the time we actually put something live. In some cases that delay matters more than any delay in finding out if something has changed. We also pay a lot of attention to whether something has changed. But I think you'll see us getting fresher and fresher.

YQ: When you're looking for things that are changing on the page, what are you specifically talking about? I'm sure it's not enough to just change a hidden date stamp in a footer.

A: Yes, it's more than that. Most of the web is just static, even without there being date stamps. We do have a more nuanced notion of what it means to change, so we can tell a trivial change from a significant one.
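One simple way to picture the distinction is to compare only the visible words of two versions of a page, after discarding things like date stamps, and call the change significant when too few words still match. This is purely a sketch: the regular expressions, the 0.9 threshold and the function names are assumptions, not the method Tim is describing.

```python
import re
from difflib import SequenceMatcher

TAGS = re.compile(r"<[^>]+>")
DATE_LIKE = re.compile(r"\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b")  # e.g. 2004-12-06
SIGNIFICANT_BELOW = 0.9  # assumed similarity threshold

def visible_text(html: str) -> str:
    """Strip tags and date-like strings, then collapse whitespace."""
    text = TAGS.sub(" ", html)
    text = DATE_LIKE.sub(" ", text)
    return " ".join(text.split())

def changed_significantly(old_html: str, new_html: str) -> bool:
    """Compare word sequences; a low similarity ratio counts as a real change."""
    old_words = visible_text(old_html).split()
    new_words = visible_text(new_html).split()
    return SequenceMatcher(None, old_words, new_words).ratio() < SIGNIFICANT_BELOW
```

Under these assumptions, a page whose only difference is the date in its footer is treated as unchanged, while a rewritten body is not.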

YQ: You mentioned that you were between the scientist and the programmer. What do you think you uniquely bring to what you're doing?

A: Well, I'm hoping that one side thinks I'm the other side and that the other side thinks I'm the one side. (laughs) I'm hoping I've got them all fooled, but I have this feeling that I don't. But no, I think what I uniquely bring is that I can talk to both sides. I've been a programmer, and I've trained to some extent in the direction that the scientists have trained; I went to grad school for a long time in related topics. So increasingly, though I never thought I would play this role and it's not what I envisioned, what I'm bringing to it is that I can talk to a lot of different sides and I can prioritize the stuff. And my grasp of the technology's not so bad either.

Yvette Irvin
Y! Profiler