Tim Converse Interview, part 2
As promised, this is the second half of the interview with Tim Converse.
JQ: There’s been this sort of continuum evolving since the
Alta Vista days, where we started with primarily static content and
then the next generation was catalogue type shopping like Amazon. Now
it’s this whole micro-publishing independent thing. What do you
consider the biggest challenge right now in terms of classifying
content?
A: One of the big challenges for us is just understanding
what’s out there. It’s almost like astronomy where you’re just trying
to catalogue all the different things out there and track how they’re
growing, to some extent. So it’s important for us to know just how
many sites and kinds of documents there are so that we can catch
trends.
JQ: You talked about comprehensiveness. There’s this
perception that there’s the web that most of us see and then this dark
web: the stuff that the crawlers don’t reach. How do we try to get
that data into the index? Are there barriers that webmasters put up
that they should avoid to help us better index the content?
A: At its simplest, webmasters aren’t aware of robots.txt and
its uses. Redirection can also be problematic if people create
content by creating lots of domains or hosts so we encourage people to
organize their sites in many documents before they get a new host.
And of course, there’s also the issue of crawler traps which some
people do intentionally but much more often, they’ve unintentionally
created crawler traps….
JQ: …and a crawler trap is…
A: A crawler trap is something where you crawl a page and it
has a link, usually in the same site that’s dynamically created and
then you follow that link and it has another analogous link that’s
dynamically created and often, just because people make mistakes,
you’re attaching on another directory every time which doesn’t exist
and takes you back to an automatically generated error page which has
the same link. So you can fall into traps where there are an infinite
number of pages that don’t have any content.
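[Editor's note: the trap Tim describes, where each followed link tacks another directory onto the URL, can be illustrated with a simple heuristic. This is a minimal sketch and not Yahoo!'s method; the function name and both thresholds are assumptions chosen for illustration.]

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=12, max_repeats=3):
    """Flag URLs whose path is very deep or repeats one segment many times.

    Both thresholds are illustrative assumptions, not values from the interview.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    if len(segments) > max_depth:
        return True
    # A path like /shop/shop/shop/... suggests a relative link that keeps
    # appending the same directory to an auto-generated page.
    return max(Counter(segments).values()) >= max_repeats
```

A crawler could call this check before queueing a newly discovered link and skip, or deprioritize, anything it flags.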
Another thing people can do to help us, and this is sort of geeky, is:
don’t make “page not found” pages that return a status 200.
JQ: I was just about to ask that. Back in the day, 404 pages
were these ugly grey things with block text that all looked the same
and now they’re done up to look like regular pages to be more
appealing to users.
A: We do actually have ways of detecting that but it’s a lot
easier for us if a web server just says, “this page doesn’t exist” as
opposed to creating a nice page for the user that to a crawler looks
like any other page. In general, if the server tells us 404, then we
discard it.
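[Editor's note: a friendly error page and an honest status code are not mutually exclusive. The sketch below is a minimal WSGI application that serves a styled "not found" page while still sending 404. The page table and HTML are hypothetical placeholders.]

```python
# Hypothetical site content, keyed by request path.
PAGES = {"/": b"<h1>Home</h1>"}

NOT_FOUND_BODY = (
    b"<h1>Sorry, we can't find that page</h1>"
    b"<p><a href='/'>Back to the home page</a></p>"
)


def app(environ, start_response):
    """Serve known pages with 200; serve a friendly page with a real 404."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is None:
        # Nice page for people, truthful status line for crawlers.
        status, body = "404 Not Found", NOT_FOUND_BODY
    else:
        status = "200 OK"
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Any WSGI server can host this, for example `wsgiref.simple_server.make_server("", 8000, app)` from the Python standard library.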
YQ: I worked for a company that used CIDs instead of cookies
to follow users through the site and it turned out to be a disaster.
We went from having pretty much every page indexed to hardly any. So
what about CIDs and how they affect the crawlers?
A: If you have differences in the URL that don’t actually make
a difference in the site, that can be hard for us to untangle. We’re
getting better at it. One of the scenarios you’re talking about there
would just create a lot of duplicates for us. So it’s nicer for us if
we have one URL per actual content but we understand that you’re not
designing this just for us. And we obviously do a lot of duplicate
detection–actually, we do duplicate detection in a couple of
different ways. Finding out if documents are the same; finding out if
sites are mirrors of each other.
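[Editor's note: one way to get closer to "one URL per actual content" is to strip parameters that only track the visitor before comparing URLs. The sketch below assumes the tracking ID travels in a query parameter named `cid` or `sessionid`; those names are guesses for illustration, and this is not a description of Yahoo!'s duplicate detection.]

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of parameters that identify the visitor, not the content.
TRACKING_PARAMS = {"cid", "sessionid"}


def canonical_url(url):
    """Drop tracking parameters and normalize the rest of the URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    kept.sort()  # parameter order should not create a "new" URL
    return urlunsplit((
        parts.scheme,
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(kept),
        "",  # fragments never reach the server, so drop them
    ))
```

With this, `http://Example.com/item?id=7&cid=abc123` and `http://example.com/item?cid=zzz&id=7` both reduce to the same string.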
JQ: This question came up today on a mailing list that I’m on.
The concern for this particular company is that they want to move
their site to a new domain but they don’t want to become invisible for
the next six months or year or however long it’ll take for people to
point to their new website. What can we tell people like that?
A: We can tell them that in the future, if you actually want
to move your site, you want to use a 301 redirect which will do as
much of the right thing as we can.
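[Editor's note: for readers who have not set one up, a 301 is simply a response with the status "301 Moved Permanently" and a Location header naming the new address. The sketch below is a minimal WSGI application that could run on the old domain; the new origin is a hypothetical placeholder.]

```python
# Hypothetical new home of the site.
NEW_ORIGIN = "http://www.new-example.com"


def redirect_app(environ, start_response):
    """Answer every request on the old domain with a 301 to the same path."""
    path = environ.get("PATH_INFO", "/")
    query = environ.get("QUERY_STRING", "")
    location = NEW_ORIGIN + path + ("?" + query if query else "")
    start_response("301 Moved Permanently", [
        ("Location", location),
        ("Content-Length", "0"),
    ])
    return [b""]
```

Redirecting each old path to its matching new path, rather than sending everything to the new home page, keeps the mapping between old and new documents one to one.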
YQ: What actually happens there? I’ve heard of companies who have used 301 redirects and yet their old pages continued to show up in the search engines anyway. Why is that?
A: The underlying problem is that people out there haven’t
changed their links and search engines do pay attention to links.
I can’t give you a date, but we’re changing how we deal with
redirects. The thing about redirects is that everyone thinks it’s
obvious how a search engine should treat them and the obvious answer
is not really that helpful. Any policy you develop with redirects is
going to make someone unhappy, but with what we’re about to roll out, we
will pay better attention to 301 redirects, and the exact problem you’re
talking about should be less common.
[In the time since we met with Tim, the team has rolled out a fix
for 301/302 redirects. Documents will be handled by the new redirect
policy as they are re-crawled and re-indexed and webmasters will start
to see many of the sites change in the next couple of weeks. The
index should be fully propagated within a month. See Tim Mayer's
Webmaster World presentation
(http://www.ysearchblog.com/files/wmw2004/search-friendly-design.ppt)
for details.]
YQ: You mentioned earlier that you’d just bought a piano. I
read on your website that when you were eight, you ran away from home
to escape piano lessons. Is that true and did you just hate piano
back then?
A: Yeah, that’s true but I never hated piano, just lessons.
Now that I’ve bought the piano, I’m practicing again. Right now I’m
learning a classical piece by Bach. I’m a slower learner now
though–I should have stuck with it when I was eight. But when it
comes to the kind of music I actually listen to, I like rock, hip hop
and classical. I’m not too into jazz.
JQ: Which do you listen to when you’re programming?
A: (laughs) I don’t actually. I don’t deal with headsets
well.
JQ: In terms of freshness, there’s a lot of talk about how
quickly an RSS-watching engine will pick up new content as opposed to
getting stuff into Yahoo! Search. The question they ask is why can’t
we just ping Yahoo! and get the crawler over here?
A: Well, that’s not the only source of latency or possible
delay. We build very large databases and it’s kind of a large
industrial process involving lots and lots of machines. There’s some
delay between the last document we heard about and the time we
actually put something live. In some cases that delay matters more
than any delay in finding out if something has changed. We also pay a
lot of attention to whether something has changed. But I think
you’ll see us getting fresher and fresher.
YQ: When you’re looking for things that are changing on the
page, what are you specifically talking about? I’m sure it’s not
enough to just change a hidden date stamp in a footer.
A: Yes, it’s more than that. Most of the web is just static
even without there being date stamps. We do have a more nuanced
notion of what it means to change, so we can tell a major change from
a trivial one.
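[Editor's note: Tim does not say how Yahoo! separates trivial changes from major ones. As a rough illustration of the idea only, the sketch below fingerprints a page after removing markup and date-like strings, so that a changed date stamp alone leaves the fingerprint untouched. The regular expressions and function names are assumptions.]

```python
import hashlib
import re

TAG = re.compile(r"<[^>]+>")
# Assumed date formats: 2004-11-01 and 11/1/2004 style stamps.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def fingerprint(html):
    """Hash the visible text, ignoring tags, dates, case, and spacing."""
    text = TAG.sub(" ", html)
    text = DATE.sub(" ", text)
    text = " ".join(text.lower().split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def significantly_changed(old_html, new_html):
    """True when the pages differ in more than markup or date stamps."""
    return fingerprint(old_html) != fingerprint(new_html)
```

An exact hash like this treats any other wording change as significant; a more nuanced system would need some measure of how much of the text changed.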
YQ: You mentioned that you were between the scientist and the
programmer. What do you think you uniquely bring to what you’re
doing?
A: Well I’m hoping that one side thinks I’m the other side and
that the other side thinks I’m the one side. (laughs) I’m hoping I’ve
got them all fooled but I have this feeling that I don’t. But no, I
think that what I uniquely bring is that I can talk to both sides.
I’ve been a programmer; I’ve trained to some extent in the direction
that the scientists have trained and went to grad school for a long
time in related topics so increasingly, though I never thought I would
play this role and it’s not what I envisioned, what I’m bringing to it
is I can talk to a lot of different sides and I can prioritize the
stuff and my grasp of the technology’s not so bad either.
Yvette Irvin
Y! Profiler
