March 15, 2006

A Chat with Andrei Broder (Part III)

A while back, Andrei Broder, a Yahoo! Research Fellow and Vice President of Emerging Search Technology, spent an afternoon telling us a bit about his decades-long history within the search industry and talking about his future projects. To wrap up our interview, we close with some of Andrei's observations on several Yahoo! Search blog reader questions.

So, several people asked how you feel about what happened to AltaVista...

AltaVista had almost perfect bad timing; it started with a huge technology advantage but an unsustainable business model at that time, and squandered its early lead in core search competency.

One of the questions from a reader was about your taxonomy paper. Can you talk a bit about that?

In there, I talk about the three stages of search that I mentioned before. Web search started in the early to mid nineties, really as a scale-out of the classic information retrieval model. At that time, people were still trying to find the best way to adapt classic information retrieval to the scale of the Web: Boolean models, probabilistic models, etc. The second phase, in the late 90s, was about metadata: hyperlinks, labels, clickthrough data, the structure of the web, metadata of any kind. But it was still very syntactic in nature, basically matching words against text. There is no understanding of meaning here. The third generation, still in progress, is about text semantics and analysis, where you are starting to understand what the queries are about. That's roughly where the paper stops. And now there are things like Yahoo! Shortcuts, or a lot of the information that is derived from the meaning of the query. Semantics, shortcuts, local search all seem to be taking off. So it seems that the paper was correct in predicting semantic search as the next phase at the time. Of course, if I were to expand the paper now, I would write about the fourth phase: information supply.

Have you looked at blog search? Why does it "suck"?

Blog search is difficult. If you look at web search in general, the biggest help comes from metadata: anchor text, links, web graph analysis, etc. For blogs we have very little useful metadata. And even if you do have metadata for blogs, it is often wrong, or you can't trust it, so you use it very little.

Furthermore, context is not always there. A lot of blogs are not self-contained; context surrounds them. Even a human doesn't understand what's going on in a blog if dropped into the middle of it. I'm not sure how much progress we will see (but then again, I'm not focusing on that!).

Finally, there were some questions about spam.

Any kind of information signal one might use in a search engine, spammers will try to pollute. We need to be careful not only about link spam, fake-site spam, and so on, but also about pollution in the query log and other more subtle sources of information. On the other hand, spamming is an economics game. People think spammers are kids up to no good, but they are not. Spam is about economics, and we want to raise the difficulty of spamming to the point where it's not economic to spam. As we go to a more personalized search experience, social aspects of search will play an increasing role. It remains to be seen to what extent this is spammable. It is hard to make robots that behave as humans, so it might well be the case that social aspects of search are fairly spam resistant.

Thank you so much for your time, Andrei, and to all our readers. Please leave us a comment below and let us know what you thought!

Comments

///AltaVista had almost perfect bad timing; it started with huge technology advantage but an unsustainable business model at that time, and squandered its early lead in core search competency.

What kind of answer is THAT?!

Apparently, people WILL NOT learn from past mistakes
People do not want to deeply self-critique
People do practice denial

Unfortunately, this spells bad news for Yahoo

Ironically, Yahoo IS becoming that AltaVista (déjà vu)

A few years ago, Yahoo had the lead that Google has now, WITH A FOUR YEAR HEAD START!!!

Look what is happening :-(

WE ALL HAVE CONSTANT OPPORTUNITIES TO LEARN FROM THE PAST!!! Whether we choose to.......

Is it possible to implement, or does the technology already exist, to filter out and drop websites that utilize redirects, before and after indexing?

It would seem to me that this would eliminate 75% of the spam that exists on Yahoo, and reward those of us who work hard to create valuable websites and content for your customers.

Thanks

"The third generation, still in progress, is about text semantics and analysis, where you are starting to understand what the queries are about. That's roughly where the paper stops. And now there are things like Yahoo! Shortcuts, or a lot of the information that is derived from the meaning of the query. Semantics, shortcuts, local search all seem to be taking off." Strangely, no one ever seems to mention serious empirical study of query types that affect critical decisions but don't occur via search engines, e.g. answers to questionnaires that citizens ask during elections. Without this, and the underlying policy terms they include and imply, you'll have a very difficult time getting at any "meaning". Without knowing the decision taking place (in that case a vote, or a decision to endorse or simply to score a candidate or party), a concept of the domain involved, and a concept of the words one finds in the query and in the response, you won't be able to get more than surface scratches at the meaning.

Actually having the various perspectives and political factions compete on this through semantic framing is probably a phase of its own, though it could be folded into "the fourth phase: information supply", since factions compete to provide that supply and frame it as they find effective for the answers that they want to give.

It's inherently competitive, almost like spam, but on a semantic level. That's why most efforts will fail: they just aren't disciplined enough at the contextual framing, and don't pay enough attention to perspective or task context. Go read the UN State of the Future Report 1993 and 1994; it has quite a bit to say about the semantic web, or what is sometimes called the sociosemantic web. There are few or no projects that actually address this, though there are some proposals here and there, and some quiet prototypes.