Questions for Tim Converse about Content Classification?
Yvette and I are planning to sit down to chat with Tim Converse. Much like she did with Paulien, we’ll ask some questions about where Tim came from as well as what he and his group are up to these days.
I asked Tim for a description of what his group is all about so that we could solicit questions from those of you outside of Yahoo. He said:
I manage the Content Classification group within YST (that’s the backend of Yahoo search). The Content group does all the crawling, indexing, and webmapping of documents for web search, and my group is responsible for categorizing those docs. We write software to algorithmically classify web pages with a special focus on catching search engine spam. We also write software to help us understand what the Content system is doing.
So if there’s something you’d like us to ask Tim, leave a comment below.
Jeremy Zawodny
Technical Yahoo!

Great Blog Jeremy!
Hello Tim: Perhaps you can tell us a little bit more about how you guys are capable of spotting correlations between categories and how that can affect the classification of web pages within the software’s algorithms?
Thanks!
Nacho Hernandez
Tim: Can you talk about how Y! sees spam versus ham? (The “I’d tell you, but I’d have to kill you” response works.)
Do you use any open source solutions in the process, perhaps SpamAssassin’s engine?
Hey Tim,
Does/will the content group make any use of the Yahoo! Directory? Or other human-created directories and/or ontologies?
Hello Tim,
A competitor of mine has 4 websites with identical content. All 4 sites have a page ranked quite high on Yahoo! SERPs for “luray va cabin rentals”. The SERP pages are different from each other but again all 4 pages are on each site.
I have reported this to you as Index Spamming a couple of times but the 4 websites are still in business. IS HAVING MIRROR WEBSITES NOT CONSIDERED INDEX SPAMMING ANYMORE??
Thanks for your response,
Karl Baldwin
Hi Tim and Jeremy,
Thanks for hosting an open Q&A session. I just have a few questions:
1. I noticed that you said “algorithmically classify web pages” above. What do you think about the new MSN Search’s Block Level Analysis? Do you believe that one page can contain content that is relevant to disparate categories or does there always need to be an overarching theme?
2. Is your categorization system based more on creating an ontology that lasts or being able to create an ontology that morphs along with user behavior?
3. In mid-October, I posted about the 301 redirect issue on Jeremy’s blog since I was seeing websites with 301 redirects being penalized for having “duplicate” content. Now, I’ve seen some websites with redirects regain their rankings, but the pages are ranking under the old URLs and the new URLs are not being indexed. Is there still work being done on the 301 issue? And do you have any idea when it will be resolved?
Thanks again for your time and consideration. I look forward to reading the upcoming discussion.
What about prank sites with wildcard subdomains, like IsGay and WasArrested, which repeatedly mangle the results for my name — http://search.yahoo.com/search?p=ordoveza (which is an issue I’ve tried to raise with Yahoo before)? Aren’t these artificially inflating their own rankings by making themselves appear to be more sites than they are? How do you plan to filter these?
Hi Tim
Are your algorithms able to automatically recognise the context of a page?
Are the keywords that you retain representative of the context, or simply representative of being present on the page in question?