Announcing the Yahoo! Search Developer Network and Search Web Services
Developers often ask what it would take for us to offer a Web
Service API to Yahoo! Search. Often they want to incorporate
our services into an application or web site they are building.
Before today I always had to give vague non-committal answers. In
reality we were already working on it.
But rather than simply release a set of APIs that expose all of our
"vertical" searches (web, image, video, news, and local), we thought
about what it would take to create an ecosystem around these services.
We quickly realized that building a real community around them was the
way to go.
Let's be real. No matter how large our image index grows in size, it is only as good as its ability to return the pictures you seek. And when each picture says a thousand words, just how deep would you have to reach into your vocabulary bank to find the right words to query? Now with more than 1.5 billion images in the index, here at Yahoo! Image Search we’re faced with a new dilemma: How do we help you navigate through this massive corpus to find exactly what you’re looking for?
Rather than forcing you to become an expert in query refinement, we’ve been thinking hard about ways we might help. For example, we noticed that a lot of sessions starting with the query “superbowl” ultimately ended up with users navigating to Janet Jackson’s famous “oops” moment. Duly noted. And while Image Search is probably not ready to pass the Turing Test, we’ve just taken some giant leaps in allowing you to express your queries in simple, straightforward language (as opposed to meddling with the Advanced Search link…)
Being the largest image search engine simply isn't enough. As we continue to expand the index size, we’re also working hard to make image searching easier and faster than ever.
While an accurate mind-reading machine is still a vision for the utopian future, search for “simpson” and our Also Try engine will make its best guess at what you are likely searching for. In this case, the engine detects that a top phrase related to the query is "Jessica Simpson," and offers this as the first “refinement” proposal. This is just like the "also try" feature on our Web Search but we've tweaked it specifically for image search. (See for yourself with a web search for simpson. Apparently Jessica ranks #1 in image search but people would rather read about Ashlee.)
And while our advanced search link won't be going away, there is an easier way to find that black and white shot of your favorite athlete. Type in the query as you would say it: black and white pics of tiger woods. A crazy concept, perhaps, but the results that follow are actually those you asked for: Tiger Woods in black and white photos, not the universe of the golfer’s images associated with the words "black," "white," and "pics." This means that you'll get the same set of results no matter whether you type in "pics," "images," "photos," or "pictures." The search engine knows what you mean.
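As a rough illustration of the idea, here is a minimal Python sketch of turning a spoken-style image query into a subject plus a filter. It makes no claim about how Image Search actually does this; the `transform_query` function, the word lists, and the `color` filter name are all assumptions made up for the example.

```python
import re

# Hypothetical synonym list: all of these just mean "I want images".
IMAGE_WORDS = {"pic", "pics", "image", "images", "photo", "photos",
               "picture", "pictures"}
# Hypothetical connector words that carry no meaning once the filter is removed.
FILLER_WORDS = {"of"}

COLOR_PATTERN = re.compile(r"\bblack and white\b")


def transform_query(query):
    """Split a natural-language image query into (subject, filters)."""
    text = query.lower()
    filters = {}
    # Treat "black and white" as a color restriction, not as three keywords.
    if COLOR_PATTERN.search(text):
        filters["color"] = "black_and_white"
        text = COLOR_PATTERN.sub(" ", text)
    # Drop image synonyms and connectors so every phrasing yields the same subject.
    terms = [word for word in text.split()
             if word not in IMAGE_WORDS and word not in FILLER_WORDS]
    return " ".join(terms), filters


print(transform_query("black and white pics of tiger woods"))
print(transform_query("black and white photos of tiger woods"))
```

Both calls produce the same subject and filter, which is the behavior described above: the choice among "pics," "images," "photos," and "pictures" does not change the results.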
This Transformed Query feature extends to size specifications as well:
Direct Display in Web Search
Lastly, in the spirit of “easier” and “faster,” you can now also preview image results directly in your web search results. See what our Direct Display returns when you type in the following web search queries:
Lately my mom looks at me like she has vastly overestimated my
intelligence for all these years. "You work on spelling?" she asks.
"It's your generation. Back when I was in school, they taught us how
to spell correctly." Stammering to my own defense, I argue that she
uses her word processor's spellchecker. She responds smugly, "So why
can't you just use something like that and work on something that
makes a difference instead?" Sigh.
Most people feel that spelling should be pretty easy--most of the
time, people are pretty good at it. Or at least they think they are.
The truth is, between 10 and 15% of the queries Yahoo! Search receives
are misspelled. That means if we didn't do any spell correction, many
people wouldn't find what they're looking for. So why can't we just
use something like a word processor's spellchecker? Try typing the
following into your word processor and seeing if you get the right
evanesescence (should be evanescence, but I get no suggestion from my word processor)
tofurkey (should not be corrected, but I get turkey as a spelling correction)
Topics of interest are changing constantly. New company names pop
up, new celebrities come into public awareness, and new products are
launched. We've got one shot to correct every word in your query and
present just one suggestion to you. On top of that, all the words in
the query interact with each other. "Brittany Spears" should be
corrected to "Britney Spears" but "Brittany Murphy" should stay the
same. So we have exactly one chance to correct all the words in the
query using an ever-changing vocabulary and keeping in mind the entire
query's context to determine which corrections to make. Piece of
People find it to be a useful feature because correcting accidental
misspellings helps them find results for a query their fingers got in
the way of expressing, and let's be honest--unless you're a former
national spelling bee champion, there are some words that you simply
don't know how to spell in the first place. Here's a quick exercise.
Close your eyes and spell the governor of California's name. It's
S-c-h-w-a-r-z-e-n-e-g-g-e-r--how'd you do? I know I got it wrong on the first try.
Don't tell my mother.
Luckily our spellchecker does a better job than I do on my own. We've
taken a great deal of editorial and linguistic input to define what it
means to do spelling correction well. Then we did a massive amount of
data processing to build a system that embodies those policies. The
end result is a system that is better than its creators at spelling.
There are still cases it gets wrong, but these cases are getting fewer
and fewer with every release. With our latest release, we're giving
many more suggestions with higher accuracy than ever before.
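To make the context problem concrete ("Brittany Spears" versus "Brittany Murphy"), here is a toy Python sketch. It makes no claim about how the production spellchecker works: the query counts, the `suggest` function, and the similarity cutoff are all invented for illustration. The only point is that a correction has to be scored against the whole query, not word by word.

```python
from difflib import get_close_matches
from itertools import product

# Hypothetical query-log counts; a real system would learn these from
# constantly changing traffic rather than a hard-coded table.
QUERY_COUNTS = {
    "britney spears": 9000,
    "brittany murphy": 4000,
    "brittany spears": 30,
}

# Every word seen in the (toy) query log.
VOCAB = sorted({word for query in QUERY_COUNTS for word in query.split()})


def suggest(query, cutoff=0.75):
    """Return a single whole-query suggestion, or None if the query looks fine."""
    words = query.lower().split()
    # For each word, consider the word itself plus similar-looking known words.
    options = [set(get_close_matches(word, VOCAB, n=3, cutoff=cutoff)) | {word}
               for word in words]
    # Score complete candidate queries so the words are judged in context.
    candidates = (" ".join(combo) for combo in product(*options))
    best = max(candidates, key=lambda cand: QUERY_COUNTS.get(cand, 0))
    original = " ".join(words)
    if QUERY_COUNTS.get(best, 0) > QUERY_COUNTS.get(original, 0):
        return best
    return None


print(suggest("Brittany Spears"))   # britney spears
print(suggest("Brittany Murphy"))   # None
```

The same first word is corrected in one query and left alone in the other, purely because of what follows it.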
So next time your fingers turn to mush and you get a "Did you mean..."
on Yahoo! Search, don't feel bad. Spelling is hard and my team and
our spellchecker are there to help you. Now I've just got to convince
Project Lead, Query Spelling Service
Ideas come and ideas go. Sometimes they blaze hot and move fast
through media spaces like print, television, movies, and pop music. Or
they burst onto the Internet and travel the gossip hotline of the
blogosphere. When they acquire an effortless momentum of their own,
keen observers talk about memeflow. Look,
look, it's the next big thing: portals, page rank, peer-to-peer,
blogs, RSS, social software, tags. The beat goes on.
But a meme is more
than a passing fancy; it's a self-replicating, widely adopted idea, an
idea with legs. Memes are of the moment, but their mission is to
evolve and endure.
The notion of memes borrows from the study of genes and genetic
evolution. Genes replicate, evolve, and spread biologically, while
memes are transmitted by human communication.
Over the last few months, you may have noticed the emergence of a
new meme: The Long Tail.
The long tail is a
familiar statistical truth: Small, everyday events are extremely
common, and big, momentous events (from huge blockbusters to great
catastrophes) are rarities that attract attention. This phenomenon
occurs in the natural world (there are many seismic blips and few
major earthquakes) and in the human realm. You can see it in the
distribution of wealth (there are very few billionaires) or population
(there are very few mega-cities) or popular search queries. Or, you
can see it in WordCount, Jonathan
Harris's interactive widget that displays the frequencies of word use
on the long tail of our language.
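One common model for this shape is Zipf's law, under which the item at rank r has a frequency proportional to 1/r. The Python sketch below is only an illustration under that assumption; the `zipf_shares` function and the sizes used are hypothetical, not measurements of any real catalog or query log.

```python
def zipf_shares(num_items, head_size, exponent=1.0):
    """Return (head_share, tail_share) of total volume under a Zipf model."""
    # Weight of the item at each rank: 1 / rank**exponent.
    weights = [1.0 / rank ** exponent for rank in range(1, num_items + 1)]
    total = sum(weights)
    head_share = sum(weights[:head_size]) / total
    return head_share, 1.0 - head_share


# Assumed example: 100,000 items, with the top 100 treated as the "hits".
head, tail = zipf_shares(num_items=100_000, head_size=100)
print(f"head: {head:.1%}, tail: {tail:.1%}")
```

Under these assumptions the 99,900 items outside the top 100 account for more than half of the total volume, even though no single one of them is popular.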
Back in October, Wired editor Chris Anderson authored a
thought-provoking and influential article titled The Long
Tail, and created a companion Long Tail blog,
which he describes as "a public diary on the way to a book." In recent
years, scientists and statisticians have applied Zipf’s
law, power laws, and Pareto distributions (the old 80/20) to
analyze and explore long-tail statistical phenomena on the Internet
and elsewhere. But Anderson's riff takes it one step further.
Anderson explores how the Internet has changed the laws of
distribution and the rules of the market. The barriers of time and
geography are down, and so is the cost of storage. The limitless shelf
space of online commerce and the availability of powerful search
engines and free or cheap publishing and communication tools (email,
groups, message boards, instant messenger, and weblogs) create
new economic, social, and cultural opportunities and new freedom of
Suddenly, the mainstream is not the only stream. There's room now
for babbling brooks, crooked creeks, and tributaries where trends pick
up momentum before they flow downstream. There's an audience for many
voices--for the eclectic and the unpopular: little blogs and the
micro-communities that cluster round them, small-press books on oddball
topics, indie music, and arcane genre
movies for niche audiences. Wikipedians
edit thousands of articles on hundreds of thousands of topics. Breadth
of content thrives in environments that are collaborative,
distributed, bottom-up, and driven as much by passion as by
In a comment to one of Anderson's blog posts soliciting definitions
for the Long Tail, an Amazon employee described the marketplace sea
change this way, "We sold more books today that didn't sell at all
yesterday than we sold today of all the books that did sell
It's no longer necessary to focus myopically on bestsellers and
mass appeal, that's the message. Devoted enthusiasts, professional
amateurs (pro-ams), and like-minded people find each other to
create communities of interest, spread influence, and share
recommendations. Thousands of RSS feeds bloom, and anyone, anywhere
can find them, mix them, and add them to My
Yahoo!. On Yahoo!
Shopping (or Amazon or eBay), merchants add their goods to a
global catalog connecting consumers to any title, any product, any
brand. On Overture,
merchants buy keywords that drive business to an "abundance of
There are plenty of pilgrims on the long tail trail. To travel this
road wisely and well, we all need long-tail tools that support
self-expression, personalized search, recommendations, and
trust. Anderson's vision shows us a horizon as vast and limitless and
rich with possibility as the long tail itself.
What aspect of the Long Tail makes your tail wag? As always, we'd
love to hear your thoughts.
You could never say that Reiner Kraft lacks vision or inspiration.
This unassuming guy with the soft voice and thick German accent comes
up with ideas--and incredibly viable ones--the way Snoop Dogg flows
Reiner's recent brainchild, Y!Q, was launched in beta last week. Based on his concept of
"disruptive distribution" technology, he believes it will
significantly change the face of search.
Here's what Reiner had to say about his passion for search innovation
and what it means to provide information "at the point of
Q: You've coined this phrase "disruptive distribution"
technology and you use it a lot when talking about Y!Q. What exactly
A: It's a mechanism for distributing search boxes all over
the Internet. As it relates to Y!Q, it's an API for webmasters that
lets them insert icons within their content so that their readers can
access related information about that content without having to leave
Q: So the distributive part makes sense. Why
A: Because it changes, or potentially changes, the way
people search. Rather than having to go to a special page to perform
a search, a search box is always a click away. You don't even have to
type in a query. You can, if you want to refine your search further,
but really it's optional.
Q: How does all this roll into Y!Q?
A: The key to Y!Q is the idea of contextual search or
relevant "information at the point of inspiration." People liked
to use that phrase before but with Y!Q it's becoming a reality. The
idea is that there is always a context to what a user is reading or
working on. So if they want to do a search, that search will be
related to it somehow.
With Y!Q, we're able to identify what that context is and provide
search boxes right where you need them. Then a user can dig deeper
and ask more questions without interrupting their workflow.
Then of course, there's the API that the content owners or
webmasters can use to integrate Y!Q into their pages. Now their
readers can click on the Y!Q icons and automatically find more
information about a subject. So in this case, the user isn't
specifying the context, the content provider tells us, "this is the
piece" that the user is interested in. It works just as well as when
the user selected the context themselves.
Q: I like that the user can specify what they want.
That's probably appealing to a lot of people.
A: Right. The other thing is that if we tried to
automatically identify the context, we'd never get it 100% right.
We'd just be guessing. But because the user says, "this is the
piece of information I'm interested in," Y!Q can get the context
right on the first try.
What's happening is the information they've highlighted gets
transmitted to our search where our algorithms extract the key
concepts and give them relevant results back.
Q: This question was posted by a blogger who thinks content
publishers could use the Y!Q icons to help generate ad revenue. He
asks, "Are there any plans to add contextual advertising to
A: That's an interesting proposition. Y!Q is a new beta
product and we're planning a lot of enhancements; but first and
foremost we're focusing on giving publishers more control over the
display and content in Y!Q. As we develop new features, we'll make
sure to post them on the blog.
Q: Another blogger asks, "do you think Y!Q will phase out
once the novelty factor wears off?" and "do you think it'll be used as
a serious search solution by working professionals, [not] just cool
A: Y!Q was designed to address two key issues: First, we
want to provide convenient access to search functionality at the point
of inspiration. Second, we want to push relevant and enhanced results
related to the context and provide superior relevancy for search
results. If we're doing a good job for one and two, I think Y!Q has a
very good chance of being adopted and used widely. Users generally use
the search tool that is easiest to use and produces the best
results. So I believe that Y!Q will be gradually accepted as the next
generation search tool of choice.
For the second question: I already use Y!Q as my default search
engine in Firefox, and it produces more relevant results compared to
other plug-ins. Therefore anybody can use it as a default search tool.
I don't think there is a preferred audience.
Q: Tell me a bit about your patents. You actually have one
A: I don't know the exact number. I filed probably over 100,
and so far on the order of 40 have been issued. It typically takes
about 2-4 years for patents to issue, so they're coming all in
A: That was mostly between the time of '98 and around 2001
Q: Are they all related to search technology?
A: No. A lot of them are, but there are many others that are related to different types of Web technologies, for example e-commerce or location awareness technologies. The latter in particular may become more important soon, once GPS devices [e.g., cell phones] appear on the market and become more broadly used.
Q: Aren't you also finishing up your thesis?
A: Yes, it's about domain specific search and is based on what I call iterative filtering meta search. The idea is to leverage the search engine infrastructures to create a filtering mechanism that automatically helps you get documents for a specialized information need. For instance, we built a buying guide finder that helps you to find just buying guides.
Q: If I hadn't checked out your website, I'd think that everything you do revolves around relevancy and search! A lot of people at Yahoo! don't know that you were part of a German band and that you've composed over 30 rock songs. How do you define yourself first: composer or inventor?
A: (laughs) I just like to think about new ideas. So to me, it's all the same thing. You create some music piece or you create some ideas or some algorithms to do something. It doesn't have to be specific to search but ideas related to web technologies in a broad sense.
Q: What's the biggest satisfaction for you in working in Yahoo! Search?
A: I think the satisfaction at the end of the day is that you've invented something that you think is cool and useful and people are able to use it and it helps them simplify things. That's particularly true with the Y!Q project. I think it could be a new paradigm for how users search. Hopefully if people like it and use it a lot, it'll become the default method for how we search. If that could be achieved, then of course that's kind of a nice thing. You've had some impact essentially--you've developed something people will use now and for years to come.
new online video files with Yahoo! Video Search, we ran
across a clip of this guy singing what sounds like a europop song with
the chorus "My Yahoo!, My Yahee!, My Yahoo!, My Yahaha!".
Once we stopped laughing at the glorious webcam kid along with the
cheesy 80s-ness of the song, we just had to share this one with
you. Hope this one amuses you as much as it did us -- the spot-on lip
syncing alone makes this worth the watch.
Yahoo! Search Tips for Webmasters: Saving Bandwidth
If you run a public webserver, you have likely seen our webcrawler, named Slurp, in your logs. Its job is to find, fetch, and archive all of the page content that is fed into the Yahoo! Search engine. We continuously improve our crawler to pick up new pages and changes of your sites, but the flip side is that our crawler will use up some of your bandwidth as we navigate your site. Here are a few features that Yahoo!'s crawler supports that you can use to help save bandwidth while ensuring that we get the latest content from your site:
Gzipped Files: Our crawler supports gzipped files to reduce bandwidth requirements. On average, you will get a 75% savings when you enable compression for your site. Many webservers provide mechanisms for sending out HTML content in a compressed format (for example, mod_gzip for Apache). How much of your site's total bandwidth you can save will depend on how much of your content is compressed and how well it compresses. In general, static pages are good candidates for compression. Any user agent, whether it is a browser or a search engine spider, will let the webserver know it can process compressed content by adding "Accept-Encoding: gzip, x-gzip" to the header of its HTTP request. All major browsers support gzip compressed content. Also you should be happy to know that if our crawler has any trouble with a compressed page, it will re-fetch the uncompressed version. In practice, it does encounter a small percentage of decompression failures.
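For a feel of what the server side of this negotiation looks like, here is a minimal Python sketch. It is an illustration only, not a replacement for your webserver's own compression module: the `encode_response` function is hypothetical, it ignores quality values such as "gzip;q=0", and the savings you see will depend entirely on your content.

```python
import gzip


def encode_response(body, accept_encoding):
    """Return (extra_headers, payload), gzipping only if the client allows it."""
    # Parse "gzip, x-gzip" style header values into a set of encoding names.
    offered = {token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")}
    if offered & {"gzip", "x-gzip"}:
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return headers, gzip.compress(body)
    # Client did not advertise gzip support: send the body untouched.
    return {}, body


# Repetitive HTML, standing in for a static page.
page = b"<tr><td>example row</td></tr>\n" * 500
headers, payload = encode_response(page, "gzip, x-gzip")
print(headers, len(page), len(payload))
```

A client that receives the compressed payload recovers the original bytes with `gzip.decompress(payload)`.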
Smart Caching: Our crawler acts very much like a web cache. Once we grab your content, we hold onto it and keep a history of how it changes over time. We do this for a variety of reasons. One of them is so that we can use HTTP mechanisms designed to help reduce network usage when a client (that's us) repeatedly fetches a web file that has not changed. In particular, our crawler often sends the HTTP If-Modified-Since header (see section 14.25 of RFC 2616) when making repeat requests. If your webserver is set up to recognize this header, it will respond with a 304 HTTP status code instead of a 200 if the content is unchanged. The advantage of this is that a 304 doesn't include your page content, so it uses up less bandwidth than a full 200 response. Again, I'd like to emphasize that our crawler is conservative when it comes to ensuring it has the latest content; it won't use an If-Modified-Since request if it needs to re-fetch your content for any reason.
Most webservers will automatically handle If-Modified-Since requests for static content out of the box. Proper cache control of dynamic content (such as PHP pages and cgi scripts) can be tricky and is an advanced topic. In most cases, servers will play it safe by ignoring If-Modified-Since requests for dynamic content. There are several sites on the web that let you test the cacheability of your web pages. For the purposes of our crawler, pay attention to what they say about the Last-Modified value in your response header.
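If you do decide to handle conditional requests for dynamic content yourself, the core decision can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `respond` function is hypothetical, `last_modified` must be a timezone-aware UTC datetime that your application tracks, and other validators such as ETag are ignored.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, body, if_modified_since=None):
    """Return (status, headers, payload) for a possibly conditional GET."""
    # HTTP dates have one-second resolution, so drop any microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable header: fall back to a full response.
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                # Unchanged since the client's copy: no body is sent.
                return 304, headers, b""
    return 200, headers, body


changed = datetime(2005, 2, 1, 12, 0, 0, tzinfo=timezone.utc)
status, headers, _ = respond(changed, b"<html>...</html>")
print(status, headers["Last-Modified"])
print(respond(changed, b"<html>...</html>", headers["Last-Modified"])[0])
```

The first call is an ordinary fetch and returns 200 with a Last-Modified value; the second replays that value as If-Modified-Since and gets a bodiless 304.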
Crawl-Delay: There's one last trick you can use to help reduce the bandwidth requirements of your site. You can use a special robots.txt directive, crawl-delay, to reduce the speed at which our crawlers make requests to your site. This allows webmasters to manage their bandwidth without restricting content on their site from crawlers and is being used effectively by sites like Slashdot. A safe value for this would be a delay that would allow us to fetch every page on your site in about five days. So a five-second delay (crawl-delay: 5.0) would be fine for a site with 2,000 pages, but not for a site with 100,000 or more, which would take nearly six days to fetch at that rate.
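The five-day rule of thumb is easy to check for your own site. The Python sketch below assumes one request per delay interval and ignores the time each fetch takes; the function names and the sample robots.txt entry are illustrative, not an official recommendation.

```python
SECONDS_PER_DAY = 86_400

# Illustrative robots.txt entry; "Slurp" is the crawler name mentioned above.
ROBOTS_TXT = "User-agent: Slurp\nCrawl-delay: 5.0\n"


def full_crawl_days(page_count, crawl_delay_seconds):
    """Days needed to fetch every page once at one request per delay interval."""
    return page_count * crawl_delay_seconds / SECONDS_PER_DAY


def max_delay_seconds(page_count, target_days=5):
    """Largest delay that still lets the whole site be fetched in target_days."""
    return target_days * SECONDS_PER_DAY / page_count


for pages in (2_000, 100_000):
    print(pages, round(full_crawl_days(pages, 5.0), 2),
          round(max_delay_seconds(pages), 2))
```

With a five-second delay, 2,000 pages take a few hours while 100,000 pages take more than five days, which matches the guidance above.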
We hope you find these tips for safely saving hosting bandwidth useful and we'd appreciate any feedback, questions or new ideas to further help improve how our crawler interacts with your web sites.
If you've been following Asa's blog, you
probably already know that Firefox is well on its way to 25 million
downloads worldwide. Well, Yahoo! has certainly noticed and
believe me, Firefox is very popular here at Yahoo! too.
However, a lot of us also rely on the convenience of Yahoo!
Toolbar, and now we can make the switch to Firefox too. For the last
month, we've been working with a team of Yahoos from around the
company who have provided some Mozilla expertise to build the beta
version of the Yahoo!
Toolbar for Mozilla Firefox.
There are still some rough edges in this beta. Like any Firefox
extension, it may cause your browser to misbehave in unexpected ways.
It's been working well for us on Windows, and we're testing on Mac,
Linux, and FreeBSD as well. See the release
notes for more details.
What's in it?
If you've never used the Yahoo! Toolbar for Internet Explorer,
here's a quick list of some of the cool features:
Easy access to Yahoo! Search
Bookmarks and custom buttons that follow you anywhere
Search This Site to find results just for the current web site
Search History to remember your previous searches
Translate This Page based on the popular Babelfish tools
Mail notification when new Yahoo! Mail arrives
One click access to Yahoo! Games, Finance, News, Sports and any
web site you choose
New Feature: One click "Add to My Yahoo" on sites that provide
Now this isn't the first Firefox extension that ties into
Yahoo!. We recently released
contextual search technology that provides related results on the fly
The support and community that continues to grow around Firefox is
amazing and we're proud to be part of it. In fact, the Yahoo! Toolbar
Beta is just one of many Firefox goodies you can expect to see from us
this year, so stay tuned.
Searching for True Possibilities: A Question From the Edge
What do you believe is true even though you can't prove it?
At The World Question Center, a virtual watering hole for intellectual discovery hosted by the Edge Foundation, a collection of scientists and cognoscenti have gathered to respond to this year's big question, "What do you believe is true even though you can't prove it?"
Edge is the brainchild of John Brockman, "cultural impresario," thought catalyst, and flamboyant literary agent. Brockman is a man who's made a career out of thinking big. He represents rock-star physicists, mathematicians, biologists, cognitive scientists, and authors who straddle many intellectual domains. Brockman believes that asking big questions in a roomful of big minds can yield rich and stimulating discourse, and feed the pursuit of intelligent hunches that benefit all of us.
In his introduction to the 2005 annual question, Brockman refers to the age of "searchculture" and ponders whether, as search technology gets better and better at answering our queries, we can continue to frame the right bright questions. The annual Edge science question smackdown is his contribution to this human quest.
Computer scientist Marti Hearst, one of 120 contributors to this year's Edge exercise, looks at how we use language to pose questions and find answers when we search on the Internet. She believes, but can't prove, that "the Search Problem is solvable"; that elegant, innovative advances in technology will allow people to find the answer to any question that's already been documented and answered in text. For Hearst, understanding queries is key to making Internet search tools more effective.
The Web has made human knowledge publicly accessible via vast electronic repositories of data. Search engines have an endless appetite for this bounty of information (and misinformation) as they ceaselessly crawl and consume the expanding online universe. Computational linguistics, natural language processing, and related technologies uncover rules that help search engines communicate better with us humans.
Hearst's work focuses on algorithms and interfaces that help users locate information without drowning in data. Researchers can discover new, unanticipated answers to unsolved problems through a process known as text mining. Users can find their way through complex information spaces like archives or image galleries, if the interface is designed for a flexible and flowing experience.
As computer scientists discover new ways to "teach" the search engine to respond "intelligently" to patterns in our search behavior, search technology helps us home in on that elusive needle in the haystack. Occasionally, it even spins us wonderful golden threads we couldn't have imagined in the days before the Web.
Personally, I'm a big believer in serendipitous discovery (in life and in search), and I believe, but can't prove, that serendipity is closely related to the mind's ability to weave truth out of hunches and accidental discoveries.
Meaning out of chaos. Isn't that what science, and for that matter Yahoo! Search, is all about?
Let us know what you believe is true about Search, even if you can't prove it. We welcome your comments and ideas.
(Note: Marti Hearst is currently serving as a consultant and science advisor to Yahoo! Search.)
And speaking of Reiner Kraft, I’ll be sitting down with him shortly to talk about his take on everything from Y!Q to German rock bands. If you have anything you’d like me to ask him, just post it below.
Every time we launch a new service, someone asks where the idea came
from and why we did it. There's a good story behind this one, so I
thought I'd write it up here.
A bit over a year ago, Jeff ran across a story on Yahoo! News about the #1 song in the
UK during the Christmas holiday. It was the Gary Jules remake
of the Tears
for Fears song Mad World. This caught his interest and
he wanted to know more.
Why is this song so popular? Where can I hear it? Is there a video?
Can I buy it?
He spent the next 30 minutes searching for answers to those questions and
It turns out that the song's popularity had a lot to do with the movie
Donnie Darko. The movie became a cult hit and the song was on the
But I digress...
As a result of this experience, he posed a challenge to the Yahoo!
Search team: to build technology that makes it possible to accomplish
tasks like this in just a few minutes.
Not long after that, the team got wind of some contextual search
technology that Reiner
Kraft was building. Now, Reiner is one of our resident geniuses.
When he was at IBM, there was a patent attorney who did little more
than handle Reiner's inventions. MIT's Technology Review even included
him in their TR 100, a
list of technology innovators under the age of 35.
Reiner had realized that the current method of searching isn't always
the most efficient way to get what you're after. Most people aren't
skilled in the art of choosing exactly which keywords to use when
Reiner's technology was designed to help eliminate that problem. The
fundamental idea was to supplement search queries with
context. So instead of having to spend a lot of time
searching and assembling all the information you're after, this
contextual search technology could incorporate that context (the stuff
you were reading at the moment you decided that you wanted to know
more) to find the most relevant results.
The team had a look at what Reiner was doing and immediately realized that
it could be used to meet Jeff's challenge. They asked for a few
tweaks and two days later, Reiner had a working prototype. Excited by
what they saw, the team asked what it'd take to turn it into a full-blown
service. In no time Reiner had the help of some of our best product and engineering folks, as well as one of our DHTML wizards.
Y!Q is also available through the Y!Q
DemoBar, an Internet Explorer toolbar that brings Y!Q's
contextual search functionality to any web page. Simply highlight
some related text on the page you're reading and then perform a