Fighting Duplication: Adding more arrows to your quiver
- Posted February 12th, 2009 at 12:29 pm by Yahoo! Search
- Categories: Search, Search Tips
Avoiding duplicates in the search engine index has consistently been a key concern we’ve heard from webmasters and site owners. Over the last few years, we have made significant strides in finding duplicates in our crawler and index algorithmically, and have provided webmasters with better tools for controlling them. Today we are announcing our support for a new use of the HTML <link> tag, rel="canonical", which helps reduce duplicates by documenting the preferred URL form to access each page.
When you use the <link> tag, you can indicate the canonical URL form for crawlers to use for each page of content, no matter how it was retrieved. This puts the preferred URL form with the content so that it is always available to the crawler, no matter which session id, link parameter, sort parameter, parameter order, or other source of variance is present in the URL form used to access the page.
To do this, specify a <link> tag in the <head> section of your page content:
<link rel="canonical" href="http://www.example.com/products" />
The above tag indicates to the crawler that the URL it is present on should be represented canonically as http://www.example.com/products. This would eliminate the following duplicates:
http://www.example.com/products?trackingid=feed
http://www.example.com/products?sessionid=hgjkeor2
http://www.example.com/products?printable=yes&trackingid=footer
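As a rough illustration of how a site might emit this tag for every variant of a page, here is a minimal Python sketch. It is not Yahoo!'s implementation: the function name canonical_link_tag and the NON_CONTENT_PARAMS set are hypothetical, and the parameter names are simply the ones used in the example URLs above. It assumes that these parameters never change the page content.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters never change what the page shows.
# The names come from the example URLs above; adjust them for a real site.
NON_CONTENT_PARAMS = {"trackingid", "sessionid", "printable"}


def canonical_link_tag(url: str) -> str:
    """Build a <link rel="canonical"> tag for the URL a page was requested with."""
    parts = urlsplit(url)
    # Keep only content-affecting parameters, sorted so that parameter
    # order does not produce different canonical forms.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CONTENT_PARAMS
    )
    canonical = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )
    # Escape the URL so that any "&" in a remaining query string is valid HTML.
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}" />'


if __name__ == "__main__":
    for variant in (
        "http://www.example.com/products?trackingid=feed",
        "http://www.example.com/products?sessionid=hgjkeor2",
        "http://www.example.com/products?printable=yes&trackingid=footer",
    ):
        print(canonical_link_tag(variant))
```

Each of the three duplicate URLs yields the same tag, so whichever variant the crawler fetches, it finds the same preferred URL in the page.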
A few technical details:
• The URL paths in the <link> tag can be absolute or relative, though we recommend using absolute paths to avoid any chance of errors.
• A <link> tag can only point to a canonical URL form within the same domain and not across domains. For example, a tag on http://test.example.com can point to a URL on http://www.example.com but not on http://yahoo.com or any other domain.
• The <link> tag will be treated similarly to a 301 redirect, in terms of transferring link references and other effects to the canonical form of the page.
• We will use the tag information as provided, but we’ll also use algorithmic mechanisms to avoid situations where we think the tag was not used as intended. For example, if the canonical form does not exist or returns an error such as a 404, or if the content on the source and target is substantially different, the canonical link may be considered erroneous and deferred.
• The tag is transitive. That is, if URL A marks B as canonical, and B marks C as canonical, we’ll treat C as canonical for both A and B, though we will break infinite chains and handle other such issues.
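The rules in the list above can be sketched in a few lines of Python. This is only one reading of them, not a description of how the crawler works: resolve_canonical, site_domain and the max_hops limit are hypothetical names and choices, "same domain" is approximated by comparing the last two labels of the host name, and stopping at the last valid URL when a loop is found is an assumption, since the post does not say how chains are broken.

```python
from urllib.parse import urljoin, urlsplit


def site_domain(url: str) -> str:
    """Crude stand-in for the site's domain: the last two host labels (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def resolve_canonical(start_url: str, declared: dict, max_hops: int = 10) -> str:
    """Follow canonical declarations transitively from start_url.

    `declared` maps a page URL to the href of its canonical <link> tag.
    The walk stops at a page with no tag, at a cross-domain target,
    at a loop, or after max_hops steps.
    """
    current = start_url
    seen = {current}
    for _ in range(max_hops):
        href = declared.get(current)
        if href is None:
            break  # no tag on this page: it is the canonical form
        # Relative hrefs are resolved against the page they appear on.
        target = urljoin(current, href)
        if site_domain(target) != site_domain(start_url):
            break  # cross-domain canonical: not honored
        if target in seen:
            break  # A -> B -> A: break the chain instead of looping forever
        seen.add(target)
        current = target
    return current


if __name__ == "__main__":
    tags = {
        "http://test.example.com/a": "http://www.example.com/b",
        "http://www.example.com/b": "/c",
    }
    print(resolve_canonical("http://test.example.com/a", tags))
```

Here A (on test.example.com) points to B, and B points to C with a relative path, so both A and B resolve to http://www.example.com/c. A tag pointing at http://yahoo.com would be ignored under the same-domain rule.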
For several years, we have had a clear policy on handling redirects that allows you to take control of how crawlers and browsers relate between pages on your site. Another useful tool for eliminating spurious dynamic URLs and avoiding content duplication is the Rewrite Dynamic URLs feature of Site Explorer. All you need to do is authenticate your site in Site Explorer, which can now be done instantly, and then create a URL Rewriting rule. The benefit of this approach is that Yahoo! does not need to crawl your duplicate pages to discover the canonical relationships. The <link> tag provides you with another resource to use, and is also being supported by our other partners in the Sitemaps effort, Google and Microsoft.
We recommend that you structure your site with normalized URLs and minimum duplication, or use 301s if need be. If those don’t work for you, try Site Explorer and/or the <link> tag. Our support for the <link> tag will be implemented over the coming months. Let us know if you have any questions on our Site Explorer Suggestion Board.
Priyank Garg
Director Product Management
Yahoo! Search
- 79 Comments
I’ve got Drupal, Magento and WordPress plugins ready for this feature: http://yoast.com/canonical-url-links/
This is HUGE, long have been in an endless battle of avoiding duplicate content via our partner network.. Thank you yahoo!
This is a very good idea to avoid pages that have unnecessary and redundant parameters passed to it, which form complex urls.
This is great news. It’ll solve many problems with dynamic content websites.
This is all well and good, but the Site Explorer API has been down since December.
This seems like a major step forward and I don’t see any downside to it.
Just the ticket to make life LOTS easier for those who want to cooperate with search engines (everyone).
Q: is a 301 redirect to the canonical URL redundant if a page is using the link tag?
Really that was well idea pertaining that has to abuse some spam link
Keep going.
Not a good idea: avoiding duplicates in the search engine index must be done in deleting duplicate resources from your index.
You do not have to palliate the bad URL and HTML implementations of webmasters, in particular if they are not able to know, understand and apply (URL) standards.
I applaud this initiative from the big G, Y! and MSN (aka Live). It’s a big help for e-commerce sites that live with this issue, which becomes time-consuming and expensive to support and solve.
This is a great idea. It will save me trying to use .htaccess to rewrite all the weird version urls that come into our site and the utm_campaign variables.
Good information, most people don’t know how much duplicate content can harm search results and user experience.
This is a bad idea, because it will hinder those of us with “The Same Relevant Message” we’re trying to share with as many people as we can in different social networking communities on the web.
Duplicating that same message, as did, Dr.Martin Luther King, all of our Presidents, Congressmen, and women, and anyone with an agenda, whether it’s political, or business, which is what franchising is all about, “duplication,” and that’s the way the search engine must observe some of the content of those like me with a real redundant message.
i like this tag)
It may have been nicer to have borrowed the form from Atom rather than creating a new rel type.
Hey Priyank, I use WordPress, /category/ and /tag/ generally crawl by search engines apart from the post pages, so by using “canonical” link can we prevent ourselves from content duplication?
I have one absolutely burning question about this tag:
If I include it on a page which has a meta robots tag of “noindex”, and point it to a canonical variant of this page (which can be indexed), does this cause any problems?
Essentially, we use meta robots “noindex, follow” for things like pagination, different sorting order of products, etc etc – this handles the duplicate content issue (and much better than robots.txt, from a site-owner’s perspective).
What I want to make sure is that, if I include this new rel=canonical tag, that search engines that don’t handle this new tag can handle the “noindex” tag to eliminate duplicate content that way and search engines which do use the canonical tag are correctly supported.
This is the single most important thing I need to know about this new tag.
The second most important thing is – is the behaviour of the above standardised with the other search engines which are using it too?
This is fantastic
I’ve modified discuz for canonical URL link: http://www.shyedu.net/it-website/discuz-URL-canonicalization-128.html
This sounds great, but has anyone done any testing yet? I mean, are dup URLs actually being *removed* from indexes by this? How is this going to affect manipulative duplicate content algos?
Canonicalization issues should be addressed in planning and development and are very easily avoided when you’ve structured your website appropriately. Keyword research, develop, deploy. I just don’t trust this *one* tag (anyone remember metas?) to resolve the issues, entirely; it’s up to programmers to program accordingly. Dynamic 404s and strict URL structuring is an extremely effective, preemptive technique that people aren’t using as it is. What happens when this tag gets abused or deployed incorrectly?
Will this tag actually have any effect on ‘big’ sites that *don’t* implement this technique?
I need to understand the reward and penalty structure of this tag, in direct reference to white hat, and black hat, policies; and what Search Engines have in mind for this consideration.
This will be interesting to watch unfold over the next several months…
Arow
Arow,
Thanks for the insight into the tag.
I would like to know the implications too. I am a newbie to web development and In fact I was considering bypassing doing the research on the mod rewrites in php application that i am developing.
I have posted a few questions(as below) other forums too :
Does this help in how google sees the dynamic urls ?
With a not so adequate knowledge about how google sees the Dynamic URLs as not so google friendly, I was looking for all the information that is available for changing the dynamic URLs to the Static ones.
I am not sure if this tag saves all the research that I was about to do starting from the .htaccess files to the MOD rewrites for the php applications. OR is this tag really a substitute for that , Anyone – Any comments on that would be much appreciated.
Thanks,
Arun – Web developer ,
Proxima Systems India.
Finally!! Useful Tag…Just use it now! Thanks
It is good to see this tag out. It will definitely solve a lot of problems for Webmasters everywhere. Good job!
I hope this will help with our Yahoo store!
This tag will be really helpful for the canonical problem that exists on most sites.
Quite helpful. We have implemented it on few web sites but still the results are not very encouraging. Let’s see how it will behave further.
Seeing as my original post (arows1faith; Feb 24th, 2009) hasn’t had any a/b testing replies, yet – and this article is still quite visible – I was wondering if anyone had any “I done gone and proved it” data to share?
I haven’t seen a difference in using this tag, alone. Combined with and on-page linking there is a significant difference, but nothing to show that this tag – by itself – is doing anything….
Arow
I apologize for including {code} in my previous reply….
The second paragraph should read:
“I haven’t seen a difference in using this tag, alone. Combined with the {title} tag and on-page linking there is a significant difference, but nothing to show that this tag – by itself – is doing anything…”
Does the canonical link tag really work on Yahoo? I decided to add this functionality to my site 3 months ago to minimize the coding, but it seems the old URL still exists and is still indexed, for example http://www.dressupdollgames.net/index.php?params=game/332/ , where it should be http://www.dressupdollgames.net/game/332/Roiworld-Girl-Dress-Up-Game-20.html . I have the tag placed correctly on my site.
Duplicate content is a headache. The most problem encountered is with the codes in wordpress blogs, you started off with no intention of duplication but ended up with duplication issues with the categories and tags. Duh!
Good information, most people don’t know how much duplicate content can harm search results and user experience. This tag will really be helpful for the canonical problem which exist in the most of the sites.
The canonical link is a great step forward in fighting duplication. Thanks for the info on how to implement the canonical link tags.
Can’t figure out how to use the Rewrite Dynamic URLs feature of Site Explorer. Where do I authenticate my site in Site Explorer…hmmm Though I must admit that the is very much easy to use for the canonical solutions.
What if your links are currently divided between the www. and non on yahoo site explorer. Will this meta tag fix the problem and help me avoid going backwards and trying to fix all these links 1 by 1?
Does the canonical issue still apply? I prefer plugins that address this, so much better! :)
Is there any other way to have this done? This is a great idea when you have only a couple of URLs with duplicate content, but what if you have several URLs? My site has a background color variation feature. This actually changes the URL parameters, but the content is the same. How can I solve this? The .htaccess approach also can’t apply here: too many URLs. Any help will be great, thanks.
I think this tag won’t help for the canonical problem.
Ever since this canonical issue came out, we have been applying this to all of our sites. It stated that it would eliminate different URL structures with the same destination, thus consolidating link juice. This is great stuff. I’m thinking: what’s next?
Cheers,
Nice information. Most people don’t know how much duplicate content can harm search results and user experience. This tag will really be helpful for the canonical problem which exist in the most of the sites. Thanks for sharing.
As someone who has continued to optimize his own site, I do agree that duplicate content can hurt. The tag is very easy to use, and could be extremely helpful in the future.
This may solve a lot of duplication problems.
This code is very useful, duplicate content is an ongoing issue.
Many people do not know a lot of duplicate content can harm your search results and user experience.