Fighting Duplication: Adding more arrows to your quiver
Avoiding duplicates in the search engine index has consistently been a key concern we’ve heard from webmasters and site owners. Over the last few years, we have made significant strides in finding duplicates algorithmically in our crawler and index, and have provided webmasters with better tools for controlling them. Today we are announcing our support for a new HTML tag, the <link rel="canonical"> tag, which helps reduce duplicates by documenting the preferred URL form to access each page.
When you use the <link> tag, you can indicate the canonical URL form for crawlers to use for each page of content, no matter how it was retrieved. This puts the preferred URL form with the content so that it is always available to the crawler, no matter which session id, link parameter, sort parameter, parameter order, or other source of variance is present in the URL form used to access the page.
To do this, specify a <link> tag in the <head> section of your page content:
<link rel="canonical" href="http://www.example.com/products" />
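The above tag indicates to the crawler that the URL it is present on should be represented canonically as http://www.example.com/products. This would collapse duplicate URL forms for that page, such as variants that differ only by session id, sort parameter, or parameter order, into the single canonical form.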
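As a rough illustration of how a crawler might read this tag, the Python sketch below parses a page and returns the canonical href declared in its <head>. The class name CanonicalFinder, the helper find_canonical, and the sample page are assumptions made for this example; they do not describe Yahoo!'s actual crawler.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the first <link rel="canonical"> found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head and self.canonical is None:
            attr_map = dict(attrs)
            # rel may hold several space-separated values, in any letter case
            rels = (attr_map.get("rel") or "").lower().split()
            if "canonical" in rels and attr_map.get("href"):
                self.canonical = attr_map["href"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonical(html):
    """Return the canonical href declared in the page head, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page body; it could have been fetched under any URL variant.
page = (
    "<html><head><title>Products</title>"
    '<link rel="canonical" href="http://www.example.com/products" />'
    "</head><body>...</body></html>"
)
print(find_canonical(page))
```

Because the tag travels with the content, the same value comes back no matter which URL variant was used to fetch the page. In this sketch, a tag placed outside <head> is ignored.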
A few technical details:
• The URL paths in the <link> tag can be absolute or relative, though we recommend using absolute paths to avoid any chance of errors.
• A <link> tag can only point to a canonical URL form within the same domain and not across domains. For example, a tag on http://test.example.com can point to a URL on http://www.example.com but not on http://yahoo.com or any other domain.
• The <link> tag will be treated similarly to a 301 redirect, in terms of transferring link references and other effects to the canonical form of the page.
• We will use the tag information as provided, but we’ll also use algorithmic mechanisms to catch situations where we think the tag was not used as intended. For example, if the canonical URL does not exist or returns an error such as a 404, or if the content on the source and target is substantially different, the canonical link may be considered erroneous and disregarded.
• The tag is transitive. That is, if URL A marks B as canonical, and B marks C as canonical, we’ll treat C as canonical for both A and B, though we will break infinite loops and handle other such problems.
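The rules above (relative or absolute hrefs, same-domain targets only, transitive resolution with loop breaking) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, "same domain" is approximated by comparing the last two host labels, and falling back to the starting URL when a loop is found is a choice made for this sketch, not documented behavior.

```python
from urllib.parse import urljoin, urlsplit


def registrable_domain(url):
    # Assumption: the last two host labels identify the domain. This is a
    # simplification that is wrong for suffixes such as .co.uk.
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def accept_canonical(page_url, href):
    """Resolve href against page_url; return None for a cross-domain target."""
    target = urljoin(page_url, href)  # handles relative and absolute hrefs
    if registrable_domain(target) != registrable_domain(page_url):
        return None
    return target


def resolve_canonical(url, declared):
    """Follow canonical declarations transitively.

    `declared` maps a page URL to the href in its canonical tag.
    Assumption: if the chain loops, the tags are treated as erroneous
    and the starting URL is returned unchanged.
    """
    seen = {url}
    current = url
    while current in declared:
        target = accept_canonical(current, declared[current])
        if target is None:
            break            # cross-domain target: ignore the tag
        if target in seen:
            return url       # loop detected: give up on the chain
        seen.add(target)
        current = target
    return current


# Hypothetical declarations: A marks B as canonical, and B marks C.
declared = {
    "http://www.example.com/a": "/b",
    "http://www.example.com/b": "http://www.example.com/c",
}
print(resolve_canonical("http://www.example.com/a", declared))
```

With these declarations, both /a and /b resolve to /c. A tag on http://test.example.com pointing at http://www.example.com is accepted, while one pointing at http://yahoo.com is rejected.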
For several years, we have had a clear policy on handling redirects that allows you to take control of how crawlers and browsers relate between pages on your site. Another useful tool for eliminating spurious dynamic URLs and avoiding content duplication is the Rewrite Dynamic URLs feature of Site Explorer. All you need to do is authenticate your site in Site Explorer, which can now be done instantly, and then create a URL Rewriting rule. The benefit of this approach is that Yahoo! does not need to crawl your duplicate pages to discover the canonical relationships. The <link> tag provides you with another resource to use, and is also being supported by our other partners in the Sitemaps effort, Google and Microsoft.
We recommend that you structure your site with normalized URLs and minimum duplication, or use 301s if need be. If those don’t work for you, try Site Explorer and/or the <link> tag. Our support for the <link> tag will be implemented over the coming months. Let us know if you have any questions on our Site Explorer Suggestion Board.
Director Product Management