May 02, 2007

Introducing Robots-Nocontent for Page Sections

We recently returned from our annual rendezvous at SES New York and, as always, learned a lot from our webmasters. The ‘Robots.txt Summit’ generated some healthy discussion and support for adding a tag to mark parts of a page that do not relate to the main content, such as navigation, menus repeated across the entire site, boilerplate text, or even advertising. We heard what people were asking for, so we did a little homework and are now happy to introduce the ‘robots-nocontent’ tag.

This tag is really about helping our crawler focus on the main content of your page and target the right pages on your site for specific search queries. Since the number of times a particular source can appear in the top ten is limited, it’s important that the proper matching and targeting occur in order to increase both the traffic and the conversion on your site. It also improves the abstracts for your pages in results by omitting unrelated text from search result summaries.

To do this, webmasters can now mark parts of a page with a ‘robots-nocontent’ tag, which indicates to our crawler which parts of a page are unrelated to the main content and are only useful for visitors. We won’t use the terms contained in these specially tagged sections as information for finding the page or for the abstract in the search results. Note: Using a “nocontent” tag to mark explicit sections of content is not considered “cloaking” because all of the content on the page remains available to us to protect the relevance of the results (unlike “cloaking”, where we may be served content that is different from what visitors see).

So for example, the header and boilerplate on Yahoo! Answers might be useful to visitors, but it’s probably not helpful when searching for this particular page. The ‘robots-nocontent’ tag allows you to identify that for our crawler in order to improve the targeting and the abstract for the page.

[Image: NoindexExample_cropped.JPG]

Applying the class="robots-nocontent" Attribute:
Here are a few examples of how to apply this attribute for various uses and different syntax options:

    <div class="robots-nocontent"> This is the navigational menu of the site and is common on all pages. It contains many terms and keywords not related to this site</div>

    <span class="robots-nocontent"> This is the site header that is present on all pages of the site and is not related to any particular page</span>

    <p class="robots-nocontent"> This is a boilerplate legal disclaimer required on each page of the site</p>

    <div class="robots-nocontent"> This is a section where ads are displayed on the page. Words that show up in ads may be entirely unrelated to the page contents</div>
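As a rough illustration of the behavior described above, the following Python sketch collects the text of a page while skipping any element whose class list includes robots-nocontent. It is not Yahoo!'s crawler code: the names IndexableTextExtractor and indexable_text are hypothetical, the use of Python's standard html.parser module is an assumption made for the example, and the sketch assumes reasonably well-formed markup in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexableTextExtractor(HTMLParser):
    """Hypothetical helper: keeps text outside robots-nocontent elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a robots-nocontent element
        self.chunks = []     # text fragments that would be indexed

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # class is a space-separated list, so match on individual tokens.
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def indexable_text(html):
    """Return the page text with robots-nocontent sections left out."""
    parser = IndexableTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<div class="nav robots-nocontent"><a href="/">Home</a> <span>Menu</span></div>'
    '<p>Main article text.</p>'
    '<p class="robots-nocontent">Legal disclaimer</p>'
)
print(indexable_text(page))
```

Because the check works on individual class tokens, an element such as class="nav robots-nocontent" is treated the same way as one that carries only the robots-nocontent class.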

We’re rolling out an index update tonight for this change. As usual, you’ll see some changes in ranking along with shuffling of the pages that are included in the index. Let us know what you think and share your thoughts on other forms of support you’d like to see down the road on our suggestion board.

Update: Addressing some comments and questions: with regard to links, ‘robots-nocontent’ does not in any way affect how links are treated. All links will continue to be used to find targets and will carry attribution to the target if they do not have the ‘rel=nofollow’ tag on them, whether or not they are inside a ‘robots-nocontent’ section.
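The link handling described in this update can be sketched in Python as follows. This is only an illustration under stated assumptions, not the crawler's actual logic: LinkCollector and collect_links are hypothetical names, and the tuple layout (href, inside a robots-nocontent section, carries attribution) is invented for the example. Every link is collected regardless of where it sits; only rel=nofollow changes whether it carries attribution.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkCollector(HTMLParser):
    """Hypothetical helper: records every link on the page."""

    def __init__(self):
        super().__init__()
        self.nocontent_depth = 0  # > 0 while inside a robots-nocontent element
        self.links = []           # (href, in_nocontent, carries_attribution)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        marked = "robots-nocontent" in classes
        if tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            in_nocontent = self.nocontent_depth > 0 or marked
            # The link is always recorded; only nofollow removes attribution.
            self.links.append((attrs["href"], in_nocontent, "nofollow" not in rel))
        if tag in VOID_TAGS:
            return
        if self.nocontent_depth or marked:
            self.nocontent_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.nocontent_depth:
            self.nocontent_depth -= 1


def collect_links(html):
    """Return (href, in_nocontent, carries_attribution) for each link."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<div class="robots-nocontent">'
    '<a href="/a">A</a><a href="/b" rel="nofollow">B</a>'
    '</div>'
    '<a href="/c">C</a>'
)
for link in collect_links(sample):
    print(link)
```

In this sketch the second field is purely informational: whether a link sits inside a robots-nocontent section has no effect on the third field.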

We deploy various algorithms and mechanisms to understand your website and pages, including headers, navigation, footers, etc. However, by using this and other markup such as ‘rel=nofollow’, you can ensure we have more information to understand your site correctly.

On standards, we would be happy to make this into a microformat and are already reaching out to that community. We chose this mechanism because we saw that it was compatible with existing standards and microformats and that makes it easier to gather broader support, including from the other search engines.

Priyank Garg
Yahoo! Search

Comments

  1. Thanks for this feature, Yahoo!

    Can you please clarify what will happen to links contained within class="robots-nocontent"? Will they be followed?

  2. This is a very cool feature! Let’s hope the others will follow.
    I do have one question though…
    Maybe this is a strange/stupid question, but why use a property of a tag (class="") instead of an actual tag? It seems a bit confusing to me.

    Thx,
    Joost

  3. …because inventing a new tag would mean violating the (X)HTML specification (unless of course it used a new namespace).

    It would be good to see this class-based attribute standardised across the search engines, perhaps via the microformats community…

  4. Thx Frankie, that makes sense. Feeling like a little kid now, but glad to understand. Will shut up now.

  5. This new tag does not change any treatment of inlinks from the page. Links within the section marked with ‘robots-nocontent’ will be treated just like links in the rest of the page.

    They will continue to be actually crawled to find the target page, but they will not carry link attribution if they have the ‘rel=nofollow’ tag.

    Hope that answers your question, Alan.

    And Frankie, thanks a lot for explaining so precisely our choice of mechanism. We are also hopeful the community and other search engines will adopt this.

  6. Since when are search engines dictating semantics in HTML?

    Establishing a robots only class markup is not the answer.

    Why not detect classes that are *extremely* common (eg. footer, header, nav!) and actually work on figuring this out on the search engine side?

    Search engines should not be meddling with my markup.

  7. Which of Yahoo’s crawlers does this apply to?

    Can I do this with my YPN ads too?

  8. This seems like extra non-standardized code bloat. It should be the search engines’ task to determine non-changing templated regions of a page.

    The header and nav are important for indexing — they contain alt text and link text that defines the overall theme of a site. This tag is implying that Yahoo doesn’t have the ability to determine what are templated regions of a page. If you do have that ability, why add complexity?

    The abstract in the search results should be taken from the meta description or a snippet of text that matches the query.

  9. Thanks for sharing this information. I will try to use it.

  10. Bad idea. Why class? Why do it for each particular non-relevant part?
    It is MUCH WISER to do it e.g. using special comments as with Google AdWords, so you do not have to then make e.g. a special div container.
    And why class? What if I use more classes? Will it recognise class="myclass1 robots-nocontent otherclass3"? Why do you not use rel? If you wanted to do it this way, you could do [p rel="nocontent"] or [p rel="nocontent nofollow"] if you want to neither index nor follow the links in that particular paragraph.
    And last: please could you negotiate such things with the other big players, MSN & Google at least? You did it with sitemaps and that is great. Such Yahoo-only features will be much less used, and you will be the ones who will lose. No developer will use several different ways of marking relevance.

  11. If search engines really need help to know which page sections to index, they ought to let us give it using css or xpath selectors in the robots.txt file, not cluttering up our markup. (see: http://www.standardzilla.com/2007/05/03/yahoo-meddles-with-no-content-class/)

  12. And one more: I would prefer a tag for marking some part as “content” rather than “nocontent”. On most websites, you have one content div, plus maybe some others. If you do not use special divs for the other parts, it would be problematic to write “nocontent” everywhere. But anyway I would prefer something like

  13. @noname:

    • As you demonstrate, class is a space-separated list. Therefore as a webmaster I would presume that robots-nocontent will respect that as part of the HTML spec. Therefore really, you won’t need to add any extra DIVs or Ps or anything, because you can tack this new class just onto existing elements.

    • It would be completely inappropriate to use the REL attribute, as that describes the relationship between two pages. It is only applicable on LINK or A elements (where an HREF can be followed).

  14. What will happen with links, text and images in the NoContent tag?
    Can you explain in DETAIL how Yahoo will use the information in NoContent?
    And are there any REAL reasons to rewrite scripts and templates to use this tag on our sites?
    The only one I see right now is the better-looking SERPs snippet.

  15. I appreciate your efforts to support unsearchable page areas. I just think that abusing the class attribute this way creates a whole lot of a mess. Think of static sites and editorial contents stored as HTML. Webmasters of legacy sites who want to use this feature must edit each and every page.

    Why don’t you support CSS-like syntax in robots.txt? Other engines would just ignore statements like

    A.advertising { rel: nofollow; } /* devalue affiliate links */

    or

    DIV.hMenu, TD#bNav { content:noindex; rel:nofollow; } /* make site-wide links unsearchable */

    Please consider a revamp!

    Thanks
    Sebastian

  16. Why not instead use robots.txt to indicate a class, or a list of classes that would be construed to be nocontent.

    An alternate meta tag could be created for those that prefer to do it that way. It just seems like forcing the usage of a particular class title isn’t the best approach for the long haul as it muddies the waters. Search engine directives should stay in robots.txt or in the metas whenever possible.

  17. Just to add my opinion:

    I’m not a big fan of changing my markup for this kind of thing. The robots.txt file is made for this so I like Sebastian’s suggestion of adding a class and/or id of DOM elements to the robots.txt file. My users shouldn’t have to download additional text to accommodate the search engines. The robots.txt file is only used by search engines. Funny that this came up in the robots.txt conference and this wasn’t the outcome..

  18. Robots.txt is concerned with controlling the crawling of a page, not the indexing of its contents. As such, it is probably not the appropriate place for this kind of feature.

    Robots (such as Slurp) aren’t necessarily written to read external CSS or JavaScript files and apply them to a page’s content. Given the inherent nature of a robot, I think Yahoo’s solution is probably the most practical. It is also the simplest to document.

  19. If you want to find out what is and is not the main content of a Web page, it would seem to make the most sense to advocate for furthering along the XHTML role attribute.
    http://www.w3.org/TR/xhtml-role/#s_role_module_attributes

  20. It would make more sense, given that the majority of web pages have one content section and many non-content sections (headers, footers, sidebars, etc.) to have a class called ‘robots-content’.

    That way you would only have to add the class to one element instead of 5 or 6…

    And, looking at the statistical results that google did a while back ( http://code.google.com/webstats/2005-12/classes.html ), you could probably just search for a class called ‘content’ without anyone else doing any extra work ;-)

  21. Oh dear, someone please help asap. Something happened! Just lost all listings on the se results pages for my website, advicediva.com. Not sure what happened. I hope this is fixed soon. I went from plenty of number one search engine results to none or maybe one. This is my home business and I need these placements! What on earth happened? Honestly…I just gave birth to my first child last Friday night and suddenly I have no sales coming in. Oh my word, I hope someone can help. I have a newborn and I really really need help, no joke, suddenly single mom here. Why did this happen to me? I mean, either I dropped completely in rankings from number one to nothing or I have been removed altogether? Can someone please fix this asap? I am in real trouble.

  22. Was there really a problem that deemed this necessary? It seems like the search engines were already doing a pretty good job determining relevant pages.

    I could be wrong, but it really does seem like ‘code bloat’.

  23. I need to repost because I really need help and to get someone’s attention. I really need help and I don’t know what to do. All of my pages were removed during this last update for http://www.advicediva.com I have spent years on this site and it is my sole source of income. I have tried requesting rereviews and sending in requests for help but I keep getting the same response that my website MAY not be in compliance and that they can’t give me specific information. I have changed everything I can possibly think of which could be wrong and resubmitted, same response. But this is a desperate situation for me. I am not an seo master and this website is all I have, I am a work at home mom…now! I just gave birth on Friday to my first child and my website has always sustained me. Not a huge income but okay. Yahoo has always put my pages up every time I write a new one. But now, just a few days after I give birth I lost all of my pages. I really don’t know what to do or who to talk to, I need something done as soon as possible. I had not one sale today and that is really bad. I probably sound like a freak, but I am sorely concerned for the sake of my son. I have no spam on my site, no strange seo tactics, no cloaking, nothing. I have affiliate links in the form of ads for extra income but all content is unique and my own. I have gone through every single possible thing I can think of and I am at a loss. Can someone please help for the sake of my new son? I can not afford to wait several weeks to be back online, I just can’t. I have a child now!

  24. I’m no expert but I can see a flaw with the use of the class attribute for implementing robots-nocontent as shown: class="robots-nocontent".

    Any webmaster will have to update their CSS to robots-nocontent, example:

    yadda… yadda..

    If I will use that, then I should update it to:
    yadda… yadda..

    Update my CSS from .MyHoriMenuBlock to .robots-nocontent

    I agree with the use of W3C’s XHTML Role Attribute Module -> http://www.w3.org/TR/xhtml-role/

    As we can’t use rel=”robots-nocontent” if we already have rel=”nofollow” for example on comment pages.

    Again, I’m no expert and I could be wrong.

  25. Good to see Yahoo coming out first for something of this kind. Now it remains to be seen when Google and MSN will unite on this one too, like they did on sitemap protocol 0.9.

  26. @JC John
    The class value is a space delimited list, that is you can assign multiple class names to HTML elements:
    class="MyHoriMenuBlock robots-nocontent"

  27. This is truly unbelievable. What are they thinking? They just kicked over the first domino that will counteract with billions of other dominoes out there, which are dropping at a phenomenal rate when they deserved to stand and stand tall. These sites that are dropping were in perfect placement of where they belonged. For this change in Yahoo’s alg’s, I must say, someone is off their rocker and highly needs to reconsider the “undoing” of this change immediately. I have over 300 sites I manage as a web developer and I can honestly say this is the worst change in the history of Search Engines. I hope you sleep well at night Yahoo!

  28. I look to the day when the Big3 (remember when that used to mean Ford, Chevy…) adopt a universal method for on-page span or divisional noindex/no-follow/etc. With mashup apps becoming mainstream this will become a truly necessary webmaster tool. Yahoo! on Yahoo! for taking the leap first!

  29. Personally, I like the idea about robots.txt having xpath or css like syntax to facilitate adoption of this tag on your site. I have some concerns about the fact that robots.txt is generally very stable and does not have code. And this approach changes that. But we’ll think about that and bring it up with the other search engines in our conversations around collaborating on robots.txt

    With regard to the ‘robots-content’ proposals, we considered it but felt that it can be confusing since by default all content on your page is to be indexed. If we go with a ‘robots-onlycontent’ type of semantic, that is more dangerous because someone can inadvertently exclude content that is useful to the page. We decided to start with the more explicit ‘robots-nocontent’, which has no unintended consequences for adjacent sections of the page. However, we are open to feedback and will do what makes sense.

    Thanks for all the feedback, please keep it coming.

  30. Yahoo! has been a leader in using microformats. I think this is a great extension to the platform.

    For those worrying about abusing classes, you really should visit microformats.org to learn more about how you can easily add valuable information and functionality to your sites using the class attribute.

    I just wish you hadn’t used the term “tag”, that was misleading from the get-go.

  31. @Ted Drake: I say this with *absolutely no* negative connotation towards the Y! Search guys or this development, but just for the sake of clarity: ‘robots-nocontent’ is *not* a microformat; it’s just a class name. Whilst there is a similar robots-exclusion microformat (a stagnant draft) on the microformats wiki, robots-nocontent is not a part of it.

    Whilst I’m sure someone will clearly spec. this class name somewhere, we are trying to be careful to avoid the generalisation of the term ‘microformat’ right now, since it should not become synonymous with ‘semantic class name’.

  32. Agh, I should clarify something for other commenters:

    “we are trying to be careful to avoid the generalisation of the term ‘microformat’”

    When I say ‘we’, I refer to the microformats development community and not to Yahoo! (I work in a completely separate capacity to Yahoo! Search; my comments should be read as independent viewpoints). Thanks.

  33. The only problem with this is the all-or-nothing approach……

    another option would be a new tag to make a section ‘deprioritized’ in the serps….

    SearchEnginesWeb would like to propose a
    * SUBCONTENT * tag – that would allow a given section to be spidered, but make it LOW in the importance level of that Webpage

  34. This is a very cool feature! Let’s hope the others will also follow Yahoo

  35. Good feature. I always think that there is some content on my webpage that I do not want followed. Now Yahoo has solved this problem with the “robots-nocontent” feature. Thanks for improving Yahoo’s features.

  36. Some reasons not to use nocontent:

    * Confusion among webmasters on how to correctly use it (like the confusion about rel=nofollow and meta tags). It will not be implemented correctly on many Web sites.
    * Only some sites will use it, which means spiders will have to do what they have always done and separate template from content.
    * Just one more level of complexity in building a Web site.
    * Possibilities for abuse?

    I think it would be interesting to standardize a way to mark navigation elements in XHTML — for example *div class=”hNav”* — but making a class specifically designed for the user to control search engine indexing between sections of a single page is not a good idea, especially when it is just designed for a single search engine. Please reconsider.

  37. One more addition to what I wrote above:

    If there were a microformat for marking up navigation sections, for example class="hNav", search engines could use that data when indexing pages, but it would also serve other purposes such as telling mobile devices to collapse those nav sections when displaying them. It just seems like nocontent is not the best way to go about this…

  38. Kudos to the concept but I’m not sure I like the implementation.

    You should have implemented it like Google Adsense section targeting.

    Basically use HTML comments so that the begin and end are CLEARLY identifiable by a robot so that HTML parsing issues don’t come into play.

    Can we fix this please?

  39. Just looked at some templates of my site and simply thought “Oh no, I won’t do that”!

    Re. class=”robots-nocontent a b c”: divs and spans are mixed and nested for design and not content reasons AND I also do pretty much prefer the HTML comment approach!

    IF you are doing it via class arguments THEN you should also crawl CSS definitions. THERE is an easy place to define ‘robots-nocontent’ site wide.

  40. Um, so if i had lots of different content i would get more traffic?

  41. Thanks for sharing this information. I will try to use it.

  42. I wonder if Robots-Nocontent is applied for Yahoo! Japan.
    (I’m a Japanese blogger living in Japan)

  43. I like this. Using HTML for will push SEO’s to reply.

  44. Well thanks a lot Yahoo. My site, which I pay to have indexed in the Yahoo directory, just lost every single ranking! Great. And I have no clue how to get it back. Oh and what’s best of all is that the sites that now have my place in the rankings are useless sites that barely have any backlinks even.

    My site has been around for 6 years as one of the best modeling advice sites online. Google has me as top as well as msn. Now I can’t even find my site on yahoo?

    What’s going on? I don’t even know how to fix it. Can someone give me a plain English answer, or put it back to the old way where well deserved sites have the top spots?

    My whole life income is online so it is quite important. I just don’t know how to fix it. Please help me out here so I can at least get back up there.

    Thanks

  45. Thanks so much for the new feature!

  46. You say you did “a little homework”, but I wish you had done more. Go to this posting to see a list of selective indexing directives which have already been implemented: http://wunderwood.org/most_casual_observer/2007/05/selective_page_indexing_direct.html

    The “don’t index” sense of this directive is known to be confusing. That could have been fixed with more than “a little” homework.

  47. I don’t know if diversity of content should be enough to drive more traffic to a website (which is what this seems to amount to). I guess you don’t have to use the feature if you don’t like it, but it won’t exactly be easy to implement if you do like it.

  48. Since when did search engines start dictating the search results?

  49. Robots-nocontent should be applied to international Yahoo sites as well.

  50. I like controlling this in robots.txt but I don’t think that CSS or Xpath is a good idea, it would be rare that you would need to be so specific.

    Simply providing the class or id in a new directive would be the best bet. Something like:

    Nocontent-class: advertising
    Nocontent-id: nav
    Nocontent-id: footer

  51. Can the class=”robots-nocontent” be applied to a as well as a div or span?

  52. Does this accomplish the same thing as Google AdSense section targeting? Just a lot more bytes?

  53. Ok, what if the links in the nav area are important? should they too be excluded? And if so, do I do that in the template so even the index page’s nav content is ignored?

    Finally I am not sure, does belong actually on the page? “Less than” td class=”robots-nocontent” “Greater than ” if the nav is in a table cell?

    thanks

    Michael