Tuesday, July 31, 2012

Introducing the Structured Data Dashboard

Webmaster level: All

Structured data is becoming an increasingly important part of the web ecosystem. Google makes use of structured data in a number of ways including rich snippets which allow websites to highlight specific types of content in search results. Websites participate by marking up their content using industry-standard formats and schemas.

To provide webmasters with greater visibility into the structured data that Google knows about for their website, we’re introducing today a new feature in Webmaster Tools - the Structured Data Dashboard. The Structured Data Dashboard has three views: site, item type and page-level.

Site-level view
At the top level, the Structured Data Dashboard, which is under Optimization, aggregates this data (by root item type and vocabulary schema).  Root item type means an item that is not an attribute of another on the same page.  For example, the site below has about 2 million Schema.Org annotations for Books (“”)

Itemtype-level view
It also provides per-page details for each item type, as seen below:

Google parses and stores a fixed number of pages for each site and item type. They are stored in decreasing order by the time in which they were crawled. We also keep all their structured data markup. For certain item types we also provide specialized preview columns as seen in this example below (e.g. “Name” is specific to Product).

The default sort order is such that it would facilitate inspection of the most recently added Structured Data.

Page-level view
Last but not least, we have a details page showing all attributes of every item type on the given page (as well as a link to the Rich Snippet testing tool for the page in question).

Webmasters can use the Structured Data Dashboard to verify that Google is picking up new markup, as well as to detect problems with existing markup, for example monitor potential changes in instance counts during site redesigns.

Friday, July 27, 2012

New notifications about inbound links

Webmaster level: Advanced

Lots of site owners use our webmaster console to see how their site is doing in Google. Last week we began sending new messages to sites with a pattern of unnatural links pointing to them, and I wanted to give more context about these new messages.

Original Link Messages 

First, let's talk about the original link messages that we've been sending out for months. When we see unnatural links pointing to a site, there are different ways we can respond. In many severe cases, we reduce our trust in the entire site. For example, that can happen when we believe a site has been engaging in a pretty widespread pattern of link spam over a long period of time. If your site is notified for these unnatural links, we recommend removing as many of the spammy or low-quality links as you possibly can and then submitting a reconsideration request for your site.

In a few situations, we have heard about directories or blog networks that won't take links down. If a website tries to charge you to put links up and to take links down, feel free to let us know about that, either in your reconsideration request or by mentioning it on our webmaster forum or in a separate spam report. We have taken action on several such sites, because they often turn out to be doing link spamming themselves.

New Link Messages 

In less severe cases, we sometimes target specific spammy or artificial links created as part of a link scheme and distrust only those links, rather than taking action on a site’s overall ranking. The new messages make it clear that we are taking "targeted action on the unnatural links instead of your site as a whole." The new messages also lack the yellow exclamation mark that other messages have, which tries to convey that we're addressing a situation that is not as severe as the previous "we are losing trust in your entire site" messages.

How serious are these new link messages? 

These new messages are worth your attention. Fundamentally, it means we're distrusting some links to your site. We often take this action when we see a site that is mostly good but might have some spammy or artificial links pointing to it (widgetbait, paid links, blog spam, guestbook spam, excessive article directory submissions, excessive link exchanges, other types of linkspam, etc.). So while the site's overall rankings might not drop directly, likewise the site might not be able to rank for some phrases. I wouldn't classify these messages as purely advisory or something to be ignored, or only for innocent sites.

On the other hand, I don't want site owners to panic. We do use this message some of the time for innocent sites where people are pointing hacked anchor text to their site to try to make them rank for queries like [buy viagra].

Example scenario: widget links 

A fair number of site owners emailed me after receiving one of the new messages, and I think it might be helpful if I paraphrased some of their situations to give you an idea of what it might mean if you get one of these messages.

The first example is widget links. An otherwise white-hat site emailed me about the message. Here's what I wrote back, with the identifying details removed:

"Looking into the very specific action that we took, I think we did the right thing. Take URL1 and URL2 for example. These pages are using your EXAMPLE1 widgets, but the pages include keyword-rich anchortext pointing to your site's url. One widget has the link ANCHORTEXT1 and the other has ANCHORTEXT2. 

If you do a search for [widgetbait matt cutts] you'll find tons of stories where I discourage people from putting keyword-rich anchortext into their widgets; see for example. So this message is a way to tell you that not only are those links in your widget not working, they're probably keeping that page from ranking for the phrases that you're using." 

Example scenario: paid links 

The next example is paid links. I wrote this email to someone:

"I wouldn't recommend that Company X ignore this message. For example, check out SPAMMY_BLOG_POST_URL. That's a link from a very spammy website, and it calls into question the linkbuilding techniques that Company X has been using (we also saw a bunch of links due to widgets). These sorts of links are not helping Company X, and it would be worth their time to review how and why they started gathering links like this." 

I also wrote to another link building SEO who got this message pointing out that the SEO was getting links from a directory that appeared to offer only paid links that pass PageRank, and so we weren't trusting links like that.

Here's a final example of paid links. I emailed about one company's situation as follows:

"Company Y is getting this message because we see a long record of buying paid links that pass PageRank. In particular, we see a lot of low-quality 'sponsored posts' with keyword-rich anchortext where the links pass PageRank. The net effect is that we distrust a lot of links to this site. Here are a couple examples: URL1 and URL2. Bear in mind that we have more examples of these paid posts, but these two examples give a flavor of the sort of thing that should really be resolved. My recommendation would be to get these sort of paid posts taken down, and then Company Y could submit a reconsideration request. Otherwise, we'll continue to distrust quite a few links to the site." 

Example scenario: reputation management 

In some cases we're ignoring links to a site where the site itself didn't violate our guidelines. A good example of that is reputation management. We had two groups write in; one was a large news website, while the other was a not-for-profit publisher. Both had gotten the new link message. In one case, it appeared that a "reputation management" firm was using spammy links to try to push up positive articles on the news site, and we were ignoring those links to the news site. In the other case, someone was trying to manipulate the search results for a person's name by buying links on a well-known paid text link ad network. Likewise, we were just ignoring those specific links, and the not-for-profit publisher didn't need to take any action.

What should I do if I get the new link message? 

We recently launched the ability to download backlinks to your site sorted by date. If you get this new link message, you may want to check your most recent links to spot anything unusual going on. If you discover that someone in your company has been doing widgetbait, paid links, or serious linkspam, it's worth cleaning that up and submitting a reconsideration request. We're also looking at some ways to provide more concrete examples to make these messages more actionable and to help narrow down where to look when you get one.

Just to give you some context, less than 20,000 domains received these new messages—that's less than one-tenth the number of messages we send in a typical month—and that's only because we sent out messages retroactively to any site where we had distrusted some of the sites' backlinks. Going forward, based on our current level of action, on average only about 10 sites a day will receive this message. 

Summing up 

I hope this post and some of the examples above will help to convey the nuances of this new message. If you get one of these new messages, it's not a cause for panic, but neither should you completely ignore it. The message says that the current incident isn't affecting our opinion of the entire website, but it is affecting our opinion of some links to the website, and the site might not rank as well for some phrases as a result.

This message reflects an issue of moderate severity, and we're trying to find the right way to alert people that their site may have a potential issue (and it's worth some investigation) without overly stressing out site owners either. But we wanted to take this extra step toward more transparency now so that we can let site owners know when they might want to take a closer look at their current links.

Tuesday, July 24, 2012

Behold Google index secrets, revealed!

Webmaster level: All

Since Googlebot was born, webmasters around the world have been asking one question: Google, oh, Google, are my pages in the index? Now is the time to answer that question using the new Index Status feature in Webmaster Tools. Whether one or one million, Index Status will show you how many pages from your site have been included in Google’s index.

Index Status is under the Health menu. After clicking on it you’ll see a graph like the following:

It shows how many pages are currently indexed. The legend shows the latest count and the graph shows up to one year of data.

If you see a steadily increasing number of indexed pages, congratulations! This should be enough to confirm that new content on your site is being discovered, crawled and indexed by Google.

However, some of you may find issues that require looking a little bit deeper. That’s why we added an Advanced tab to the feature. You can access it by clicking on the button at the top, and it will look like this:

The advanced section will show not only totals of indexed pages, but also the cumulative number of pages crawled, the number of pages that we know about which are not crawled because they are blocked by robots.txt, and also the number of pages that were not selected for inclusion in our results.

Notice that the counts are always totals. So, for example, if on June 17th the count for indexed pages is 92, that means that there are a total of 92 pages indexed at this point in time, not that 92 pages were added to the index on that day only. In particular for sites with a long history, the count of pages crawled may be very big in comparison with the number of pages indexed.

All this data can be used to identify and debug a variety of indexing-related problems. For example, if some of your content doesn’t appear any more on Google and you notice that the graph of pages indexed has a sudden drop, that may be an indication that you introduced a site-wide error when using meta=”noindex” and now Google isn’t including your content in search results.

Another example: if you change the URL structure of your site and don’t follow our recommendations for moving your site, you may see a jump in the count of “Not selected”. Fixing the redirects or rel=”canonical” tags should help get better indexing coverage.

We hope that Index Status will bring more transparency into Google’s index selection process and help you identify and fix indexing problems with your sites. And if you have questions, don’t hesitate to ask in our Help Forum.

Posted by , and, Webmaster Tools Team

Monday, July 16, 2012

On web semantics

Webmaster level: All

In web development context, semantics refers to semantic markup, which means markup used according to its meaning and purpose.

Markup used according to its purpose means using heading elements (for instance, h1 to h6) to mark up headings, paragraph elements (p) for paragraphs, lists (ul, ol, dl, also datalist or menu) for lists, tables for data tables, and so on.

Stating the obvious became necessary in the old days, when the Web consisted of only a few web sites and authors used tables to code entire sites, table cells or paragraphs for headings, and thought about other creative ways to achieve the layout they wanted. (Admittedly, these authors had fewer instruments at their disposal than authors have today. There were times when coding a three column layout was literally impossible without using tables or images.)

Up until today authors were not always certain about what HTML element to use for what functional unit in their HTML page, though, and “living” specs like HTML 5 require authors to keep an eye on what elements will be there going forward to mark up what otherwise calls for “meaningless” fallback elements like div or span.

To know what elements HTML offers, and what meaning these elements have, it’s necessary to consult the HTML specs. There are indices—covering all HTML specs and elements—that make it a bit simpler to look up and find out the meaning of an element. However, in many cases it may be necessary to check what the HTML spec says.

For example, take the code element:

The code element represents a fragment of computer code. This could be an XML element name, a filename, a computer program, or any other string that a computer would recognize.

Author-controlled semantics

HTML elements carry meaning as defined by the HTML specs, yet ID and class names can bear meaning too. ID and class names, just like microdata, are typically under author control, the only exception being microformats. (We will not cover microdata or microformats in this article.)

ID and class names give authors a lot of freedom to work with HTML elements. There are a few basic rules of thumb that, when followed, make sure this freedom doesn’t turn into problems:

Advantages of using semantic markup

Using markup according to how it’s meant to be used, as well as modest use of functional ID and class names, has several advantages:

  • It’s the professional thing to do.
  • It’s more accessible.
  • It’s more maintainable.

Special cases

“Neutral” elements, elements with ambiguous meaning, and presentational elements constitute special cases.

div and span offer a “generic mechanism for adding structure to documents.” They can be used whenever there is no other element available that matches what the contents in question represent.

In the past a lot of confusion was caused by the b, strong, i, and em elements. Authors cursed b and i for being presentational, and typically suggested a 1:1 replacement with strong and em. Not to stir up the past, here’s what HTML 5 says, granting all four elements a raison d’être:

b “a span of text to be stylistically offset from the normal prose without conveying any extra importance, such as key words in a document abstract, product names in a review, or other spans of text whose typical typographic presentation is boldened” <p>The <b>frobonitor</b> and <b>barbinator</b> components are fried.
strong “strong importance for its contents” <p><strong>Warning.</strong> This dungeon is dangerous.
i “a span of text in an alternate voice or mood, or otherwise offset from the normal prose, such as a taxonomic designation, a technical term, an idiomatic phrase from another language, a thought, a ship name, or some other prose whose typical typographic presentation is italicized” <p>The term <i>prose content</i> is defined above.
em “stress emphasis of its contents” <p><em>Cats</em> are cute animals.

Last but not least, there are truly presentational elements. These elements will be supported by user agents (browsers) for forever but shouldn’t be used anymore as presentational markup is not maintainable, and should be handled by style sheets instead. Some popular ones are:

  • center
  • font
  • s
  • u

How to tell whether you’re on track

A quick and dirty way to check the semantics of your page and understand how it might be interpreted by a screen reader is to disable CSS, for example using the Web Developer Toolbar extension available for Chrome and Firefox. This only identifies issues around the use of CSS to convey meaning, but can still be helpful.

There are also tools like W3C’s semantic data extractor that provide cues on the meaningfulness of your HTML code.

Other methods range from peer reviews (coding best practices) to user testing (accessibility).

Do’s and Don’ts

Don’t Do Reason
<p class"heading">foo</p>
For headings there are heading elements.
<p><font size="2">bar</font></p>

p { font-size: 1em; }
Presentational markup is expensive to maintain.
<td class="heading">baz</td>
Use table elements for tabular data.
<div class="newrow">foo</div>
<div class="newrow">bar</div>
Use table elements for tabular data.
foo bar.<br><br>baz scribble.
<p>foo bar.</p>
<p>baz scribble.</p>
Denote paragraphs by paragraph elements, not line breaks.

Thursday, July 12, 2012

New Crawl Error alerts from Webmaster Tools

Webmaster level: All

Today we’re rolling out Crawl Error alerts to help keep you informed of the state of your site.

Since Googlebot regularly visits your site, we know when your site exhibits connectivity issues or suddenly spikes in pages returning HTTP error response codes (e.g. 404 File Not Found, 403 Forbidden, 503 Service Unavailable, etc). If your site is timing out or is exhibiting systemic errors when accessed by Googlebot, other visitors to your site might be having the same problem!

When we see such errors, we may send alerts –- in the form of messages in the Webmaster Tools Message Center –- to let you know what we’ve detected. Hopefully, given this increased communication, you can fix potential issues that may otherwise impact your site’s visitors or your site’s presence in search.

As we discussed in our blog post announcing the new Webmaster Tools Crawl Errors feature, we divide crawl errors into two types: Site Errors and URL Errors.

Site Error alerts for major site-wide problems

Site Errors represent an inability to connect to your site, and represent systemic issues rather than problems with specific pages. Here are some issues that might cause Site Errors:
  • Your DNS server is down or misconfigured.
  • Your web server itself is firewalled off.
  • Your web server is refusing connections from Googlebot.
  • Your web server is overloaded, or down.
  • Your site’s robots.txt is inaccessible.
These errors are global to a site, and in theory should never occur for a well-operating site (and don’t occur for the large majority of the sites we crawl). If Googlebot detects any appreciable number of these Site Errors, regardless of the size of your site, we’ll try to notify you in the form of a message in the Message Center:

Example of a Site Error alert
The alert provides the number of errors Googlebot encountered crawling your site, the overall crawl error connection rate for your site, a link to the appropriate section of Webmaster Tools to examine the data more closely, and suggestions as to how to fix the problem.

If your site shows a 100% error rate in one of these categories, it likely means that your site is either down or misconfigured in some way. If your site has an error rate less than 100% in any of these categories, it could just indicate a transient condition, but it could also mean that your site is overloaded or improperly configured. You may want to investigate these issues further, or ask about them on our forum.

We may alert you even if the overall error rate is very low — in our experience a well configured site shouldn’t have any errors in these categories.

URL Error anomaly alerts for potentially less critical issues

Whereas any appreciable number of Site Errors could indicate that your site is misconfigured, overloaded, or simply out of service, URL Errors (pages that return a non-200 HTTP code, or incorrectly return an HTTP 200 code in the case of soft 404 errors) may occur on any well-configured site. Because different sites have different numbers of pages and different numbers of external links, a count of errors that indicates a serious problem for a small site might be entirely normal for a large site.

That’s why for URL Errors we only send alerts when we detect a large spike in the number of errors for any of the five categories of errors (Server error, Soft 404, Access denied, Not found or Not followed). For example, if your site routinely has 100 pages with 404 errors, we won’t alert you if that number fluctuates minimally. However we might notify you when that count reaches a much higher number, say 500 or 1,000. Keep in mind that seeing 404 errors is not always bad, and can be a natural part of a healthy website (see our previous blog post: Do 404s hurt my site?).

A large spike in error count could be because something has changed on your site — perhaps a reconfiguration has changed the permissions for a section of your site, or a new version of a script is crashing regularly, or someone accidentally moved or deleted an entire directory, or a reorganization of your site causes external links to no longer work. It could also just be a transient spike, or could be because of external causes (someone has linked to non-existent pages), so there might not even be a problem; but when we see an unusually large number of errors for your site, we’ll let you know so you can investigate:

Example of a URL Error anomaly alert
The alert describes the category of web errors for which we’ve detected a spike, gives a link to the appropriate section of Webmaster Tools so that you can see what pages we think are problematic, and offers troubleshooting suggestions.

Enable Message forwarding to send alerts to your inbox

We know you’re busy, and that routinely checking Webmaster Tools just to check for new alerts might be something you forget to do. Consider turning on Message forwarding. We’ll send any Webmaster Tools messages to the email address of your choice.

Let us know what you think, and if you have any comments or suggestions on our new alerts please visit our forum.