Tuesday, September 30, 2008

Advanced Website Diagnostics with Google Webmaster Tools

Running a website can be complicated—so we've provided Google Webmaster Tools to help webmasters recognize potential issues before they become real problems. Some of the issues that you can spot there are relatively small (such as having duplicate titles and descriptions), while other issues can be bigger (such as your website not being reachable at all). While Google Webmaster Tools can't tell you exactly what you need to change, it can help you recognize that there could be a problem that needs to be addressed.

Let's take a look at a few examples that we ran across in the Google Webmaster Help Groups:

Is your server treating Googlebot like a normal visitor?

While Googlebot tries to act like a normal user, some servers may get confused and react in strange ways. For example, although your server may work flawlessly most of the time, some servers running IIS may react with a server error (or some other action that is tied to a server error occurring) when visited by a user with Googlebot's user-agent. In the Webmaster Help Group, we've seen IIS servers return result code 500 (Server error) and result code 404 (File not found) in the "Web crawl" diagnostics section, as well as result code 302 when submitting Sitemap files. If your server is redirecting to an error page, you should make sure that we can crawl the error page and that it returns the proper result code. Once you've done that, we'll be able to show you these errors in Webmaster Tools as well. For more information about this issue and possible resolutions, please see the related discussions in the Webmaster Help Group.

If your website is hosted on a Microsoft IIS server, also keep in mind that URLs are case-sensitive by definition (and that's how we treat them). This includes URLs in the robots.txt file, which is something that you should be careful with if your server handles URLs in a non-case-sensitive way. For example, "Disallow: /paris" will block /paris but not /Paris.
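As a quick illustration, Python's built-in robots.txt parser matches paths case-sensitively, just as Googlebot does (example.com is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /paris",
])

# Path matching in robots.txt is case-sensitive, so only the
# lowercase variant is blocked:
print(rp.can_fetch("*", "http://example.com/paris"))  # False (blocked)
print(rp.can_fetch("*", "http://example.com/Paris"))  # True (allowed)
```

If your server treats /paris and /Paris as the same page, you'd need to disallow both spellings explicitly.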

Does your website have systematically broken links somewhere?

Modern content management systems (CMS) can make it easy to create issues that affect a large number of pages. Sometimes these issues are straightforward and visible when you view the pages; sometimes they're a bit harder to spot on your own. If an issue like this creates a large number of broken links, they will generally show up in the "Web crawl" diagnostics section in your Webmaster Tools account (provided those broken URLs return a proper 404 result code). In one recent case, a site had a small encoding issue in its RSS feed, resulting in over 60,000 bad URLs being found and listed in its Webmaster Tools account. As you can imagine, we would have preferred to spend time crawling content instead of these 404 errors :).
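One thing worth checking by hand is that your error pages really do return a 404 result code, rather than a "soft 404" (a 200 page that merely says "not found"), since soft 404s won't show up in the Web crawl diagnostics. Here's a minimal sketch of such a check; the marker strings are assumptions you'd tune for your own site's error pages:

```python
def is_soft_404(status, body):
    # A "soft 404": the server answers 200 OK, but the page body is
    # actually an error page. These won't appear in crawl diagnostics,
    # because to a crawler they look like ordinary pages.
    markers = ("page not found", "does not exist", "no longer available")
    return status == 200 and any(m in body.lower() for m in markers)

print(is_soft_404(404, "Not Found"))               # False: a real 404, which is what you want
print(is_soft_404(200, "Sorry, page not found."))  # True: a soft 404
print(is_soft_404(200, "Welcome to our store"))    # False: a normal page
```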

Is your website redirecting some users elsewhere?

For some websites, it can make sense to concentrate on a group of users in a certain geographic location. One method of doing that can be to redirect users located elsewhere to a different page. However, keep in mind that Googlebot might not be crawling from within your target area, so it might be redirected as well. This could mean that Googlebot will not be able to access your home page. If that happens, it's likely that Webmaster Tools will run into problems when it tries to confirm the verification code on your site, resulting in your site becoming unverified. This is not the only reason for a site becoming unverified, but if you notice this on a regular basis, it would be a good idea to investigate. On this subject, always make sure that Googlebot is treated the same way as other users from that location, otherwise that might be seen as cloaking.

Is your server unreachable when we try to crawl?

It can happen to the best of sites—servers can go down and firewalls can be overly protective. If that happens when Googlebot tries to access your site, we won't be able to crawl the website and you might not even know that we tried. Luckily, we keep track of these issues and you can spot "Network unreachable" and "robots.txt unreachable" errors in your Webmaster Tools account when we can't reach your site.

Has your website been hacked?

Hackers sometimes add strange, off-topic hidden content and links to questionable pages. If it's hidden, you might not even notice it right away; but nonetheless, it can be a big problem. While the Message Center may be able to give you a warning about some kinds of hidden text, it's best if you also keep an eye out yourself. Google Webmaster Tools can show you keywords from your pages in the "What Googlebot sees" section, so you can often spot a hack there. If you see totally irrelevant keywords, it would be a good idea to investigate what's going on. You might also try setting up Google Alerts or doing queries such as [site:example.com spammy words] (with your own domain in place of example.com), where "spammy words" might be words like porn, viagra, tramadol, sex or other words that your site wouldn't normally show. If you find that your site actually was hacked, I'd recommend going through our blog post about things to do after being hacked.

There are a lot of issues that can be recognized with Webmaster Tools; these are just some of the more common ones that we've seen lately. Because it can be really difficult to recognize some of these problems, it's a great idea to check your Webmaster Tools account to make sure that you catch any issues before they become real problems. If you spot something that you absolutely can't pin down, why not post in the discussion group and ask the experts there for help?

Have you checked your site lately?

Friday, September 26, 2008

Keeping comment spam off your site and away from users

So, you've set up a forum on your site for the first time, or enabled comments on your blog. You carefully craft a post or two, click the submit button, and wait with bated breath for comments to come in.

And they do come in. Perhaps you get a friendly note from a fellow blogger, a pressing update from an MMORPG guild member, or a reminder from your Aunt Millie about dinner on Thursday. But then you get something else. Something... disturbing. Offers for deals that are too good to be true, bizarre logorrhean gibberish, and explicit images you certainly don't want Aunt Millie to see. You are now buried in a deluge of dreaded comment spam.

Comment spam is bad stuff all around. It's bad for you, because it adds to your workload. It's bad for your users, who want to find information on your site and certainly aren't interested in dodgy links and unrelated content. It's bad for the web as a whole, since it discourages people from opening up their sites for user-contributed content and joining conversations on existing forums.

So what can you, as a webmaster, do about it?

A quick disclaimer: the list below is a good start, but not exhaustive. There are so many different blog, forum, and bulletin board systems out there that we can't possibly provide detailed instructions for each, so the points below are general enough to make sense on most systems.

Make sure your commenters are real people
  • Add a CAPTCHA. CAPTCHAs require users to read a bit of obfuscated text and type it back in to prove they're a human being and not an automated script. If your blog or forum system doesn't have CAPTCHAs built in, you may be able to find a plugin like reCAPTCHA, a project which also helps digitize old books. CAPTCHAs are not foolproof, but they make life a little more difficult for spammers. You can read more about the many different types of CAPTCHAs, but keep in mind that just adding a simple one can be fairly effective.

  • Block suspicious behavior. Many forums allow you to set time limits between posts, and you can often find plugins to look for excessive traffic from individual IP addresses or proxies and other activity more common to bots than human beings.

Use automatic filtering systems
  • Block obviously inappropriate comments by adding words to a blacklist. Spammers obfuscate words in their comments so this isn't a very scalable solution, but it can keep blatant spam at bay.

  • Use built-in features or plugins that delete or mark comments as spam for you. Spammers use automated methods to besmirch your site, so why not use an automated system to defend yourself? Comprehensive systems like Akismet, which has plugins for many blog and forum systems, and TypePad AntiSpam, which is open-source and compatible with Akismet, are easy to install and do most of the work for you.

  • Try using Bayesian filtering options, if available. Training the system to recognize spam may require some effort on your part, but this technique has been used successfully to fight email spam.
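As a sketch, the word blacklist mentioned above can be as simple as a case-insensitive substring check; the words and comments here are just placeholders for your own list:

```python
# Hypothetical blacklist of obviously spammy phrases.
BLACKLIST = {"viagra", "casino", "cheap pills"}

def is_blatant_spam(comment):
    # Case-insensitive substring match; crude, but it catches
    # the most blatant spam with almost no effort.
    text = comment.lower()
    return any(word in text for word in BLACKLIST)

print(is_blatant_spam("Buy CHEAP PILLS today!"))  # True
print(is_blatant_spam("Nice article, thanks."))   # False
```

As the post notes, spammers obfuscate words ("v1agra"), so treat this as a first line of defense, not the whole strategy.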
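And here's a toy sketch of the Bayesian idea, with a tiny made-up training set. Real systems train on thousands of labeled comments, but the log-probability scoring with add-one smoothing is the same in spirit:

```python
import math
import re
from collections import Counter

# Hypothetical training data: a few labeled comments.
spam = ["cheap pills buy now", "buy cheap watches now"]
ham = ["great post thanks", "see you at dinner thursday"]

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

spam_counts = Counter(t for doc in spam for t in tokens(doc))
ham_counts = Counter(t for doc in ham for t in tokens(doc))
vocab = set(spam_counts) | set(ham_counts)

def spam_score(comment):
    # Sum of log-likelihood ratios per token, with add-one smoothing
    # so unseen words don't zero out the probabilities.
    score = 0.0
    for t in tokens(comment):
        p_spam = (spam_counts[t] + 1) / (sum(spam_counts.values()) + len(vocab))
        p_ham = (ham_counts[t] + 1) / (sum(ham_counts.values()) + len(vocab))
        score += math.log(p_spam / p_ham)
    return score  # > 0 leans spam, < 0 leans ham

print(spam_score("buy cheap pills") > 0)     # True
print(spam_score("thanks, great post") > 0)  # False
```

The appeal of this approach is that it adapts to your site: each comment you mark as spam or ham makes the filter a little better.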

Make your settings a bit stricter
  • Nofollow untrusted links. Many systems have a setting to add a rel="nofollow" attribute to the links in comments, or do so by default. This may discourage some types of spam, but it's definitely not the only measure you should take.

  • Consider requiring users to create accounts before they can post a comment. This adds steps to the user experience and may discourage some casual visitors from posting comments, but may keep the signal-to-noise ratio higher as well.

  • Change your settings so that comments need to be approved before they show up on your site. This is a great tactic if you want to hold comments to a high standard, don't expect a lot of comments, or have a small, personal site. You may be able to allow employees or trusted users to approve posts themselves, spreading the workload. 

  • Think about disabling some types of comments. For example, you may want to disable comments on very old posts that are unlikely to get legitimate comments. On blogs you can often disable trackbacks and pingbacks, which are very cool features but can be major avenues for automated spam.

Keep your site up-to-date
  • Take the time to keep your software up-to-date and pay special attention to important security updates. Some spammers take advantage of security holes in older versions of blogs, bulletin boards, and other content management systems. Check the Quick Security Checklist for additional measures.

You may need to strike a balance on which tactics you choose to implement depending on your blog or bulletin board software, your user base, and your level of experience. Opening up a site for comments without any protection is a big risk, whether you have a small personal blog or a huge site with thousands of users. Also, if your forum has been completely filled with thousands of spam posts and doesn't even show up in Google searches, you may want to submit a reconsideration request after you clear out the bad content and take measures to prevent further spam.

As a long-time blogger and web developer myself, I can tell you that a little time spent setting up measures like these up front can save you a ton of time and effort later. I'm new to the Webmaster Central team, originally from Cleveland. I'm very excited to help fellow webmasters, and have a passion for usability and search quality (I've even done a bit of academic research on the topic). Please share your tips on preventing comment and forum spam in the comments below, and as always you're welcome to ask questions in our discussion group.

Tuesday, September 23, 2008

More webmaster questions - Answered!

When it comes to answering your webmaster-related questions, we just can't get enough. I wanted to follow up and answer some additional questions that webmasters asked in our latest installment of Popular Picks. In case you missed it, you can find our answers to image search ranking, sitelinks, reconsideration requests, redirects, and our communication with webmasters in this blog post.

For additional details on the questions answered in the video, check out the resources mentioned throughout the transcript below.
Video Transcript:

Hi everyone, I'm Reid from the Search Quality team. Today I'd like to answer some of the unanswered questions from our latest round of popular picks.

Searchmaster had a question about duplicate content. Understandably, this is a popular concern from webmasters. You should check out the Google Webmaster Central Blog, where my colleague Susan Moskwa recently posted "Demystifying the 'duplicate content penalty'," which answers many questions and concerns about duplicate content.

Jay is the Boss wanted to know if e-commerce websites suffer if they have two or more different themes. For example, you could have a site that sells auto parts, but also sightseeing guides. In general, I'd encourage webmasters to create a website that they feel is relevant for users. If it makes sense to sell auto parts and sightseeing guides, then go for it. Those are the sites that perform well, because users want to visit those sites, and they'll link to them as well.

emma2 wanted to know if Google will follow links on a page using the "noindex" attribute in the "robots" meta tag. The answer: Googlebot will follow links on a page which uses the meta "noindex" tag, but that page will not appear in our search results. As a reminder, if you would like to prevent Googlebot from following any links on a page, use the "nofollow" value in the "robots" meta tag.
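As an illustration of where these directives live, here's a small sketch that pulls the "robots" directives out of a page using Python's built-in HTML parser; the page markup is a made-up example:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the comma-separated directives from <meta name="robots" ...>.
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
p = RobotsMetaParser()
p.feed(page)
print(p.directives)  # ['noindex'] -- page kept out of results, links still followed
```

A content value of "noindex, nofollow" would keep the page out of results and stop link-following as well.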

Aaron Pratt wanted to know about some ways a webmaster can rank well for local searches. A quick recommendation is to add your business to the Local Business Center. There, you can add contact information, store operating hours, and coupons. Another tip is to purchase a country-specific top-level domain, or to use the geotargeting feature in Webmaster Tools.

jdeb901 said it would be helpful if we could let webmasters know if we are having problems with Webmaster Tools. This is an excellent point, and we're always thinking about better ways to communicate with webmasters. If you're having problems with Webmaster Tools, chances are someone else is as well, and they've posted to the Google Webmaster Help Group about this. In the past, if we've experienced problems with Webmaster Tools, we've also created a "sticky" post to let users know that we know about these issues with Webmaster Tools, and we're working to find a solution.

Well, that about wraps it up with our Popular Picks. Thanks again for all of your questions, and I look forward to seeing you around the group.

Monday, September 22, 2008

Dynamic URLs vs. static URLs

Chatting with webmasters often reveals widespread beliefs that might have been accurate in the past, but are not necessarily up-to-date any more. This was the case when we recently talked to a couple of friends about the structure of a URL. One friend was concerned about using dynamic URLs, since (as she told us) "search engines can't cope with these." Another friend thought that dynamic URLs weren't a problem at all for search engines and that these issues were a thing of the past. One even admitted that he never understood the fuss about dynamic URLs in comparison to static URLs. For us, that was the moment we decided to read up on the topic of dynamic and static URLs. First, let's clarify what we're talking about:

What is a static URL?
A static URL is one that does not change, so it typically does not contain any URL parameters. It might look something like this: http://www.example.com/archive/january.htm. You can search for static URLs on Google by typing filetype:htm in the search field. Updating these kinds of pages can be time consuming, especially if the amount of information grows quickly, since every single page has to be hard-coded. This is why webmasters who deal with large, frequently updated sites like online shops, forum communities, blogs or content management systems may use dynamic URLs.

What is a dynamic URL?
If the content of a site is stored in a database and pulled for display on pages on demand, dynamic URLs may be used. In that case the site serves basically as a template for the content. Usually, a dynamic URL would look something like this: http://www.example.com/forum/thread.php?threadid=12345&sort=date. You can spot dynamic URLs by looking for characters like: ? = &. Dynamic URLs have the disadvantage that different URLs can have the same content. So different users might link to URLs with different parameters which have the same content. That's one reason why webmasters sometimes want to rewrite their URLs to static ones.

Should I try to make my dynamic URLs look static?
Following are some key points you should keep in mind while dealing with dynamic URLs:
  1. It's quite hard to correctly create and maintain rewrites that change dynamic URLs to static-looking URLs.
  2. It's much safer to serve us the original dynamic URL and let us handle the problem of detecting and avoiding problematic parameters.
  3. If you want to rewrite your URL, please remove unnecessary parameters while maintaining a dynamic-looking URL.
  4. If you want to serve a static URL instead of a dynamic URL you should create a static equivalent of your content.

Which can Googlebot read better, static or dynamic URLs?
We've come across many webmasters who, like our friend, believed that static or static-looking URLs were an advantage for indexing and ranking their sites. This is based on the presumption that search engines have issues with crawling and analyzing URLs that include session IDs or source trackers. However, as a matter of fact, we at Google have made some progress in both areas. While static URLs might have a slight advantage in terms of clickthrough rates because users can easily read the URLs, the decision to use database-driven websites does not imply a significant disadvantage in terms of indexing and ranking. Providing search engines with dynamic URLs should be favored over hiding parameters to make them look static.

Let's now look at some of the widespread beliefs concerning dynamic URLs and correct some of the assumptions which spook webmasters. :)

Myth: "Dynamic URLs cannot be crawled."
Fact: We can crawl dynamic URLs and interpret the different parameters. We might have problems crawling and ranking your dynamic URLs if you try to make your URLs look static and in the process hide parameters which offer the Googlebot valuable information. One recommendation is to avoid reformatting a dynamic URL to make it look static. It's always advisable to use static content with static URLs as much as possible, but in cases where you decide to use dynamic content, you should allow us to analyze your URL structure and not remove information by hiding parameters and making them look static.

Myth: "Dynamic URLs are okay if you use fewer than three parameters."
Fact: There is no limit on the number of parameters, but a good rule of thumb would be to keep your URLs short (this applies to all URLs, whether static or dynamic). You may be able to remove some parameters which aren't essential for Googlebot and offer your users a nice looking dynamic URL. If you are not able to figure out which parameters to remove, we'd advise you to serve us all the parameters in your dynamic URL and our system will figure out which ones do not matter. Hiding your parameters keeps us from analyzing your URLs properly and we won't be able to recognize the parameters as such, which could cause a loss of valuable information.

Following are some questions we thought you might have at this point.

Does that mean I should avoid rewriting dynamic URLs at all?
That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases. If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site. However, if you're using URL rewriting (rather than making a copy of the content) to produce static-looking URLs from a dynamic site, you could be doing harm rather than good. Feel free to serve us your standard dynamic URL and we will automatically find the parameters which are unnecessary.

Can you give me an example?
If you have a dynamic URL which is in the standard format like foo?key1=value1&key2=value2, we recommend that you leave the URL unchanged, and Google will determine which parameters can be removed; or you could remove unnecessary parameters for your users. Be careful that you only remove parameters which do not matter. Here's an example of a URL with a couple of parameters: http://www.example.com/article/bin/answer.foo?language=en&answer=3&sid=8971298178906&query=URL
  • language=en - indicates the language of the article
  • answer=3 - the article has the number 3
  • sid=8971298178906 - the session ID number is 8971298178906
  • query=URL - the query with which the article was found is [URL]
Not all of these parameters offer additional information. So rewriting the URL to http://www.example.com/article/bin/answer.foo?language=en&answer=3 probably would not cause any problems, as all irrelevant parameters are removed.
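The parameter removal described above can be sketched with Python's standard URL utilities; which parameters count as irrelevant (here, sid and query) is something you'd have to determine for your own site:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of parameters that don't affect the page content.
IRRELEVANT = {"sid", "query"}

def strip_irrelevant(url):
    # Drop the irrelevant parameters but keep the URL dynamic
    # (i.e. keep the ?key=value format) for the rest.
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IRRELEVANT]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_irrelevant(
    "http://www.example.com/article/bin/answer.foo"
    "?language=en&answer=3&sid=8971298178906&query=URL"))
# http://www.example.com/article/bin/answer.foo?language=en&answer=3
```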

The following are some examples of static-looking URLs which may cause more crawling problems than serving the dynamic URL without rewriting:
  • www.example.com/article/bin/answer.foo/language=en/answer=3/sid=8971298178906/query=URL
  • www.example.com/article/bin/answer.foo/en/3/8971298178906/URL
Rewriting your dynamic URL to one of these examples could cause us to crawl the same piece of content needlessly via many different URLs with varying values for session IDs (sid) and query. These forms make it difficult for us to understand that "URL" and "8971298178906" have nothing to do with the actual content which is returned via this URL. However, here's an example of a rewrite where all irrelevant parameters have been removed: www.example.com/article/bin/answer.foo/en/3. Although we are able to process this URL correctly, we would still discourage you from using this rewrite, as it is hard to maintain and needs to be updated as soon as a new parameter is added to the original dynamic URL. Failure to do so would again result in a static-looking URL which hides parameters. So the best solution is often to keep your dynamic URLs as they are. Or, if you remove irrelevant parameters, bear in mind to keep the URL dynamic, for example: www.example.com/article/bin/answer.foo?language=en&answer=3
We hope this article helps shed some light on the various assumptions around dynamic URLs, for you and for our friends. Please feel free to join our discussion group if you have any further questions.

Tuesday, September 16, 2008

Webmaster Tools made easier in French, Italian, German and Spanish

We're always working on new ways to make life a bit easier for webmasters. We've had great feedback on many of the initiatives that have taken place in Webmaster Tools and beyond, but given the complex nature of managing a website, there are some questions regarding the tools that come up quite often across the Webmaster Help Groups. This got us thinking: how can we best address these questions?

Well, if you're like me, then you find it a lot easier to learn how to use something if you actually get to see someone else doing it first; with that in mind, we'll launch a series of six video tutorials in French, German, Italian and Spanish over the next couple of months. The videos will take you through the basics of Webmaster Tools as well as how to use the information in the tools to make improvements to your site and hence your site's visibility in Google's index.

Our first video provides an overview of the different information you can access depending on whether you've verified ownership of your site in Webmaster Tools. We'll also explain the different verification methods available. And just to whet your appetite, here are the topics covered in the series:

Video 1: Getting started, signing in, benefits of verifying a site
Video 2: Setting preferences for crawling and indexing
Video 3: Creating and submitting Sitemaps
Video 4: Removing and preventing your content from being indexed
Video 5: Utilizing the Diagnostics, Statistics and Links sections
Video 6: Communicating between Webmasters and Google

You can access the first of these videos in the links provided below and keep a lookout in the local Webmaster Help Groups for upcoming video releases.

Italian Video Tutorials - Italian Webmaster Help Group
Latin America and Spain Video Tutorials - Spanish Webmaster Help Group
French Video Tutorials - French Webmaster Help Group
German Video Tutorials - German Webmaster Help Group - German Webmaster Blog


Friday, September 12, 2008

Demystifying the "duplicate content penalty"

Duplicate content. There's just something about it. We keep writing about it, and people keep asking about it. In particular, I still hear a lot of webmasters worrying about whether they may have a "duplicate content penalty."
Let's put this to bed once and for all, folks: There's no such thing as a "duplicate content penalty." At least, not in the way most people mean when they say that.
There are some penalties that are related to the idea of having the same content as another site—for example, if you're scraping content from other sites and republishing it, or if you republish content without adding any additional value. These tactics are clearly outlined (and discouraged) in our Webmaster Guidelines:
  • Don't create multiple pages, subdomains, or domains with substantially duplicate content.
  • Avoid... "cookie cutter" approaches such as affiliate programs with little or no original content.
  • If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.
(Note that while scraping content from others is discouraged, having others scrape you is a different story; check out this post if you're worried about being scraped.)
But most site owners whom I hear worrying about duplicate content aren't talking about scraping or domain farms; they're talking about things like having multiple URLs on the same domain that point to the same content. Like www.example.com/skates.asp?color=black&brand=riedell and www.example.com/skates.asp?brand=riedell&color=black. Having this type of duplicate content on your site can potentially affect your site's performance, but it doesn't cause penalties. From our article on duplicate content:
Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don't follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.
This type of non-malicious duplication is fairly common, especially since many CMSs don't handle this well by default. So when people say that having this type of duplicate content can affect your site, it's not because you're likely to be penalized; it's simply due to the way that web sites and search engines work.
Most search engines strive for a certain level of variety; they want to show you ten different results on a search results page, not ten different URLs that all have the same content. To this end, Google tries to filter out duplicate documents so that users experience less redundancy. You can find details in this blog post, which states:
  1. When we detect duplicate content, such as through variations caused by URL parameters, we group the duplicate URLs into one cluster.
  2. We select what we think is the "best" URL to represent the cluster in search results.
  3. We then consolidate properties of the URLs in the cluster, such as link popularity, to the representative URL.
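As a rough sketch of step 1 above: URLs that differ only in parameter order, or in an ignorable parameter such as a session ID, can be grouped under one canonical key. The URLs and the choice of "sid" as an ignorable parameter are made-up examples:

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonical_key(url, ignore=("sid",)):
    # Sort the query parameters and drop ignorable ones, so that
    # URLs serving the same content map to the same key.
    p = urlsplit(url)
    params = sorted((k, v) for k, v in parse_qsl(p.query) if k not in ignore)
    return urlunsplit((p.scheme, p.netloc, p.path, urlencode(params), ""))

urls = [
    "http://example.com/skates.asp?color=black&brand=riedell",
    "http://example.com/skates.asp?brand=riedell&color=black",
    "http://example.com/skates.asp?brand=riedell&color=black&sid=123",
]
clusters = defaultdict(list)
for u in urls:
    clusters[canonical_key(u)].append(u)

print(len(clusters))  # 1 -- all three URLs land in one cluster
```

Steps 2 and 3, choosing the representative URL and consolidating signals like link popularity, then operate on each cluster rather than on individual URLs.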
Here's how this could affect you as a webmaster:
  • In step 2, Google's idea of what the "best" URL is might not be the same as your idea. If you want to have control over whether www.example.com/skates.asp?color=black&brand=riedell or www.example.com/skates.asp?brand=riedell&color=black gets shown in our search results, you may want to take action to mitigate your duplication. One way of letting us know which URL you prefer is by including the preferred URL in your Sitemap.
  • In step 3, if we aren't able to detect all the duplicates of a particular page, we won't be able to consolidate all of their properties. This may dilute the strength of that content's ranking signals by splitting them across multiple URLs.
In most cases Google does a good job of handling this type of duplication. However, you may also want to consider content that's being duplicated across domains. In particular, deciding to build a site whose purpose inherently involves content duplication is something you should think twice about if your business model is going to rely on search traffic, unless you can add a lot of additional value for users. For example, we sometimes hear from affiliates who are having a hard time ranking for content that originates solely from Amazon. Is this because Google wants to stop them from trying to sell Everyone Poops? No; it's because how the heck are they going to outrank Amazon if they're providing the exact same listing? Amazon has a lot of online business authority (most likely more than a typical Amazon affiliate site does), and the average Google search user probably wants the original information on Amazon, unless the affiliate site has added a significant amount of additional value.
Lastly, consider the effect that duplication can have on your site's bandwidth. Duplicated content can lead to inefficient crawling: when Googlebot discovers ten URLs on your site, it has to crawl each of those URLs before it knows whether they contain the same content (and thus before we can group them as described above). The more time and resources that Googlebot spends crawling duplicate content across multiple URLs, the less time it has to get to the rest of your content.
In summary: Having duplicate content can affect your site in a variety of ways; but unless you've been duplicating deliberately, it's unlikely that one of those ways will be a penalty. This means that:
  • You typically don't need to submit a reconsideration request when you're cleaning up innocently duplicated content.
  • If you're a webmaster of beginner-to-intermediate savviness, you probably don't need to put too much energy into worrying about duplicate content, since most search engines have ways of handling it.
  • You can help your fellow webmasters by not perpetuating the myth of duplicate content penalties! The remedies for duplicate content are entirely within your control. Here are some good places to start.

Wednesday, September 10, 2008

Your burning questions - Answered!

In a recent blog post highlighting our Webmaster Help Group, I asked for your webmaster-related questions. In our second installment of Popular Picks, we hoped to discover which issues webmasters wanted to learn more about, and then respond with some better documentation on those topics. It looks like it was a success.
Thanks again for your questions! See you around the group.

Friday, September 5, 2008

Workin' it on all browsers

To web surfers, Google Chrome is a quick, exciting new browser. To webmasters, it's a good reminder that regardless of the browser your visitors use to access your site—Firefox, Internet Explorer, Google Chrome, Safari, etc.—browser compatibility is often a high priority. When your site renders poorly or is difficult to use on many browsers, you risk losing your visitors' interest, and, if you're running a monetized site, perhaps their business. Here's a quick list to make sure you're covering the basics:

Step 1: Ensure browser compatibility by focusing on accessibility
The same techniques that make your site more accessible to search engines, such as using static HTML rather than fancy features like AJAX, often help your site's compatibility on various browsers and numerous browser versions. Simpler HTML is often more easily cross-compatible than the latest techniques.

Step 2: Consider validating your code
If your code passes validation, you've eliminated one potential issue in browser compatibility. With validated code, you won't need to rely on each browser's error-handling techniques. There's a greater chance that your code will function across different browsers, and it's easier to debug potential problems.

Step 3: Check that it's usable (not just properly rendered)
It's important that your site displays well, but it's equally important to make sure that users can actually use your site's features in their browser. Rather than just looking at a snapshot of your site, try navigating through your site on various browsers or adding items to your shopping cart. It's possible that the clickable area of a linked image or button may change from browser to browser. Additionally, if you use JavaScript for components like your shopping cart, it may work in one browser but not another.

Step 4: Straighten out the kinks
This step requires some trial and error, but there are several good places to help reduce the "trials" as you make your site cross-browser compatible. Doctype is an open source reference with test cases for cross-browser compatibility, as well as CSS tips and tricks.

For example, let's say you're wondering how to find the offset for an element on your page. You notice that your code works in Internet Explorer, but not Firefox and Safari. It turns out that certain browsers are a bit finicky when it comes to finding the offset—thankfully contributors to Doctype provide the code to work around the issue.

Step 5: Share your browser compatibility tips and resources!
We'd love to hear the steps you're taking to ensure your site works for the most visitors. We've written a more in-depth Help Center article on the topic which discusses such things as specifying a character encoding. If you have additional tips, please share. And, if you have browser compatibility questions regarding search, please ask!