
Wednesday, April 23, 2008

Where in the world is your site?



The Set Geographic Target tool in Webmaster Tools lets you associate your site with a specific region. We've heard a lot of questions from webmasters about how to use the tool, and here Webmaster Trends Analyst Susan Moskwa explains how it works and when to use it.



The http://www.google.ca/ example in the video is a little hard to see, so here's a screenshot:

the Google Canada home page

Want to know more about setting a geographic target for your site? Check out our Help Center. And if you like this video, you can see more on our Webmaster Tools playlist on YouTube.

Retiring support for OAI-PMH in Sitemaps



When we originally launched Sitemaps, we included support for the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) 2.0, an interoperability framework based on metadata harvesting. Since then, however, we've found that the information we gain from supporting OAI-PMH is disproportionate to the resources required to support it. Fewer than 200 sites currently use OAI-PMH for Google Sitemaps.

In order to move forward with even better coverage of your websites, we have decided to support only the standard XML Sitemap format by May 2008. We are in the process of notifying sites using OAI-PMH to alert them of the change.

If you have been using OAI-PMH as a Google Sitemap feed, we would love to see you adopt the industry standard XML Sitemap format. This format is supported by all of the major search engines and helps to make sure that everyone is able to find your new and updated content as soon as you make it available.
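For sites making the switch, a Sitemap file is just a small XML document. As an illustration (not an official tool), the sketch below builds a minimal Sitemap with Python's standard library; the URL and date are placeholders.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build a minimal XML Sitemap from (loc, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Placeholder entry for illustration:
xml = build_sitemap([("http://www.example.com/", "2008-04-01")])
```

A real Sitemap can also carry optional `changefreq` and `priority` elements per URL; the two shown here are the most commonly used.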

If you have any questions regarding the move to XML Sitemap files, feel free to post in our Google discussion group for Sitemaps.

Wednesday, April 16, 2008

Best practices when moving your site

Planning on moving your site to a new domain? Lots of webmasters find this a scary process. How do you do it without hurting your site's performance in Google search results?


moving your site
Your aim is to make the transition invisible and seamless to the user, and to make sure that Google knows that your new pages should get the same quality signals as the pages on your old site. When you're moving your site, pesky 404 (File Not Found) errors can harm the user experience and hurt your site's performance in Google search results.

Let's cover moving your site to a new domain (for instance, changing from www.example.com to www.example.org). This is different from moving to a new IP address; read this post for more information on that.

Here are the main points:

  • Test the move process by moving the contents of one directory or subdomain first. Then use a 301 Redirect to permanently redirect those pages on your old site to your new site. This tells Google and other search engines that your site has permanently moved.
  • Once this is complete, check to see that the pages on your new site are appearing in Google's search results. When you're satisfied that the move is working correctly, you can move your entire site. Don't do a blanket redirect directing all traffic from your old site to your new home page. This will avoid 404 errors, but it's not a good user experience. A page-to-page redirect (where each page on the old site gets redirected to the corresponding page on the new site) is more work, but gives your users a consistent and transparent experience. If there won't be a 1:1 match between pages on your old and new site, try to make sure that every page on your old site is at least redirected to a new page with similar content.
  • If you're changing your domain because of site rebranding or redesign, you might want to think about doing this in two phases: first, move your site; and second, launch your redesign. This manages the amount of change your users see at any stage in the process, and can make the process seem smoother. Keeping the variables to a minimum also makes it easier to troubleshoot unexpected behavior.
  • Check both external and internal links to pages on your site. Ideally, you should contact the webmaster of each site that links to yours and ask them to update the links to point to the page on your new domain. If this isn't practical, make sure that all pages with incoming links are redirected to your new site. You should also check internal links within your old site, and update them to point to your new domain. Once your content is in place on your new server, use a link checker like Xenu to make sure you don't have broken legacy links on your site. This is especially important if your original content included absolute links (like www.example.com/cooking/recipes/chocolatecake.html) instead of relative links (like .../recipes/chocolatecake.html).
  • To prevent confusion, it's best to make sure you retain control of your old site domain for at least 180 days.
  • Finally, keep both your new and old site verified in Webmaster Tools, and review crawl errors regularly to make sure that the 301s from the old site are working properly, and that the new site isn't showing unwanted 404 errors.
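To make the page-to-page idea concrete, here is a hypothetical sketch in Python of the mapping logic behind such redirects. The hostnames and paths are placeholders, and a real deployment would configure the 301s in the web server itself rather than in application code.

```python
# Hypothetical example hosts for a domain move (placeholders).
OLD_HOST = "www.example.com"
NEW_HOST = "www.example.org"

# Explicit page-to-page mapping for pages whose paths changed in the move.
PAGE_MAP = {
    "/old-about.html": "/about.html",
}

def redirect_target(path):
    """Return the URL on the new domain that an old path should 301 to."""
    # Paths that kept their name redirect to the same path on the new host.
    new_path = PAGE_MAP.get(path, path)
    return "http://%s%s" % (NEW_HOST, new_path)
```

The key property is that every old URL maps to a specific new URL, not to the home page, so users and search engines land on equivalent content.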
We'll admit it, moving is never easy, but these steps should help ensure that none of your good web reputation falls off the truck in the process.

Monday, April 14, 2008

Webmaster tips for creating accessible, crawlable sites


Raman and Hubbell at home
Hubbell and I enjoying the day at our home in California. Please feel free to view my earlier post about accessibility for webmasters, as well as additional articles I've written for the Official Google blog.

One of the most frequently asked questions about Accessible Search is: What can I do to make my site rank well on Accessible Search? At the same time, webmasters often ask a similar but broader question: What can I do to rank high on Google Search?

Well, I'm pleased to tell you that you can kill two birds with one stone: critical site features such as site navigation can be created to work for all users, including our own Googlebot. Below are a few tips for you to consider.

Ensure that all critical content is reachable

To access content, it needs to be reachable. Users and web crawlers reach content by navigating through hyperlinks, so as a critical first step, ensure that all content on your site is reachable via plain HTML hyperlinks, and avoid hiding critical portions of your site behind technologies such as JavaScript or Flash.

Plain hyperlinks are hyperlinks created with the HTML anchor element <a>. Next, ensure that the target of each hyperlink, i.e. the href attribute of the <a> element, is a real URL, rather than an empty hyperlink that defers the navigation behavior to an onclick handler.

In short, avoid hyperlinks of the form:
<a href="#" onclick="javascript:void(...)">Product Catalog</a>

Instead, prefer simpler links such as:
<a href="http://www.example.com/product-catalog.html">Product Catalog</a>

Ensure that content is readable

To be useful, content needs to be readable by everyone. Ensure that all important content on your site is present within the text of HTML documents. Content needs to be available without needing to evaluate scripts on a page. Content hidden behind Flash animations or text generated within the browser by executable JavaScript remains opaque to the Googlebot, as well as to most blind users.

Ensure that content is available in reading order

Having discovered and arrived at your readable content, a user needs to be able to follow the content you've put together in its logical reading order. If you are using a complex, multi-column layout for most of the content on your site, you might wish to step back and analyze how you are achieving the desired effect. For example, using deeply-nested HTML tables makes it difficult to link together related pieces of text in a logical manner.

The same effect can often be achieved using CSS and logically organized <div> elements in HTML. As an added bonus, you will find that your site renders much faster as a result.

Supplement all visual content--don't be afraid of redundancy!

Making information accessible to all does not mean that you need to 'dumb down' your site to simple text. Building redundancy into your content is critical to making it maximally useful to everyone. Here are a few simple tips:
  • Ensure that content communicated via images is available when those images are missing. This goes further than adding appropriate alt attributes to relevant images. Ensure that the text surrounding the image does an adequate job of setting the context for why the image is being used, as well as detailing the conclusions you expect a person seeing the image to draw. In short, if you want to make sure everyone knows it's a picture of a bridge, wrap that text around the image.

  • Add relevant summaries and captions to tables so that the reader can gain a high-level appreciation for the information being conveyed before delving into the details contained within.

  • Accompany visual animations such as data displays with a detailed textual summary.
Following these simple tips greatly increases the quality of your landing pages for everyone. As a positive side-effect, you'll most likely discover that your site gets better indexed!

Friday, April 11, 2008

Crawling through HTML forms



Google is constantly trying new ideas to improve our coverage of the web. We already do some pretty smart things like scanning JavaScript and Flash to discover links to new web pages, and today, we would like to talk about another new technology we've started experimenting with recently.

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values specified in the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
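As a rough illustration of the idea (a sketch, not Google's actual system), the snippet below enumerates a handful of candidate GET URLs from a form's possible input values:

```python
from itertools import product
from urllib.parse import urlencode

def candidate_urls(action, inputs, limit=10):
    """Enumerate a small number of GET URLs a form could produce.

    `inputs` maps each field name to candidate values to try, e.g.
    the <option> values of a select menu, or words drawn from the
    site's own text for a text box. `limit` keeps the fetch count
    small, as a polite crawler should.
    """
    names = sorted(inputs)
    urls = []
    for combo in product(*(inputs[n] for n in names)):
        if len(urls) >= limit:
            break
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
    return urls

# Hypothetical form with a select menu and a fixed sort field:
urls = candidate_urls("http://www.example.com/search",
                      {"category": ["books", "music"], "sort": ["date"]})
```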

Needless to say, this experiment follows good Internet citizenry practices. Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate.  Similarly, we only retrieve GET forms and avoid forms that require any kind of user information. For example, we omit any forms that have a password input or that use terms commonly associated with personal information such as logins, userids, contacts, etc. We are also mindful of the impact we can have on web sites and limit ourselves to a very small number of fetches for a given site.
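The robots.txt check described above can be illustrated with Python's standard-library parser; the rules and URLs here are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt forbidding a search form's result pages.
robots_txt = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A URL generated from the search form is disallowed, so a
# well-behaved crawler will not fetch it.
allowed = parser.can_fetch("Googlebot", "http://www.example.com/search?q=cake")
```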

The web pages we discover in our enhanced crawl do not come at the expense of regular web pages that are already part of the crawl, so this change doesn't reduce PageRank for your other pages. As such it should only increase the exposure of your site in Google. This change also does not affect the crawling, ranking, or selection of other web pages in any significant way.

This experiment is part of Google's broader effort to increase its coverage of the web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide webmasters and users alike with a better and more comprehensive search experience.

Monday, April 7, 2008

My site's been hacked - now what?

Written by Nathan Johns, Search Quality Team

All right, you got hacked. It happens to many webmasters, despite the hard work you devote to preventing this type of thing from happening. Prevention tips include keeping your site updated with the latest software and patches, creating an account with Google Webmaster Tools to see what's being indexed, keeping tabs on your log files to make sure nothing fishy's going on, etc. (There's more information in the Quick Security Checklist we posted last year.)

Remember that you're not alone—hacked sites are becoming increasingly common. Getting hacked can result in your site being infected with badware (more specifically malware, one type of badware). Take a look at StopBadware's recently released report on Trends in Badware 2007 for a comprehensive analysis of threats and trends over the previous year. Check out this post on the Google Online Security Blog which highlights the increasing number of search results containing a URL labeled as harmful. For even more in-depth technical reports on the analysis of web-based malware, see The Ghost in the Browser (pdf) and this technical report (pdf) on drive-by downloads. Read these, and you'll have a much better understanding of the scope of the problem. They also include some real examples for different types of malware.

The first step in any case should be to contact your hosting provider, if you have one. Often they can handle most of the technical heavy lifting for you. Lots of webmasters use shared hosting, which can make it difficult to do some of the things listed below. Tips labeled with an asterisk (*) are cases in which webmasters on shared hosting will most likely need assistance from their hosting provider. If you do have full control over your server, we recommend covering these four bases:

Getting your site off-line
  • Take your site off-line temporarily, at least until you know you've fixed things.*
  • If you can't take it off-line, return a 503 status code to prevent it from being crawled.
  • In Webmaster Tools, use the URL removal tool to remove any hacked pages or URLs that may have been added to search results. This prevents the hacked pages from being served to users.
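As an illustration of the 503 approach, here is a minimal, framework-agnostic WSGI handler (a sketch, not an official recommendation) that answers every request with 503 Service Unavailable and a Retry-After header, so crawlers know to come back later rather than index a half-fixed site:

```python
def maintenance_app(environ, start_response):
    """WSGI app that takes the whole site offline with a 503 response."""
    body = b"Site temporarily offline for maintenance."
    start_response("503 Service Unavailable", [
        ("Content-Type", "text/plain"),
        ("Retry-After", "86400"),  # suggest retrying in one day
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Any WSGI-capable server can mount this in place of the normal application while cleanup is under way.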

Damage Assessment
  • It's a good idea to figure out exactly what the hacker was after.
    • Were they looking for sensitive information?
    • Did they want to gain control of your site for other purposes?
  • Look for any modified or uploaded files on your web server.
  • Check your server logs for any suspicious activity, such as failed login attempts, command history (especially as root), unknown user accounts, etc.
  • Determine the scope of the problem—do you have other sites that may be affected?

Recovery
  • The absolute best thing to do here is a complete reinstall of the OS from a trusted source. It's the only way to be completely sure you've removed everything the hacker may have done.*
  • After a fresh re-installation, use the latest backup you have to restore your site. Don't forget to make sure the backup is clean and free of hacked content too.*
  • Patch any software packages to the latest version. This includes things such as weblog platforms, content management systems, or any other type of third-party software installed.
  • Change your passwords - https://www.google.com/accounts/PasswordHelp

Restoring your online presence
  • Get your system back online.
  • If you're a Webmaster Tools user, sign in to your account.
    • If your site was flagged as having malware, request a review to determine whether your site is clean.
    • If you used the URL removal tool on URLs which you do want in the index, request that Webmaster Tools re-include your content by revoking the removal.
  • Keep an eye on things, as the hacker may try to return.

Answers to other questions you may be asking:

Q: Is it better to take my site off-line or use robots.txt to prevent it from being crawled?
A: Taking it off-line is a better way to go; this prevents any malware or badware from being served to users, and prevents hackers from further abusing the system.

Q: Once I've fixed my site, what's the fastest way to get re-crawled?
A: The best way, regardless of whether or not your site got hacked, is to follow the Webmaster Help Center guidelines.

Q: I've cleaned it up, but will Google penalize me if the hacker linked to any bad neighborhoods?
A: We'll try not to. We're pretty good at making sure good sites don't get penalized by actions of hackers and spammers. To be safe, completely remove any links the hackers may have added.

Q: What if this happened on my home machine?
A: All of the above still applies. You'll want to take extra care to clean it up; if you don't, it's likely the same thing will happen again. A complete re-install of the OS is ideal.

Feel free to leave additional tips you have in the comments.

Wednesday, April 2, 2008

Improvements to iGoogle Gadgets for Webmaster Tools

by John Blackburn and Trevor Foucher, Webmaster Tools team

Update: The described feature is no longer available.


We launched Webmaster Tools iGoogle Gadgets last month with excitement, and some curiosity. Would you find them useful? We thought you might appreciate the ability to create better "dashboards" for your sites, but there's no better way to find out than to release them so you can use them.

After our initial release, we saw clear interest in the gadgets, and plenty of suggestions for improvement. So we've spent the past several weeks working on various areas. The biggest improvements are probably for those of you with more than one site: when you add a new tab of gadgets, your gadgets will now default to the site you were viewing when you added them to your iGoogle page. Additionally, gadgets now retain settings as a group, so if you change the site for any gadget in a group, the next time you refresh that page, all the gadgets will show data for that site. And gadgets now resize dynamically, so they take up less room.

Other cosmetic and usability improvements will benefit everyone. We shortened the tab's title to "Webmaster Tools" to save space on your iGoogle page, and added a Google logo/watermark to each gadget to help distinguish them as a group. We think the gadgets look a lot better, with alternating background colors for table rows to make them easier to differentiate, and improved layout in general. The Top Search Queries gadget now shows each query's position, too.


Some of you did tell us you want a 4-column layout, but we feel that the information we display—including some fairly wide URLs—is better suited to 3 columns. If you do want 4 columns, remember that you can choose "edit this tab" yourself to select an alternate layout. Update: The 4-column layout is no longer available in iGoogle.

We really appreciate your feedback on these early gadgets, and hope these improvements make them even more useful.  

As always, please let us know what you think.