Monday, June 30, 2008

Improved Flash indexing

We've received numerous requests to improve our indexing of Adobe Flash files. Today, Ron Adler and Janis Stipins—software engineers on our indexing team—will provide us with more in-depth information about our recent announcement that we've greatly improved our ability to index Flash.

Q: Which Flash files can Google better index now?
We've improved our ability to index textual content in SWF files of all kinds. This includes Flash "gadgets" such as buttons or menus, self-contained Flash websites, and everything in between.

Q: What content can Google better index from these Flash files?
All of the text that users can see as they interact with your Flash file. If your website contains Flash, the textual content in your Flash files can be used when Google generates a snippet for your website. Also, the words that appear in your Flash files can be used to match query terms in Google searches.

In addition to finding and indexing the textual content in Flash files, we're also discovering URLs that appear in Flash files, and feeding them into our crawling pipeline—just like we do with URLs that appear in non-Flash webpages. For example, if your Flash application contains links to pages inside your website, Google may now be better able to discover and crawl more of your website.
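As a rough illustration of that idea (not Google's actual pipeline, whose details are not public), the sketch below pulls URLs out of text that has already been extracted from a Flash file and adds unseen ones to a crawl queue. The function name `enqueue_discovered_urls`, the regular expression, and the sample URLs are all assumptions made for this example.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical pattern: absolute http(s) URLs, or site-relative paths ending in .html.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+|/[\w\-/]+\.html")


def enqueue_discovered_urls(extracted_text, base_url, queue, seen):
    """Add each URL found in extracted_text to the crawl queue exactly once."""
    for match in URL_PATTERN.findall(extracted_text):
        absolute = urljoin(base_url, match)  # resolve relative paths against the SWF's URL
        if absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
    return queue


crawl_queue = enqueue_discovered_urls(
    "Visit http://www.example.com/products.html or open /contact.html today",
    "http://www.example.com/flash/intro.swf",
    deque(),
    set(),
)
print(list(crawl_queue))
```

The `seen` set is what keeps a URL that appears in several Flash files from being queued more than once.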

Q: What about non-textual content, such as images?
At present, we are only discovering and indexing textual content in Flash files. If your Flash files only include images, we will not recognize or index any text that may appear in those images. Similarly, we do not generate any anchor text for Flash buttons which target some URL, but which have no associated text.

Also note that we do not index FLV files, such as the videos that play on YouTube, because these files contain no text elements.

Q: How does Google "see" the contents of a Flash file?
We've developed an algorithm that explores Flash files in the same way that a person would, by clicking buttons, entering input, and so on. Our algorithm remembers all of the text that it encounters along the way, and that content is then available to be indexed. We can't tell you all of the proprietary details, but we can tell you that the algorithm's effectiveness was improved by utilizing Adobe's new Searchable SWF library.

Q: What do I need to do to get Google to index the text in my Flash files?
Basically, you don't need to do anything. The improvements that we have made do not require any special action on the part of web designers or webmasters. If you have Flash content on your website, we will automatically begin to index it, up to the limits of our current technical ability (see next question).

That said, you should be aware that Google is now able to see the text that appears to visitors of your website. If you prefer Google to ignore your less informative content, such as a "copyright" or "loading" message, consider replacing the text with an image, which will make it effectively invisible to us.

Q: What are the current technical limitations of Google's ability to index Flash?
There are three main limitations at present, and we are already working on resolving them:

1. Googlebot does not execute some types of JavaScript. So if your web page loads a Flash file via JavaScript, Google may not be aware of that Flash file, in which case it will not be indexed.
2. We currently do not attach content from external resources that are loaded by your Flash files. If your Flash file loads an HTML file, an XML file, another SWF file, etc., Google will separately index that resource, but it will not yet be considered to be part of the content in your Flash file.
3. While we are able to index Flash in almost all of the languages found on the web, currently there are difficulties with Flash content written in bidirectional languages. Until this is fixed, we will be unable to index Hebrew language or Arabic language content from Flash files.

We're already making progress on these issues, so stay tuned!
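To relate the first limitation to a concrete page, the sketch below checks whether an HTML document references a SWF file in static markup (`<object>`, `<embed>`, or a `movie` `<param>`) rather than only through script. It is a simple self-check under stated assumptions, not a description of how Googlebot works; the class name `SwfReferenceFinder`, the helper `static_swf_references`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class SwfReferenceFinder(HTMLParser):
    """Collect .swf references that appear in static HTML markup."""

    def __init__(self):
        super().__init__()
        self.swf_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        candidate = None
        if tag == "embed":
            candidate = attributes.get("src")
        elif tag == "object":
            candidate = attributes.get("data")
        elif tag == "param" and (attributes.get("name") or "").lower() == "movie":
            candidate = attributes.get("value")
        # Ignore any query string when checking the file extension.
        if candidate and candidate.lower().split("?")[0].endswith(".swf"):
            if candidate not in self.swf_urls:
                self.swf_urls.append(candidate)


def static_swf_references(html):
    """Return the SWF URLs that a parser can see without running any script."""
    finder = SwfReferenceFinder()
    finder.feed(html)
    return finder.swf_urls


static_page = '<object data="intro.swf"><param name="movie" value="intro.swf"></object>'
script_page = (
    '<div id="flash"></div>'
    '<script>swfobject.embedSWF("intro.swf", "flash", "300", "120", "9.0.0");</script>'
)
print(static_swf_references(static_page))
print(static_swf_references(script_page))
```

An empty result for a page that does show Flash to visitors suggests the file is loaded only through JavaScript.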

Update in July 2008: Everyone, thanks for your great questions and feedback. Our focus is to improve search quality for all users, and with better Flash indexing we create more meaningful search results. Listed below, we’ve also answered some of the most prevalent questions. Thanks again!

Flash site in search results before improvements

Flash site after improved indexing, querying [nasa deep impact animation]

Helping us access and index your Flash files
@fintan: We verified with Adobe that the textual content from legacy sites, such as those scripted with AS1 and AS2, can be indexed by our new algorithm.

@andrew, jonny m, erichazann, mike, ledge, stu, rex, blog, dis: For our July 1st launch, we didn't enable Flash indexing for Flash files embedded via SWFObject. We're now rolling out an update that enables support for common JavaScript techniques for embedding Flash, including SWFObject and SWFObject2.

@mike: At this time, content loaded dynamically from resource files is not indexed. We’ve noted this feature request from several webmasters -- look for this in a near future update.

Update on July 29, 2010: Please note that our ability to load external resources is live.

Interaction of HTML pages and Flash
@captain cuisine: The text found in Flash files is treated similarly to text found in other files, such as HTML, PDFs, etc. If the Flash file is embedded in HTML (as many of the Flash files we find are), its content is associated with the parent URL and indexed as a single entity.

@jeroen: Serving the same content in Flash and an alternate HTML version could cause us to find duplicate content. This won't cause a penalty -- we don’t lower a site in ranking because of duplicate content. Be aware, though, that search results will most likely only show one version, not both.

@All: We’re trying to serve users the most relevant results possible regardless of the file type. This means that standalone Flash, HTML with embedded Flash, HTML only, PDFs, etc., can all have the potential to be returned in search results.

Indexing large Flash files
@dsfdgsg: We’ve heard requests for deep linking (linking to specific content inside a file) not just for Flash results, but also for other large documents and presentations. In the case of Flash, the ability to deep link will require additional functionality in Flash with which we integrate.

@All: The majority of the existing Flash files on the web are fine in regard to filesize. It shouldn’t be too much of a concern.

More details about our Flash indexing algorithm
@brian, marcos, bharath: Regarding ActionScript, we’re able to find new links loaded through ActionScript. We explore Flash like a website visitor does; we do not decompile the SWF file. Unless you're making ActionScript visible to users, Google will not expose ActionScript code.

@dlocks: We respect rel="nofollow" wherever we encounter it in HTML.
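For readers who want to see which links on an HTML page carry that attribute, here is a minimal sketch that sorts a page's links by whether `rel` contains `nofollow`. It only illustrates the attribute check; the class name `LinkSorter` and the sample markup are assumptions for this example.

```python
from html.parser import HTMLParser


class LinkSorter(HTMLParser):
    """Split <a href> links into those with and without rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens, e.g. rel="external nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


sorter = LinkSorter()
sorter.feed(
    '<a href="/about.html">About</a> '
    '<a href="http://ads.example.com/" rel="external nofollow">Sponsor</a>'
)
print(sorter.followed, sorter.nofollowed)
```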

What are your SEO recommendations?

You may have noticed that we recently rewrote our article on What is an SEO? Does Google recommend them? Previously, the article had focused on warning people about common SEO scams to look out for, but didn't mention many of the valuable services that a helpful SEO can provide.

The article now notes some of the benefits of search engine optimization, and provides some guidance to site owners who are considering hiring an SEO. We'd also like to get your perspective: how would you define SEO? What questions would you ask a prospective SEO? What advice would you give to an inexperienced webmaster who's considering whether to contract an SEO? We'd like to hear your thoughts and incorporate your feedback if there's important advice that we should add.

Wednesday, June 25, 2008

Free online seminar: The Google Trifecta

Get the audio and Q&As from our recent live chat

Last Thursday, many of you just couldn't get enough of us and joined our second live Webmaster Central chat, "JuneTune." It was an action-packed session with live presentations, questions and answers and chatting about cats and other important topics. Over the course of an hour and a half, we made four presentations, received over 600 questions and passed around close to 500 chat messages. It was great to see so many Googlers around the world involved: Adam, Bergy, Evan, Jessica, Maile, Matt (Cutts), Matt (Dougherty), Reid and Wysz in Mountain View; Jonathan and Susan in Kirkland; Alvar, Mariya, Pedro and Uli in Dublin; and me in Zürich. We had users from about as many places as Matt (Harding) has danced in: Alaska, Argentina, Arizona, Australia, Brazil, California, Canada, Chile, Colorado, Costa Rica, Denmark, Egypt, Florida, France, Germany, Greece, Hawaii, India, Indonesia, Ireland, Israel, Italy, Malaysia, Mexico, Minnesota, Missouri, Nebraska, New York, New Zealand, Ohio, Pennsylvania, Peru, Philippines, Poland, Portugal, Saudi Arabia, South Africa, Spain, Switzerland, The Netherlands, Russia, Ukraine, United Kingdom, Vietnam and a bunch from Seattle, Washington - thank you all for joining us!

To help make the most out of this session, we'd like to make the transcripts and presentations available to everyone. We're also working on filling in the blanks and have started to answer many of the unanswered questions online in the Webmaster Help Group. You'll find the full (and just slightly cleaned up) questions and answers there as well.

I presented an overview of factors involved in personalized search at Google:

Maile gave a nice presentation of case sensitivity in web search in general:

The audio part of these presentations is in the audio transcript below. It also includes Jonathan's coverage of reasons why ranking may change, Wysz's presentation of ways to get URLs removed from our index, as well as everything else that happened on the phone! Enjoy :)

Audio transcript (MP3)

We hope to continue to improve on making these events useful to you, so don't forget to send us your feedback. We'll be back!

Tuesday, June 24, 2008

One year of monitored European Webmaster Help Groups

A year ago Google Guides from the Search Quality Team started monitoring Google Webmaster Help Groups in European languages. It has been an interesting time and a great, rewarding experience for all those involved :o). We have enjoyed building and growing our communities with webmasters across Europe and we hope you have as well. The feedback we have been receiving since the beginning has been encouraging, with many webmasters responding positively to the opportunity to be able to solve their indexing issues in their native language. Still, we are aware that we have a lot of work ahead of us. I would like to thank everyone who has been active in our communities and invite those who have not joined yet to do so. Sign up, participate and help us make a difference.

Taking advantage of this anniversary, I would like the European Google Guides to have a chance to say a few words about their communities.

Polish Webmaster Help Group
Guglarz, the Polish Google Guide
Our group grew very fast, especially in the last few months, and we are well on the way to hitting 600 users! But what I appreciate even more than our group's success is the open, friendly spirit we have been cultivating in the last twelve months :o). I feel like this is a community that is able to help with indexing issues and beyond!

I already pointed out some exceptional community members in the last blog post about communication with webmasters in a dozen languages. Since then, a few more have joined the list of tremendously committed, group-driving contributors. Bigu has been jumping in every time I've been traveling or very busy, which is the reason why I’d like to thank him a lot.

Also I want to acknowledge Maciej Gluszek for escalating language issues we used to struggle with. You guys help to improve the Google experience for everyone.

Lastly, I want to invite all group members to the chit chat section in order to introduce yourself and share some more information about who you are and what you do. I am really curious to find out more about the people I work with! I hope to see you in the group soon :o).

Группа помощи Google для веб-мастеров
Oxana, Tilek, Vitali, Mariya
Our Russian Webmaster Help group is growing by the day and just reached an important milestone - a whole kilobyte of members: 1024 (-:! Thanks to all our users for the great questions, the wisdom, and the humour which they bring to our community. Special thanks go to Web-Master, Crazy LionHeart, Andrey Morozov and Lmd for their spot-on comments and prolific posts - we really appreciate your dedication in helping others. We're very happy to be part of such a savvy and lively community, we look forward to more interesting questions and discussions in the future, and to reaching one megabyte of members (-:! С днем рождения!

Google Webmester Súgó Csoport
Tibor, János
Although the Hungarian Webmaster Help Group is one of our youngest, it already counts 300+ members, and has a pretty high number of lively discussions. The atmosphere is really great and loads of issues are resolved thanks largely to our dedicated and wise users. I would especially like to thank snomag for helping greatly in building the group up in the beginning (where have you disappeared to? We miss you! :)

Google Site Yöneticisi Yardım Grubu
The Turkish Webmaster Help Group has grown into a dynamic and effective community of fantastic webmasters in just four months. I want to thank everyone in the group who has been eager to share his/her knowledge to help others. Special thanks to Erkan (man_blood) who has written a huge number of posts in the last three months; Merve who has helped the group with her technical skills as well as calm spirit, especially when I was on vacation; Salih (SesVer) who used to help a lot when the group was first launched; and many more who are not mentioned by name here. It has been exciting and a pleasure to be a part of this rapidly growing family! Desteğiniz için teşekkürler :-)

קבוצת העזרה למנהלי ומנהלות אתרים
Alon, your Google Guide
Our Hebrew group is maturing. To date we have more than 200 users and we are gaining visibility while more users are using the well-informed discussions we produce.

It is a pleasure to see so many webmasters stopping by to ask, learn and help others. I would like to take the opportunity to thank all of you who make the group a place for discussion and collaboration. My special thanks go to Shoshanna, AliaG, Tomer and Seosos, for their great contribution.

By browsing through the “tell us about yourself” thread, where many members introduce themselves, you can learn more about our community. You're also welcome to watch my short video giving some specific tips for Israeli Webmasters.

Google diskussionsgruppe for webmasters
GoogleGuide (Jonas)
The Danish group has been slowly gaining more subscribers since its launch last year, and we still hope to attract more people. We've grown almost 100% in the past year, and we hope to keep up the steam! (: This growth wouldn't have been possible without the help of Anders, whose contributions of webmaster goodness range from tips on 301s and moving content to a new domain (which has been the hottest and most debated topic in the group) to advice on getting your site better indexed and crawled. So a big thanks to Anders, and all the others who have made this group useful for everybody.

Ajuda a Webmasters do Google
The Portuguese Webmaster Help Group is now above 1200 members. The success of the group is a reflection of the great community built by every single contributor. Special thanks to Carlos Lavieri, Rodrigo Soares, Leandro Leite and João Carlos (Orquiza), who used to help so much, and recently Flávio Raimundo (M&S) and Ruben Zevallos Jr.

Recently I had the chance of giving a few public interviews to Portuguese speaking webmasters and recognizing the group and its success publicly. I'd also like to take the opportunity to thank my interviewers for their great questions; if you speak Portuguese or you are a Google Translator fan you can read them: Entrevista com Pedro Dias (by Paulo Teixeira) and Entrevista a Pedro Dias (by Kazulo Webmarketing). It is a great pleasure to be a part of such an awesome community.

Googlen Keskusteluryhmä Verkkovastaaville
The Finnish Webmaster Help Group will be celebrating its first birthday soon. During the first year, we've built a small but dynamic community and covered various topics from general webmaster issues to language-related toughies. A huge thank you to everyone who has taken part in discussions! I'm looking forward to starting our second year as a group, and I hope to see both familiar faces and newcomers join our community!

Groupe Google d'entraide pour les webmasters
Alec, Nathalie

The French Webmaster Help Group was launched one year ago and we've had tremendous growth thanks to all our fantastic members. Special thanks to Paul-Arnaud and Cyrille who used to help so much and to Thierry who is now our bionic poster :-)

Recently I had the chance to discuss the Webmaster Help Group with a French SEO I met at SES Paris '08. It was so great to connect again on the Internet and help him with his issues.

Swedish Webmaster Help Group
Hessam, the Swedish Google Guide
The small Swedish webmaster forum, with currently almost 200 users, has almost tripled its size since its launch last July. During the past year we have seen a steady growth in user participation and a number of questions from our Swedish webmasters. I also had the chance to attend Search Marketing Expo (SMX) Stockholm 2007 which was definitely the highlight of the year. There I was able to meet a number of Swedish webmasters to talk about Google, our Webmaster Tools as well as the cold Swedish winter. During the coming year we are determined to further grow the community to better support our users. Lastly, a big thanks goes to everyone who has supported the group, especially Mattias Nordin and Widar Nord who have contributed with their knowledge and time. I am looking forward to hearing more from our regulars and would like to extend an invite to new members.

Google Diskussionsforum für Webmaster
With more than 2700 members and almost a thousand posts per month, the German Webmaster Help Group has developed into a vibrant community. You will find many very savvy webmasters there who are always prepared to discuss all webmaster-related issues and offer advice to both beginners and advanced webmasters.

I would like to take this opportunity to thank everyone who has contributed to this forum. Some webmasters like luzie, alpaka, beejees, Garzosch have been a crucial part of our group since this group began, and with the likes of Zombie joining in October, Mikki in January and Sistrix in March after having the pleasure of meeting him at SMX in Munich, we have an ever-growing pool of wisdom :-)

Foro de Google para Webmasters
Rebecca, Alvar
The Spanish webmaster group now has almost 2500 members. We have some wonderful webmasters that offer great advice and support, people like Ricardo Antonio, Enrique de Mesa, Jesús Cáceres (chupi), Jose Antonio Nobile (nobilesoft), Carles Pastor, and many others who help make this forum work as well as it does. Recently we had one member visit whose site had been hacked with a cloaking technique and the community was able to help point out how to detect the hidden links and clear the hacked content. It’s great to see his website clean again and up and running.

If you want to know who is behind this forum, have a look at the video where we introduce the forum and Webmaster Tools. Additionally, please give any feedback you may have in the thread where we present the video. By the way, it was great meeting some of you guys face to face at the SMX Madrid conference.

Google Discussiegroep voor Webmaster
Andre, Jos
Our Dutch Webmaster Help Group now has well over 700 members. Some of our regular posters, like Marketsharer, Janus and Joop are doing a great job helping a lot of our less experienced (as well as more experienced!) webmasters with their questions. Recently, some Dutch webmasters have been dealing with the fact that their sites have been hacked. We've added a sticky post to the group to help webmasters clean up their sites after getting hacked, so if you are dealing with this issue, do have a look at the group! At this moment, the group is still steadily growing and we are really looking forward to reaching the 1000 members mark!

Google Assistenza Webmaster
Lella, Alessandro, Stefano
The Italian webmaster group is about to turn one year old and has been a great success. We have a bunch of fabulous witty power posters like Angelo Palma (Angelo), Marco (Customsoft) and others that keep on massively contributing to the success of the group day by day. Over the last year we achieved excellent results both in terms of traffic increase and new users posting frequently. Coordinating the localization of the post on paid links - later translated into 14 languages - was a turning point for the Webmaster Help team. So was Lella's blog post in English, German and Italian about the Search Engine Strategies Conference in London, where she represented Google as a speaker.

The European Google Guides Team
Webmaster Help Group Guides and Domo-Kun, the team mascot

Speaking on behalf of all our Google Guides, we are looking forward to continued collaboration with webmasters all over the world. Surely the upcoming 12 months are going to be exciting for the Google Help Groups as the communities continue to grow. Thanks again to all the great webmasters out there and see you soon in the discussions!

Friday, June 20, 2008

Get Cooking with the Webmaster Tools API

As the days grow longer and summer takes full stage, many of us are flocking to patios and parks to engage in the time-honored tradition of grilling food. When it comes to cooking outdoors, the types of grills used span a spectrum from primitive to high tech. For some people a small campfire is all that's required for the perfect outdoor dining experience. For other people the preferred tool for outdoor cooking is a quad grill gas-powered stainless steel cooker with enough features to make an Iron Chef rust with envy.

An interesting off-shoot of outdoor cooking techniques is solar cooking, which combines primitive skills and modern ingenuity. At its most basic, solar cooking involves creating an "oven" that is placed in the sun and passively cooks the food it contains. It is simple to get started with solar cooking because a solar oven is something people can make themselves with inexpensive materials and a bit of effort. The appeal of simplicity, inexpensiveness and the ability to "do it yourself" has created a growing group of people who are making solar ovens themselves.
This relates to webmasters because the webmaster community is also made up of a diverse group of people who use a variety of tools in a myriad of ways. Just as the outdoor cooking community has a contingent of people creating their own solar ovens, the webmaster community has a subgroup of people creating and sharing their own tools. From our discussions with webmasters, we've consistently heard requests to open Webmaster Tools for third-party integration. The Webmaster Tools team has taken this request to heart and I'm happy to announce that we're now releasing an API for Webmaster Tools. The supported features in the first version of the Webmaster Tools API are the following:
  • Managing Sites
    • Retrieve a list of your sites in Webmaster Tools
    • Add your sites to Webmaster Tools
    • Verify your sites in Webmaster Tools
    • Remove your sites from Webmaster Tools
  • Working with Sitemaps
    • Retrieve a list of your submitted Sitemaps
    • Add Sitemaps to Webmaster Tools
    • Remove Sitemaps from Webmaster Tools

Although the current API offers a limited subset of all the functionality that Webmaster Tools provides, this is only the beginning. Get started with the Developer's Guide for the Webmaster Tools Data API to begin working with the API.
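To make the shape of those seven operations concrete, here is a toy in-memory model of them. It is not the Webmaster Tools Data API, makes no network calls, and says nothing about the real request formats, which are covered in the Developer's Guide; the class name `InMemoryWebmasterAccount` and every method name are hypothetical, chosen only to mirror the feature list above.

```python
class InMemoryWebmasterAccount:
    """Toy stand-in mirroring the listed operations; NOT the real API."""

    def __init__(self):
        # site URL -> {"verified": bool, "sitemaps": [sitemap URLs]}
        self._sites = {}

    # --- Managing sites ---
    def list_sites(self):
        return sorted(self._sites)

    def add_site(self, site_url):
        self._sites.setdefault(site_url, {"verified": False, "sitemaps": []})

    def verify_site(self, site_url):
        self._sites[site_url]["verified"] = True

    def is_verified(self, site_url):
        return self._sites[site_url]["verified"]

    def remove_site(self, site_url):
        self._sites.pop(site_url, None)

    # --- Working with Sitemaps ---
    def list_sitemaps(self, site_url):
        return list(self._sites[site_url]["sitemaps"])

    def add_sitemap(self, site_url, sitemap_url):
        sitemaps = self._sites[site_url]["sitemaps"]
        if sitemap_url not in sitemaps:
            sitemaps.append(sitemap_url)

    def remove_sitemap(self, site_url, sitemap_url):
        self._sites[site_url]["sitemaps"].remove(sitemap_url)


account = InMemoryWebmasterAccount()
account.add_site("http://www.example.com/")
account.verify_site("http://www.example.com/")
account.add_sitemap("http://www.example.com/", "http://www.example.com/sitemap.xml")
print(account.list_sites(), account.list_sitemaps("http://www.example.com/"))
```

A custom tool built on the real API would replace each method body with the corresponding authenticated request.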

Webmasters... fire up your custom tools and get cooking!

A new layer to Google Trends

Two years ago, we launched Google Trends, a tool that lets anyone see what the world is searching for, and compare the world's interest in your favorite topics. Last year, we added Hot Trends, which shows what people are searching for right now - the fastest rising search queries on Google, updated every hour. And just last week, we introduced normalized search volume numbers available for export in Google Trends.

Today, we add a new layer to Trends with Google Trends for Websites, a fun tool that gives you a view of how popular your favorite websites are, including your own! It also compares and ranks site visitation across geographies, and related websites and searches.

Let's take a look at one example, the release of Radiohead's In Rainbows album. As part of our annual Zeitgeist, we post the fastest rising search terms, and this past year, radiohead took the crown as the fastest rising search term in the last quarter of 2007.

Using Google Trends, we can see the search volume for radiohead compared to in rainbows over the last 12 months.

radiohead vs. in rainbows

With Hot Trends, we can see that on October 10th (the release date of In Rainbows), people were most interested in downloading In Rainbows and reviews of the new release.

Now, using Trends for Websites, this story can be viewed from another perspective: we can see how the number of unique visitors to the band's website and to the album's website has changed over the last 12 months, the countries where the sites are most popular, and the top related sites and search terms.

We can see that in October 2007, the band's website (the blue line) saw a huge surge in popularity. The release of In Rainbows clearly drove many people to visit it. The album's website (the red line) saw an even more dramatic jump, probably because that is where you could actually download the new album. And it looks like overall traffic to one of the sites has increased, even as traffic to the other declined.

Keep in mind that Trends for Websites is a Google Labs product and that we are experimenting with ways to improve the quality of the data. Because data is estimated and aggregated over a variety of sources, it may not match the other data sources you rely on for web traffic information. For more information, be sure to check out our Website Owners FAQs.

To start using Trends for Websites, head over to Google Trends.

Monday, June 16, 2008

Join us for another live chat - June 19, 2008

Written by Adam Lasnik, Search Evangelist

When it comes to talking with webmasters, we just can't get enough.

This past March we had the pleasure of connecting with hundreds of you online in our first-ever Webmaster Help Group online live chat, which included a presentation on Images in Google Search by Maile, lots of feedback from you on webmaster issues, a site clinic, and dozens of questions answered by folks from our Google Webmaster Central team.

Given the success of this previous chat, we've decided to do it again.  We're hosting another free live chat (dubbed JuneTune), and we'd love to have you attend!

Here's what you'll need:

What will our JuneTune chat include?  

  • INTRO:  A quick hello from some of your favorite Help Group Guides
  • PRESO:  A presentation on Personalization in Google Search by our own John Mueller.
  • FAQs:  We're calling this "Three for Three," and we'll have three different Googlers tackling three different issues we've seen come up in the Group recently.  What will they be?  You'll just have to attend the chat to find out!
  • And lots of Q&A!  You'll have a chance to type questions during the entire session, and we'll pick as many as we can to answer in writing and in speaking during the chat.

When and how can you join in?

  • Mark the date on your calendar now: Thursday, June 19, for about one hour starting at 2:00pm PDT / 5:00pm EDT / 21:00 UTC / 23:00 CEST
  • Register right now for this event. Please note that you'll need to click on the "register" link on the lefthand side.
  • Using the link e-mailed to you by WebEx (the service hosting the event), log in 5-10 minutes prior to 2pm PDT.

We hope you can stop by, and look forward to chatting with you!  In the meantime, if you have any questions, feel free to post a note in this Groups thread.

Monday, June 9, 2008

Duplicate content due to scrapers

Since duplicate content is a hot topic among webmasters, we thought it might be a good time to address common questions we get asked regularly at conferences and on the Google Webmaster Help Group.

Before diving in, I'd like to briefly touch on a concern webmasters often voice: in most cases a webmaster has no influence on third parties that scrape and redistribute content without the webmaster's consent. We realize that this is not the fault of the affected webmaster, which in turn means that identical content showing up on several sites in itself is not inherently regarded as a violation of our webmaster guidelines. This simply leads to further processes with the intent of determining the original source of the content—something Google is quite good at, as in most cases the original content can be correctly identified, resulting in no negative effects for the site that originated the content.

Generally, we can differentiate between two major scenarios for issues related to duplicate content:
  • Within-your-domain-duplicate-content, i.e. identical content which (often unintentionally) appears in more than one place on your site

  • Cross-domain-duplicate-content, i.e. identical content of your site which appears (again, often unintentionally) on different external sites
With the first scenario, you can take matters into your own hands to avoid Google indexing duplicate content on your site. Check out Adam Lasnik's post Deftly dealing with duplicate content and Vanessa Fox's Duplicate content summit at SMX Advanced, both of which give you some great tips on how to resolve duplicate content issues within your site. Here's one additional tip to help avoid content on your site being crawled as duplicate: include the preferred version of your URLs in your Sitemap file. When we encounter different pages with the same content, this may raise the likelihood of us serving the version you prefer. Some additional information on duplicate content can also be found in our comprehensive Help Center article discussing this topic.
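The Sitemap tip can be sketched as follows: given the URLs you consider the preferred versions, write each one once into a Sitemap file. This is a minimal sketch using the public Sitemap protocol namespace; the function name `build_sitemap` and the example URLs are assumptions, and deciding which version of a page is the preferred one remains up to you.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(preferred_urls):
    """Return Sitemap XML that lists each preferred URL exactly once."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in dict.fromkeys(preferred_urls):  # drop repeats, keep order
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    "http://www.example.com/widgets",
    "http://www.example.com/about",
    "http://www.example.com/widgets",  # repeated entry is written only once
])
print(sitemap_xml)
```

Listing only one URL per piece of content, rather than every variant that happens to resolve, is what makes the file a statement of preference.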

In the second scenario, you might have the case of someone scraping your content to put it on a different site, often to try to monetize it. It's also common for many web proxies to index parts of sites which have been accessed through the proxy. When encountering such duplicate content on different sites, we look at various signals to determine which site is the original one, which usually works very well. This also means that you shouldn't be very concerned about seeing negative effects on your site's presence on Google if you notice someone scraping your content.

In cases when you are syndicating your content but also want to make sure your site is identified as the original source, it's useful to ask your syndication partners to include a link back to your original content. You can find some additional tips on dealing with syndicated content in a recent post by Vanessa Fox, Ranking as the original source for content you syndicate.

Some webmasters have asked what could cause scraped content to rank higher than the original source. That should be a rare case, but if you do find yourself in this situation:
  • Check if your content is still accessible to our crawlers. You might unintentionally have blocked access to parts of your content in your robots.txt file.

  • You can look in your Sitemap file to see if you made changes for the particular content which has been scraped.

  • Check if your site is in line with our webmaster guidelines.
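The first check in the list above, whether robots.txt unintentionally blocks your content, can be approximated locally. The sketch below uses Python's standard `urllib.robotparser` as a stand-in for a crawler; it is not Googlebot's own parser, and it is assumed here to do plain prefix matching rather than interpret wildcard extensions. The `is_crawlable` helper, the rules, and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks a drafts directory for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /drafts/
"""

blocked = is_crawlable(robots_txt, "https://www.example.com/drafts/post.html")
allowed = is_crawlable(robots_txt, "https://www.example.com/articles/post.html")
print(blocked, allowed)
```

If a page you expect to rank comes back as not crawlable, the Disallow rule that matches it is the first thing to review.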
To conclude, I'd like to point out that in the majority of cases, having duplicate content does not have negative effects on your site's presence in the Google index. It simply gets filtered out. If you check out some of the tips mentioned in the resources above, you'll learn how to gain greater control over exactly what we're crawling and indexing and which versions are more likely to appear in the index. Only when there are signals pointing to deliberate and malicious intent might occurrences of duplicate content be considered a violation of the webmaster guidelines.

If you would like to further discuss this topic, feel free to visit our Webmaster Help Group.

For the German version of this post, go to "Duplicate Content aufgrund von Scraper-Sites".

Tuesday, June 3, 2008

The Impact of User Feedback, Part 1

About a year ago, in response to user feedback, we created a paid links reporting form within Webmaster Tools. User feedback, through reporting paid links, webspam, or suggestions in our Webmaster Help Group, has been invaluable in ensuring that the quality of our index and our tools is as high as possible. Today, I'd like to highlight the impact that reporting paid links and webspam has had on our index. In a future post, I'll showcase how user feedback and concerns in the Webmaster Help Group have helped us improve our Help Center documentation and Webmaster Tools.

Reporting Paid Links

As mentioned in the post Information about buying and selling links that pass PageRank, Google reserves the right to take action on sites that buy or sell links that pass PageRank for the purpose of manipulating search engine rankings. Even though we work hard to discount these links through algorithmic detection, if you see a site that is buying or selling links that pass PageRank, please let us know. Over the last year, users have submitted thousands and thousands of paid link reports to Google, and each report can contain multiple websites that are suspected of selling links. These reports are actively reviewed, and the feedback is invaluable in improving our search algorithms. We are also willing to take manual action on a significant fraction of paid link reports as we continue to improve our algorithms. More importantly, the hard work of users who have already reported paid links has helped improve the quality of our index for millions. For more information on reporting paid links, check out this Help Center article.

Reporting Webspam

Google has also provided a form to report general webspam since November 2001. We appreciate users who alert us to potential abuses for the sake of the whole Internet community. Spam reports come in two flavors: an authenticated form that requires registration in Webmaster Tools, and an unauthenticated form. We receive hundreds of reports each day. Spam reports to the authenticated form are given more weight and are individually investigated more often. Spam reports to the unauthenticated form are assessed in terms of impact, and a large fraction of those are reviewed as well. As Udi Manber, VP of Engineering & Search Quality, mentioned in his recent blog post on our Official Google Blog, in 2007 more than 450 new improvements were made to our search algorithms. A number of those improvements were related to webspam. It's no overstatement to say that users who have taken the time to report spam were essential to many of those algorithmic enhancements.

Going forward

As users' expectations of search increase daily, we know it's important to provide a high quality index with relevant results. We're always happy to hear stories in our Webmaster Help Group from users who have reported spam with noticeable results. Now that you know how Google uses feedback to improve our search quality, you may want to tell us about webspam you've seen in our results. Please use our authenticated form to report paid links or other types of webspam. Thanks again for taking the time to help us improve.

Improving on Robots Exclusion Protocol

Web publishers often ask us how they can maximize their visibility on the web. Much of this has to do with search engine optimization -- making sure a publisher's content shows up on all the search engines.

However, there are some cases in which publishers need to communicate more information to search engines -- like the fact that they don't want certain content to appear in search results. And for that they use something called the Robots Exclusion Protocol (REP), which lets publishers control how search engines access their site: whether it's controlling the visibility of their content across their site (via robots.txt) or down to a much more granular level for individual pages (via META tags).

Since it was introduced in the early '90s, REP has become the de facto standard by which web publishers specify which parts of their site they want public and which parts they want to keep private. Today, millions of publishers use REP as an easy and efficient way to communicate with search engines. Its strength lies in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and in the way it works for any publisher, no matter how large or small.

While REP is observed by virtually all search engines, we've never come together to detail how we each interpret different tags. Over the last couple of years, we have worked with Microsoft and Yahoo! to bring forward standards such as Sitemaps and offer additional tools for webmasters. Since the original announcement, we have delivered, and will continue to deliver, further improvements based on what we are hearing from the community.

Today, in that same spirit of making the lives of webmasters simpler, we're releasing detailed documentation about how we implement REP. This will provide a common implementation for webmasters and make it easier for any publisher to know how their REP directives will be handled by three major search providers -- making REP more intuitive and friendly to even more publishers on the web.

So, without further ado...

Common REP Directives
The following list covers all the major REP features currently implemented by Google, Microsoft, and Yahoo!. With each feature, you'll see what it does and how you should communicate it.

Each of these directives can be specified to be applicable for all crawlers or for specific crawlers by targeting them to specific user-agents, which is how any crawler identifies itself. Apart from the identification by user-agent, each of our crawlers also supports Reverse DNS based authentication to allow you to verify the identity of the crawler.
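Reverse DNS based authentication, as mentioned above, usually means a reverse lookup on the visiting IP address followed by a forward lookup that must lead back to the same address. The sketch below is one way to express that under stated assumptions: `verify_crawler` and the `TRUSTED_SUFFIXES` list are hypothetical, and the hostname suffix should be replaced by the domains each search engine documents for its own crawlers. The lookups are injectable so the logic can be exercised without network access.

```python
import socket

# Assumption: illustrative suffix only; consult each search engine's
# documentation for the hostnames its crawlers actually resolve to.
TRUSTED_SUFFIXES = (".googlebot.com",)


def verify_crawler(ip, reverse_lookup=None, forward_lookup=None,
                   trusted_suffixes=TRUSTED_SUFFIXES):
    """Return True if ip reverse-resolves to a trusted host that resolves back to ip."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(trusted_suffixes):
            return False
        # Forward-confirm: the claimed hostname must map back to the same IP.
        return ip in forward_lookup(hostname)
    except OSError:
        # DNS failures are treated as "not verified".
        return False
```

The forward confirmation matters because anyone who controls reverse DNS for their own address range can make it return an arbitrary hostname.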

1. Robots.txt Directives
  • Disallow
    Impact: Tells a crawler not to index your site -- your site's robots.txt file still needs to be crawled to find this directive, but disallowed pages will not be crawled.
    Use cases: 'No Crawl' page from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.

  • Allow
    Impact: Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow.
    Use cases: This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it.

  • $ Wildcard Support
    Impact: Tells a crawler to match everything from the end of a URL -- large number of directories without specifying specific pages.
    Use cases: 'No Crawl' files with specific patterns, for example, files with certain filetypes that always have a certain extension, say pdf.

  • * Wildcard Support
    Impact: Tells a crawler to match a sequence of characters.
    Use cases: 'No Crawl' URLs with certain patterns, for example, disallow URLs with session ids or other extraneous parameters.

  • Sitemaps Location
    Impact: Tells a crawler where it can find your Sitemaps.
    Use cases: Point to other locations where feeds exist to help crawlers find URLs on a site.
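To make the * and $ wildcard directives above more concrete, here is a minimal Python sketch of how such path patterns could be matched. It is an illustration of the matching idea only, not the implementation used by any search engine; `rule_to_regex` and `matches` are hypothetical helpers, and the sketch does not model how Allow and Disallow rules are prioritized against each other.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if the URL path is covered by the pattern."""
    return rule_to_regex(pattern).match(path) is not None


# Hypothetical rules and paths:
print(matches("/*.pdf$", "/docs/report.pdf"))                # $ anchors the end
print(matches("/*.pdf$", "/docs/report.pdf?download=1"))     # no longer ends in .pdf
print(matches("/*?sessionid=", "/shop/item?sessionid=123"))  # * spans any characters
print(matches("/private/", "/private/page.html"))            # plain prefix match
```

Without a trailing $, a pattern behaves as a prefix, which is why "/private/" covers every path beneath that directory.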

2. HTML META Directives
  • NOINDEX META Tag
    Impact: Tells a crawler not to index a given page.
    Use cases: Don't index the page. This allows pages that are crawled to be kept out of the index.

  • NOFOLLOW META Tag
    Impact: Tells a crawler not to follow a link to other content on a given page.
    Use cases: Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW you let the robot know that you are discounting all outgoing links from this page.

  • NOSNIPPET META Tag
    Impact: Tells a crawler not to display snippets in the search results for a given page.
    Use cases: Present no snippet for the page on Search Results.

  • NOARCHIVE META Tag
    Impact: Tells a search engine not to show a "cached" link for a given page.
    Use cases: Do not make available to users a copy of the page from the Search Engine cache.

  • NOODP META Tag
    Impact: Tells a crawler not to use a title and snippet from the Open Directory Project for a given page.
    Use cases: Do not use the ODP (Open Directory Project) title and snippet for this page.

These directives are applicable for all forms of content. They can be placed in either the HTML of a page or, for non-HTML content such as PDF or video, in the HTTP header using an X-Robots-Tag. You can read more about it in our X-Robots-Tag post or in our series of posts about using robots and META tags.
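As a rough illustration of the two placements described above, the Python sketch below renders the same set of directives once as an HTML META tag and once as an X-Robots-Tag HTTP header. The helper names `robots_meta_tag` and `x_robots_tag_header` are hypothetical; how you attach the header to a response depends on your web server or framework.

```python
def robots_meta_tag(directives):
    """Return an HTML META tag carrying the given robots directives."""
    content = ", ".join(directives)
    return f'<meta name="robots" content="{content}">'


def x_robots_tag_header(directives):
    """Return a (name, value) HTTP header pair carrying the same directives."""
    return ("X-Robots-Tag", ", ".join(directives))


# For an HTML page, the tag goes in the page's <head>.
print(robots_meta_tag(["noindex", "noarchive"]))

# For a PDF or video, the header is sent with the HTTP response instead.
print(x_robots_tag_header(["noindex", "noarchive"]))
```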

Other REP Directives
The directives listed above are used by Microsoft, Google and Yahoo!, but may not be implemented by all other search engines. In addition, the following directives are supported by Google but, unlike those above, are not supported by all three search engines:

UNAVAILABLE_AFTER Meta Tag - Tells a crawler when a page should "expire", i.e., after which date it should not show up in search results.

NOIMAGEINDEX Meta Tag - Tells a crawler not to index images for a given page in search results.

NOTRANSLATE Meta Tag - Tells a crawler not to translate the content on a page into different languages for search results.

Going forward, we plan to continue to work together to ensure that as new uses of REP arise, we're able to make it as easy as possible for webmasters to use them. So stay tuned for more!

Learn more
You can find out more about robots.txt in our documentation and at Google's Webmaster help center, which contains lots of helpful information. We've also done several posts in our webmaster blog about robots.txt that you may find useful. There is also a useful list of the bots used by the major search engines.

To see what our colleagues have to say, you can also check out the blog posts published by Yahoo! and Microsoft.

Sunday, June 1, 2008

How Google defines IP delivery, geolocation, and cloaking

Many of you have asked for more information regarding webserving techniques (especially related to Googlebot), so we made a short glossary of some of the more unusual methods.
  • Geolocation: Serving targeted/different content to users based on their location. As a webmaster, you may be able to determine a user's location from preferences you've stored in their cookie, information pertaining to their login, or their IP address. For example, if your site is about baseball, you may use geolocation techniques to highlight the Yankees to your users in New York.

    The key is to treat Googlebot as you would a typical user from a similar location, IP range, etc. (i.e. don't treat Googlebot as if it came from its own separate country—that's cloaking).

  • IP delivery: Serving targeted/different content to users based on their IP address, often because the IP address provides geographic information. Because IP delivery can be viewed as a specific type of geolocation, similar rules apply. Googlebot should see the same content a typical user from the same IP address would see.

    (Author's warning: This 7.5-minute video may cause drowsiness. Even if you're really interested in IP delivery or multi-language sites, it's a bit uneventful.)

  • Cloaking: Serving different content to users than to Googlebot. This is a violation of our webmaster guidelines. If the file that Googlebot sees is not identical to the file that a typical user sees, then you're in a high-risk category. To verify that two files are identical, you can compare their hashes with a program such as md5sum, or compare the files directly with diff.

  • First click free: Implementing Google News' First click free policy for your content allows you to include your premium or subscription-based content in Google's websearch index without violating our quality guidelines. You allow all users who find your page using Google search to see the full text of the document, even if they have not registered or subscribed. The user's first click to your content area is free. However, you can block the user with a login or payment request when he clicks away from that page to another section of your site.

    If you're using First click free, the page displayed to users who visit from Google must be identical to the content that is shown to the Googlebot.
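The hash comparison mentioned in the cloaking entry above can be sketched in Python as follows. This is a minimal illustration, assuming you have already saved the response served to a typical user and the response served to Googlebot as bytes; `md5_hex` and `same_content` are hypothetical helper names. MD5 is used only to mirror md5sum; any hash function from `hashlib` would serve for an equality check.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string, like md5sum prints."""
    return hashlib.md5(data).hexdigest()


def same_content(page_for_user: bytes, page_for_crawler: bytes) -> bool:
    """Return True if both responses hash to the same digest."""
    return md5_hex(page_for_user) == md5_hex(page_for_crawler)


# Hypothetical saved responses:
user_page = b"<html><body>Full article text</body></html>"
crawler_page = b"<html><body>Full article text</body></html>"
print(same_content(user_page, crawler_page))
```

When the digests differ, a line-by-line diff of the two saved files shows where the served content diverges.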
Still have questions?  We'll see you at the related thread in our Webmaster Help Group.