Up until now, if you wanted your website to show up in the search engines, you submitted your home page URL to the engine’s crawler, and then waited for the friendly neighborhood search engine spider to come by and index your site. Sometimes the wait could take months. In some cases, you had the option of paying to get your site indexed; most search engines have moved away from paid inclusion programs, though Yahoo! still offers one. However you slice it, though, it is a frustrating process, particularly for those with little patience.
It is even more frustrating for those with very dynamic websites. If you have content that changes on an almost-daily basis, and the search engine spiders only visit your website once a week, you’re seeing some missed opportunities. Active bloggers face this problem with their sites, but so do firms with websites that focus on news, enthusiast sites that feature fresh content daily, and other commercial sites. It’s good to have lots of fresh content, but how do you make sure that search engines get wind of all of it as quickly as possible?
Not surprisingly, Google set itself to work on this very problem. As early as May 6, Shiva Shivakumar, Google’s engineering director, reported a possible solution in a blog post that he wrote on Google’s website. In early June, the solution itself became more widely available. It’s called Google Sitemaps, and Shivakumar expects that it “will either fail miserably, or succeed beyond our wildest dreams, in making the web better for webmasters and users alike.” Though this free service is still in beta, it has already received some positive reviews from bloggers, who are either thinking about using it or already using it, and it isn’t just for bloggers, either.
Google describes Google Sitemaps as “an experiment in web crawling.” It is a way for those with frequently-updated websites to inform Google as to when and how often they want the search engine to index their content. It is meant to supplement, not replace, the usual indexing of websites that Google already does on its own. Google hopes that it will help it succeed in its never-ending battle to index all publicly available information.
Webmasters sign up for the program at the home page for Google Sitemaps (https://www.google.com/webmasters/sitemaps/login). The introduction explains that users need to create a Sitemap in the correct format using the Sitemap Generator (https://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html). The generator helps you to create an XML Sitemap. You then place this file on your Web server, and update your Sitemap whenever you make changes to your site. Obviously, you have to tell Google where this Sitemap is, so the spider will know where to go. In addition to the URLs you want crawled, you can include information about the URLs, such as when the page last changed, how often the page changes, and the relative priority of the pages. It is possible to set it up so that Google is automatically informed when your Sitemap changes, so the spider can come by and index the newest version.
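To make the shape of such a file concrete, here is a minimal sketch in Python (the language of Google's own generator) that builds a small XML Sitemap by hand. It is not the Sitemap Generator itself. The element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace follow the Sitemap protocol as published at sitemaps.org, which is an assumption here, since the beta may use a different schema URI. The function name build_sitemap and the example.com URLs are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the sitemaps.org 0.9 protocol; the beta
# described in this article may expect a different schema URI.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return a Sitemap XML string for a list of page dictionaries.

    Each dictionary needs a "loc" key (the URL) and may carry the
    optional hints "lastmod", "changefreq", and "priority".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # Optional hints: last change date, change frequency, relative priority.
        for field in ("lastmod", "changefreq", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages on a hypothetical site.
pages = [
    {"loc": "http://www.example.com/", "lastmod": "2005-06-10",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "http://www.example.com/archive/", "changefreq": "monthly",
     "priority": 0.3},
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The resulting string would be saved to a file on the Web server and regenerated whenever the site changes. The optional fields are hints to the crawler rather than commands.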
One interesting point about the Sitemap Generator is that it is an open source client in Python. And the project itself is being released under an Attribution/Share Alike Creative Commons license, with the idea that other search engines will pick it up to improve their own indexing of the Internet. For those who don’t know, Creative Commons is a not-for-profit organization that develops flexible alternatives to the most restrictive forms of copyright, rather like open source licenses themselves. An Attribution/Share Alike license allows users “to copy, distribute, display, and perform the work…to make derivative works…to make commercial use of the work…Under the following conditions: You must attribute the work in the manner specified by the author or licensor…If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.”
Several other points are worth noting. First, the Sitemap Generator is intended to work regardless of the size of your website. So whether you have a simple blog site or millions of pages that are changing all the time, Google Sitemaps should be able to help you. Second, using Google Sitemaps will not increase your PageRank. Third, there is absolutely no guarantee, even with this program, that Google will crawl or index all of your URLs; remember, this is still a beta. Finally, as with some (though not all) Google betas, this one has a discussion/support page on Google Groups (http://groups-beta.google.com/group/google-sitemaps?hl=en), with 600 members as of this writing.
At least one blogger (Jeremy Zawodny) wondered why Google created a whole new system with XML rather than using ping services like Feedster and Technorati. Another blogger (Nathan Weinberg) believes that Google did so because such services would be woefully inefficient for the growth we can expect to see in the use of Google Sitemaps. Indeed, he stated his belief that many publishers do not like using ping services, or related RSS services, because of the control it forces them to give up, and that, for various reasons, RSS would be useless for Sitemaps.
Interestingly, though, according to Google’s FAQ about Sitemaps (https://www.google.com/webmasters/sitemaps/docs/en/faq.html), Google does support RSS. Indeed, despite creating an XML system, Google supports a number of formats for Sitemap submission, including the very simplest: a text file containing a list of URLs, with one URL per line. This might be inefficient (and indeed, Google encourages webmasters to use its XML system), but it does make the service more all-inclusive, and inclusion is, after all, the point.
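The plain-text format is simple enough to produce in a few lines. The sketch below assumes only what is stated above: one URL per line. The helper name text_sitemap is hypothetical, and skipping blank and duplicate entries is a convenience added here, not a documented requirement.

```python
def text_sitemap(urls):
    """Return the contents of a plain-text Sitemap: one URL per line."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        # Skip blank entries and duplicates so each URL is listed once.
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "\n".join(lines) + "\n"


# Hypothetical URLs for illustration.
listing = text_sitemap([
    "http://www.example.com/",
    "http://www.example.com/news/today.html",
    "http://www.example.com/",  # duplicate, dropped
])
print(listing, end="")
```

The trade-off is visible here: a text list carries no information about when a page last changed or how important it is, which is what the XML format adds.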
In an interview with Danny Sullivan of Search Engine Watch, Shiva Shivakumar gave one answer that could raise concerns. The question was whether Google requires submitters of URLs to prove in some way that they are associated with the site for which they are submitting. Shivakumar responded that “We accept all the URLs under the directory where you post the Sitemap. For example, if you have posted a Sitemap at www.example.com/abc/sitemap.xml, we assume that you have permission to submit information about URLs that begin with www.example.com/abc/.” In other words, the ability to place a file in a directory is treated as proof of authority over everything beneath it. I don’t know if it is possible to hack such files, but if it is, this could be a security risk for any site using Sitemaps.
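The directory rule in that quote can be expressed as a short check. This is only a sketch of the rule as Shivakumar states it, not Google's actual logic. The function name in_sitemap_scope is hypothetical, and the requirement that scheme and host match is an assumption added for the example.

```python
from urllib.parse import urlsplit


def in_sitemap_scope(sitemap_url, page_url):
    """Return True if page_url falls under the directory holding the Sitemap."""
    sm = urlsplit(sitemap_url)
    pg = urlsplit(page_url)
    # Directory of the Sitemap file, with a trailing slash: "/abc/sitemap.xml" -> "/abc/"
    directory = sm.path.rsplit("/", 1)[0] + "/"
    # Assumption: the page must be on the same scheme and host as the Sitemap.
    same_site = (sm.scheme, sm.netloc) == (pg.scheme, pg.netloc)
    return same_site and pg.path.startswith(directory)


sitemap = "http://www.example.com/abc/sitemap.xml"
print(in_sitemap_scope(sitemap, "http://www.example.com/abc/page.html"))
print(in_sitemap_scope(sitemap, "http://www.example.com/other/page.html"))
```

Under this reading, a Sitemap posted in /abc/ can describe pages below /abc/ but not pages elsewhere on the host, so a Sitemap placed at the site root covers the entire site.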
Another issue Sullivan raised was spam. Specifically, he wondered how Google would prevent people from using Google Sitemaps to spam the index in bulk. Shivakumar pointed out that Google is constantly developing new techniques for the management of index spam, and that those techniques would continue to apply with Google Sitemaps.
Finally, in the same interview, Sullivan wondered about Google’s future plans for Google Sitemaps. Would the company provide a reporting tool eventually, so that webmasters can tell what searches are sending them clicks? Shivakumar’s response was encouraging. “We are starting with some basic reporting, showing the last time you’ve submitted a Sitemap and when we last fetched it. We hope to enhance reporting over time, as we understand what the webmasters will benefit from.” He encouraged users to send the company ideas through the aforementioned Google Group covering Google Sitemaps.
For once, the answer to the question of why Google is doing this seems pretty obvious. According to its own corporate information, “Google’s mission is to organize the world’s information and make it universally accessible and useful.” Google Sitemaps is a direct extension of that mission; it makes it easier for webmasters to submit, and for Google to find, fresh information on websites.
But why is Google sharing the technology? Shivakumar stated in his blog that it was “so that other search engines can do a better job as well. Eventually we hope this will be supported natively in webservers (e.g. Apache, Lotus Notes, IIS).” While Google’s culture is such that I can believe it values open source, and would even be glad to see other search engines doing a better job, I think the key is getting native support on Web servers. It is well known that Apache is the most popular Web server on the Internet, and Apache is open source. If Google truly wants to see usage of Google Sitemaps spread far and wide, sharing the technology like this is the fastest way to do it.
The easier it is to use Google Sitemaps, the more likely it is that webmasters will use it. Getting native support on popular Web servers would make Google Sitemaps easier to use. As more websites use Google Sitemaps, it will make Google’s job easier, too. As the project continues to develop, and eventually works its way out of beta, it should significantly shorten the amount of time webmasters and site owners must wait before new content is indexed, and that should be easier on everybody’s nerves.