A site map should be one of the most important and best-maintained pages on your site. It helps visitors to navigate your site effectively and quickly find the information that they are looking for, and it shows them that you care about their surfing experience. A site map is also required to meet W3C accessibility standards. An easy-to-use and carefully designed site will help to ensure that visitors return to your site instead of getting frustrated and forgetting it.
A site map will also find favor with search engines as it is basically a list of all of the pages in your site. But in addition to your human-readable site map (well, browser-readable at least), there is also another type of Sitemap that you should consider using. This is known as a Google Sitemap and is a way for you to complement your existing site map for humans with something to make finding and indexing your site easier for bots and spiders.
While your human-digestible site map will generally be written in a language easily interpreted and rendered by browsers, a Google Sitemap is written in a language designed to be understood by the automated crawlers that traverse the web discovering URLs. That language is based on XML, the universal language of data transfer. It is called the Sitemap Protocol and was created by Google to complement existing URL discovery methods.
It has also been designed to be interoperable between different search engines, not just Google, so once you have created your Google Sitemap, there may well be other search engines that you can submit it to. It was recently announced that both Yahoo! and MSN will support the protocol. Other than the Sitemap Protocol, you could also use the Open Archives Initiative Protocol for Metadata Harvesting, an RSS feed or a plain text file, but for the duration of this article we’ll be looking only at the open-source Sitemap Protocol.
Google Sitemaps are not intended to replace traditional URL harvesting methods. Your Google Sitemap file is not meant to be used in place of any existing HTML site maps you may already have, and it is not Google’s new way of indexing websites. The idea is that you submit information to Google (through your Google account) which tells them that you have a Sitemap file and where this file is located. Google will then send a crawler to your site to find the Sitemap file and use it to thoroughly index your whole site (or at least the part of it covered in the file). Google Sitemaps also do not guarantee that your site will be indexed at all, and they are not a way of improving your SERP listings, your PageRank or anything else.
If you can’t increase your rankings or guarantee speedy indexing, you may be asking yourself, "What is the point of using a Google Sitemap?" Using Google Sitemaps allows you to provide additional information about your pages, including when they were last updated. If a page has not been updated since it was last indexed, the crawler knows that it does not need to index the page again, saving both time and bandwidth. A Sitemap is also useful for websites whose content may otherwise be ignored, such as sites with Flash-based navigation interfaces, where the pages linked to would otherwise be invisible to search engines. All in all, it is a good way of telling Google about your site and its URLs.
There are several different ways that you could create the Sitemap file. You could download and configure the Google Sitemap Generator, use a third-party software application or create the file yourself manually. The Google Sitemap Generator is a Python script which uses a specially configured XML file to index your web site, create the Sitemap file and send the results directly to Google. This is a very quick and easy way of doing things, provided you are able to run scripts on the web server that hosts the site you’re making the Sitemap file for, which, unless you run your very own web server, is unlikely.
There are also applications that can do the same thing for you and much more. Some will generate an HTML sitemap at the same time as the XML Sitemap file to save your having to do this. These third-party applications often require a paid-for license to use, so they may not be the best option for everyone. Finally, you can create your Sitemap file manually, which is the technique we are going to be looking at in this article.
All that you need to do to create a Google Sitemap file can be done with just a simple text editor, so open the one that you use and we’ll look at the code that is required. The Sitemap Protocol is written using XML, and as all valid XML files must begin with the XML declaration, this is what we start with:
<?xml version="1.0" encoding="UTF-8"?>
This just states the XML version in use and the type of encoding. Google Sitemap files must use UTF-8. The next element to appear must be the <urlset> element, which declares the schema that the file must conform to:
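As a minimal sketch, the opening and closing <urlset> tags might look like the following. The namespace URL shown is the one used by version 0.9 of the protocol and is an assumption here; earlier Google-specific versions of the protocol used a different namespace URL, so check which version you are targeting.

```xml
<!-- Root element; the xmlns value assumes version 0.9 of the Sitemap Protocol -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per page goes here -->
</urlset>
```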
The namespace (xmlns) is a unique resource in the format of a URL that states the structure you are using when you create your Sitemap file. The next element is the <url> element, which acts as a container for the information about a single page of your site; you add one <url> element for each page you want to list. The child elements of the <url> element provide additional information about that page. The first child element of the <url> element is the <loc> or location element, which identifies the page by its URL:
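A <url> entry containing only a <loc> element could be sketched like this. The domain www.example.com is a placeholder, not a real site from this article, and the fragment belongs inside the <urlset> element.

```xml
<!-- A single page entry; www.example.com is a placeholder domain -->
<url>
  <loc>http://www.example.com/</loc>
</url>
```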
The data within the <loc> element must start with the protocol in use (HTTP in this case) and must end in a trailing slash if an individual page isn’t specified. You could specify your root directory, subdirectories or individual pages. So you could also use something like this:
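For illustration, here is one entry for a subdirectory and one for an individual page. The paths are invented for this sketch, and both fragments would sit inside the <urlset> element.

```xml
<!-- A subdirectory: no page is named, so the URL ends in a trailing slash -->
<url>
  <loc>http://www.example.com/products/</loc>
</url>
<!-- An individual page: no trailing slash needed -->
<url>
  <loc>http://www.example.com/products/widgets.html</loc>
</url>
```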
Dynamically generated URLs can also be used, but any entity characters (&, <, >, ' and ") must be escaped correctly. The maximum length of the URL in the <loc> element is 2048 characters, which should be more than enough for most dynamic URLs. The following URL would be considered valid:
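A dynamic URL with two query-string parameters, for example, might be written like this. The script name and parameters are hypothetical and chosen only to show the escaping.

```xml
<!-- The raw URL is catalog.php?category=books&page=2; the & is written as &amp; -->
<url>
  <loc>http://www.example.com/catalog.php?category=books&amp;page=2</loc>
</url>
```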
As you can see, the & character has been escaped using &amp;. The other escape codes are &apos; for ', &quot; for ", &gt; for > and &lt; for <.
The <loc> element is the only required element in any <url> element. The optional elements, of which any or all can be used, are <lastmod>, <changefreq> and <priority>.
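A <url> entry that uses <lastmod>, <changefreq> and <priority> together might look like the sketch below. The URL and all of the values are invented for illustration; each element is explained in the paragraphs that follow.

```xml
<url>
  <loc>http://www.example.com/news/</loc>
  <!-- date the page was last modified -->
  <lastmod>2007-01-15</lastmod>
  <!-- how often the page is expected to change -->
  <changefreq>daily</changefreq>
  <!-- importance relative to other pages on the same site -->
  <priority>0.8</priority>
</url>
```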
The <lastmod> date must be in the W3C Datetime format and can include the time if desired. Valid date and time fragments for the Datetime format are:
Year – YYYY
Year and Month – YYYY-MM
Complete Date – YYYY-MM-DD
Complete Date, Hours and minutes – YYYY-MM-DDTHH:MMTZD
You can also include seconds and fractions of seconds if necessary. The date and time are separated by a literal T and the TZD stands for Time Zone Difference, which is the hours and minutes plus or minus from GMT. A full date and time could be:
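For instance, a timestamp with seconds and a zero offset from GMT could be written as follows; the date itself is arbitrary.

```xml
<!-- YYYY-MM-DD, a literal T, hh:mm:ss, then the time zone difference -->
<lastmod>2007-01-15T18:30:15+00:00</lastmod>
```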
The <changefreq> element can be any of the following values: always, hourly, daily, weekly, monthly, yearly or never. This value is just a guide to Google spiders. If you set the <changefreq> of every page to hourly, this doesn’t mean that a spider will be sent hourly to crawl your site.
The <priority> element can be any value between 0.0 and 1.0; if it is not specified, the default priority is 0.5. This element is really only necessary on very large websites that visiting crawlers may not have time to index in full. The <priority> element is relative only to URLs in your domain, so marking all of your URLs with a priority of 1.0 means only that each page in your domain is of equal value, not that your URLs are more important than URLs in someone else’s Google Sitemap with a priority of 0.6. Like <changefreq>, the value is a hint: it suggests to crawlers that the pages with the highest priority in your domain should be indexed before pages with a lower priority.
A complete Sitemap file, from the examples above, would be as follows:
<?xml version="1.0" encoding="UTF-8"?>
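Following on from that declaration, a complete file might look like the sketch below. It assumes the version 0.9 namespace, and the URLs, dates and values are placeholders rather than entries from a real site.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Home page, using all three optional elements -->
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2007-01-15T18:30:15+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <!-- Dynamic page; the & in the query string is escaped as &amp; -->
  <url>
    <loc>http://www.example.com/catalog.php?category=books&amp;page=2</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```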
Once your Sitemap is created, you should upload it to the highest-level directory to which you have access. It can be compressed using gzip to save on bandwidth and should be less than 10 MB in its uncompressed state. If your site is so big that 10 MB is not enough, you can create multiple Sitemaps, but to do this you will also need to create a Sitemap Index file. This file follows a very similar format to the Sitemap files:
<?xml version="1.0" encoding="UTF-8"?>
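After that declaration, a Sitemap Index file listing two Sitemap files might be sketched as follows. The file names and date are hypothetical, and the namespace again assumes version 0.9 of the protocol.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <sitemap> entry per Sitemap file -->
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2007-01-15T18:30:15+00:00</lastmod>
  </sitemap>
  <!-- <lastmod> is optional and is omitted here -->
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
  </sitemap>
</sitemapindex>
```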
It’s very similar, as you can see, except that <urlset> is replaced by <sitemapindex> and <sitemap> is used instead of <url>. Each <sitemap> element must contain a <loc> element giving the location of a Sitemap file; the only optional tag is the <lastmod> tag, which follows the same date and time format as before.
Finally, you need to tell Google and the other supporting search engines that the Sitemap file exists. You can do this easily and quickly for Google using the Webmaster Tools section of the Submit Content section of Google’s Webmaster Central: www.google.com/webmasters/tools/. Once your site appears in the Site column, select the Add a Sitemap link, choose Add General Web Sitemap from the combo box and enter the URL of the Sitemap file. Your sitemap will then appear in the Sitemaps section of your website information and details are provided on its current status, when it was submitted and the URLs that were submitted. If there are any problems with your Sitemap file, they will be reported.
The methods for submitting to MSN and Yahoo! may vary considerably; in fact, a cursory check of both of these sites reveals no immediately apparent way to submit information relating to the location of your sitemap protocol file. Yahoo! does have a page where you can submit a plain text file to help with indexing your site, so perhaps this will be expanded soon to take in a URL leading to your sitemap file.
The Sitemap Protocol was released in July last year but has grown since then. For example, you can also specify Mobile Sitemaps for providing information on a site created for viewing on PDAs or mobile phones, and at the tail end of November last year, Google announced that if your English-language news site appeared in Google News, you could specify additional publication information using the News Sitemaps XML Definition. There’s no telling what else will be available in the future and which other search engines will take up support.