Create a Customized Google XML Sitemap

An XML sitemap is a sitemap that Google suggests you submit to them through Google Webmaster Tools. It is not the kind of web sitemap you usually see when visiting a website; rather, it is a file that helps Google crawl your site more thoroughly. This article will show you how to generate one.

Note that this kind of sitemap is NOT meant to be accessed by visitors. For example, you should not place a link to the XML sitemap anywhere visitors can follow it. The XML sitemap is intended only for search engine bots, such as Googlebot.

If you are new to sitemaps and still confused, the key difference is this: an ordinary web sitemap is a page built for human visitors, while an XML sitemap is a file built for search engine crawlers.

The purpose of an XML sitemap is to inform Googlebot of all of the canonical URLs on your website, including those it might not find during a normal crawl. There are two important benefits to having an XML sitemap.

First, when combined with robots.txt to prevent duplicate content, an XML sitemap reconfirms your canonical URLs and gives your site better exposure during Googlebot’s crawl. If you eliminate duplicate-content URLs through proper use of an XML sitemap (together with robots.txt, rel="canonical" tags, and redirects), you preserve link equity, and the link juice you have earned flows to the important pages that need to rank well in Google.

Second, if you have a big website (such as a giant e-commerce site) with thousands of product and category URLs, it is practically impossible to list all of them in a normal web sitemap. This is where the XML sitemap is more helpful: it can list a very large number of URLs and tell Googlebot about pages it has not yet discovered. This is particularly helpful for newly launched websites.

This article shows how to create a customized (manual) Google XML sitemap using Excel, following the Google sitemap standard. This “customization” corrects problems encountered when using the traditional methods of creating a Google sitemap (for example, free XML sitemap creator services).

Traditional XML sitemap generators can cause at least two problems. First, they tend to include non-canonical URLs. For example, if http://www.thisisyourwebsite.com/product.php?id=5 and http://www.thisisyourwebsite.com/latestprogrammingbooks.html serve essentially the same content, a popular XML generator will often include both of them in the list, with almost no way to customize the output or declare canonical URLs.

Second, a traditional XML sitemap generator consumes a lot of time crawling unimportant URLs. If you have a big website, using these services can be time consuming (some may take several hours to complete their crawling process). The worst part is that most of the results are unimportant URLs, such as search result URLs, advanced search URLs or thousands of “contact us” URLs.

Proposed solution

The following is a proposed method of customizing your Google XML sitemap which solves the problems encountered when using a traditional sitemap generator. It is a 14-step process.

Step 1: Crawl your website using Xenu’s Link Sleuth; you can download it at the link provided. Install it on your computer, then go to File -> Check URL and enter the root URL of your website, for example:

http://www.php-developer.org/

Or if it is a sub-domain, enter the root of the sub-domain, for example:

http://tools.devshed.com/

DO NOT check “Check External Links.” After all of this is set, click “OK.”

Step 2: Xenu will then crawl your website. Once it finishes, it will show a dialog box with this message:

When you see this message, click “No.”

Step 3. In Xenu, go to File -> Export to TAB separated file -> file name, and use the domain name of your website as the file name. You can save it in a convenient location, such as the Desktop. Use “Text files (*.txt)” as the file type. Finally, click “Save.” This creates a tab-separated text file that Excel can open.

Step 4. To save your Xenu crawl session, go to File -> Save as -> file name; you can still use the domain name as the file name, but this time choose *.xen as the file type. You can re-open this session later without re-crawling your website with Xenu, which saves bandwidth.

Step 5. You can now safely close Xenu by going to File -> Exit.

Step 6. On your desktop, look for the .txt file you have just saved (for example, PHP developer.txt). Right-click the file, select “Open with,” and choose Microsoft Excel. See the screen shot below:

Step 7. The file you have just opened is not in Excel format, so you need to save it as an Excel file. Go to File -> Save As -> file name; you should still use the domain name, but choose “Microsoft Excel Workbook” under “save as type.” For convenience, it can still be saved on the desktop.

Step 8. Go to Data -> Filter -> Auto filter. This will activate Excel’s drop-down filters. First we need to filter out external URLs, because they are not part of your website. In Column C, click the drop-down arrow and select “Custom.”

Under “Status Text,” choose “does not contain” and, in the drop-down on the right, choose “skip external.” (See the screen shot above.)

Step 9. In Column D, apply the same filtering method you used in step 8, but this time, under “Type,” select “contains” and, in the drop-down menu on the right, choose “text/html.” This will display only text/html URLs, which are the ones Google recommends you index.

Step 10. In Column A, you can use the Custom filter (from Excel’s auto filter) to continue removing unimportant rows that are not recommended for indexing. For example, you can remove image URLs like:

http://www.php-developer.org/wp-includes/images/smilies/icon_smile.gif

To filter out .gif files, click the Column A drop-down filter (visible when auto filter is enabled; see the screen shot on the previous page), select “does not contain,” and enter “.gif”. You can extend this filtering to other unimportant file extensions, such as .js, .css, .xml, .doc, and other non-HTML file types.
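
If you prefer to script steps 8 through 10 instead of filtering in Excel, the same logic can be expressed in a few lines of Python with pandas. This is only a sketch: the column names ("Address", "Status-Text", "Type") and the output file name filtered-urls.txt are assumptions based on the columns the article filters (A, C, and D), so check them against the header row of your own Xenu export.

```python
import pandas as pd

# Load the tab-separated Xenu export. The column names are assumptions;
# verify them against the header row of your own export file.
df = pd.read_csv("PHP developer.txt", sep="\t", encoding="latin-1",
                 on_bad_lines="skip")
df = df.dropna(subset=["Address"])

# Step 8 equivalent: drop external URLs (status text contains "skip external").
df = df[~df["Status-Text"].str.contains("skip external", case=False, na=False)]

# Step 9 equivalent: keep only text/html resources.
df = df[df["Type"].str.contains("text/html", case=False, na=False)]

# Step 10 equivalent: drop images and other non-HTML file types.
unwanted = (".gif", ".jpg", ".png", ".js", ".css", ".xml", ".doc")
df = df[~df["Address"].str.lower().str.endswith(unwanted, na=False)]

# One URL per line, ready for the robots.txt test in a later step.
df["Address"].to_csv("filtered-urls.txt", index=False, header=False)
```

The result is the same filtered URL list you would otherwise copy out of Column A by hand.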

Step 11. Select the filtered results in Column A and copy and paste them into a Notepad file. The copy-and-paste process should look like this:

Step 12. You need to make sure that the filtered URLs (they are not yet the final canonical URLs) are NOT BLOCKED by robots.txt. To do this, go to your Google Webmaster Tools account -> Site configuration -> Crawler access, and copy and paste the URLs from your Notepad file into the “URLs” box (“Specify the URLs and user-agents to test against”).

You also need to make sure that the robots.txt content shown under “Text of http://www.thisisyourwebsite.com/robots.txt” is up to date.

See the screen shot below:

When everything has been set, click the “Test” button. If any URLs are blocked (e.g., “Blocked by line…”), remove them from your Notepad file. The URLs that remain are the canonical URLs for your website.
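
If you would like to pre-screen the list locally before running the Webmaster Tools test, Python’s standard library includes a basic robots.txt parser. This is a rough sketch only: urllib.robotparser follows the generic robots exclusion standard rather than every Google-specific rule, so treat the Webmaster Tools result as authoritative, and note that the file names (filtered-urls.txt, canonical-candidates.txt) are illustrative, carried over from the earlier filtering sketch.

```python
from urllib import robotparser

# Point the parser at your live robots.txt (replace the domain with your own).
rp = robotparser.RobotFileParser()
rp.set_url("http://www.thisisyourwebsite.com/robots.txt")
rp.read()

# Read the filtered URLs (one per line) and test each one as Googlebot.
with open("filtered-urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

allowed = [url for url in urls if rp.can_fetch("Googlebot", url)]
blocked = [url for url in urls if not rp.can_fetch("Googlebot", url)]

for url in blocked:
    print("Blocked by robots.txt:", url)

# Keep only the allowed URLs for the header-status check in the next step.
with open("canonical-candidates.txt", "w") as f:
    f.write("\n".join(allowed) + "\n")
```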

Step 13. Make sure all of those URLs return a header status of “200 OK.” Do not include URLs that redirect to another page; include the redirect’s target URL instead. You can use a bulk header status checker for this. Remove any URL that does not return a 200 OK status.
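
If you do not have a bulk header checker at hand, a short script can do the same check. The sketch below uses the third-party requests library and the illustrative file names from the earlier sketches; it keeps only URLs that answer 200 OK directly and prints anything that redirects or fails so you can substitute the target URL by hand. A few servers reject HEAD requests, in which case you can switch the call to requests.get.

```python
import requests

# Read the candidate URLs and keep only those that answer 200 OK directly.
# allow_redirects=False means 301/302 responses are reported, not followed.
with open("canonical-candidates.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

final_urls = []
for url in urls:
    try:
        resp = requests.head(url, allow_redirects=False, timeout=10)
    except requests.RequestException as exc:
        print(f"ERROR {url} ({exc})")
        continue
    if resp.status_code == 200:
        final_urls.append(url)
    else:
        # For redirects, print the target so you can substitute it by hand.
        print(resp.status_code, url, resp.headers.get("Location", ""))

with open("canonical-urls.txt", "w") as f:
    f.write("\n".join(final_urls) + "\n")
```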

Step 14. Once the canonical URLs have been identified, you are ready to create your custom XML sitemap. Go to this URL: http://www.php-developer.org/PHPXML-sitemap-generator.php, then copy and paste all of the canonical URLs produced by Steps 1 through 13. Make sure you enter the home page URL first, with one URL per line.
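
If that generator page is unavailable, the sitemap file itself is simple to produce: the sitemaps.org protocol only requires a <urlset> element containing one <url>/<loc> pair per canonical URL. Here is a minimal sketch, again using the illustrative canonical-urls.txt file from the previous step:

```python
from xml.sax.saxutils import escape

# Read the final canonical URLs (home page first, one URL per line).
with open("canonical-urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Build a minimal sitemap following the sitemaps.org 0.9 protocol:
# a <urlset> element with one <url><loc>...</loc></url> entry per URL.
lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for url in urls:
    lines.append("  <url><loc>%s</loc></url>" % escape(url))
lines.append("</urlset>")

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```

Upload the resulting sitemap.xml to the root of your site and submit its URL through your Google Webmaster Tools account.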
