Preventing Duplicate Content on an E-Commerce Site from Session IDs

It’s vital for your e-commerce website to rank well in the search engines. Unfortunately, you may be fighting duplicate content issues you’ve never even heard of, issues that can cause your site to rank poorly in Google and never be seen by searchers. If your site uses session IDs to track visitors, you need to read this article. We’ll show you the problem and provide several good solutions.

E-commerce keeps getting more and more popular. According to a 2008 survey by Nielsen (source: http://www.nielsen.com/media/2008/pr_080128b.html ), the number of Internet shoppers increased by 40 percent in just two years!

The growth and popularity of online shopping has encouraged the creation of many e-commerce websites selling all kinds of products. To generate sales, these sites need plenty of visitors arriving from popular search engines such as Google.

To track visitors, these sites use session IDs: long strings of characters appended to their URLs. Every unique shopper on an e-commerce website receives a unique session ID.

The session ID is used from the time the visitor starts shopping until they complete checkout. Session IDs also serve a security purpose, ensuring that all transactions on the website are traceable.
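Conceptually, appending a session ID to a URL works like the following PHP sketch. This is purely illustrative and is not osCommerce's actual code:

<?php
// Illustrative sketch only, not osCommerce's actual code.
session_name('osCsid');   // osCommerce calls its session ID "osCsid"
session_start();          // create or resume this visitor's session

// Append the session ID to an internal link so the session survives
// from page to page, even when the visitor's browser refuses cookies.
$url = 'http://www.mywebsite.com/buymymusic.html?osCsid=' . session_id();

echo $url;
// e.g. http://www.mywebsite.com/buymymusic.html?osCsid=4e2f1...
?>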

osCommerce, an open source shopping cart suite, is one of the most popular e-commerce applications. It relies heavily on session IDs for its day-to-day shopping operations.

This article focuses on correcting the duplicate content that session IDs cause on osCommerce-based websites.

The Problem with Session IDs

So what's the problem? Since these session IDs are appended to the URL, they cause tremendous problems for search engines. Say you have this canonical URL on your website:

http://www.mywebsite.com/buymymusic.html

This is the URL that should be indexed by Google, since it is the official version. But when Googlebot visits the site, your web server will instead serve a URL like this one:

http://www.mywebsite.com/buymymusic.html?osCsid=4e2f1

Google indexes this URL and sees duplicate content, because the canonical page and the page it is looking at now have two different URLs. The next time Googlebot visits the site, your web server might assign yet another session ID, for example:

http://www.mywebsite.com/buymymusic.html?osCsid=5c3g1

This process repeats itself, and it makes your site very difficult for search engines to understand, because they now see many URLs containing the same content.

This means you face a serious duplicate content issue. The side effects of duplicate content caused by session IDs include:

  • The number of URLs in the Google index balloons; Google will take longer to determine the important pages on your site, because its index is cluttered with duplicate content URLs.
  • The relevance score of your canonical URL drops, because its authority is diluted across the many duplicate URLs carrying session IDs.
  • A low relevance score means lower rankings in Google, lower rankings mean less traffic, and less traffic means fewer online sales.

Blocking Session IDs with Robots.txt

The easiest and fastest way to fix the session ID indexing issue is to block session ID URLs in the robots.txt file. This is recommended only at the launch stage of your website. The primary reasons are:

  • Googlebot still has to find the canonical URLs on your site for the first time, so it is wise to give it crawling directions, such as blocking the session ID URLs and providing a list of canonical URLs in a sitemap or in navigational links.
  • If you do this at a later stage of your website (when Googlebot has already indexed thousands of URLs with session IDs), you will probably lose some of the search engine traffic you already have, because the URLs that currently rank will be blocked from crawling.

Using robots.txt, we can write syntax that prevents robots from crawling session ID URLs.

As in our previous example, an osCommerce session ID URL takes this form:

http://www.mywebsite.com/buymymusic.html?osCsid=5c3g1

And the robots.txt syntax to block this URL will now be:

User-agent: *
Disallow: /*osCsid
Sitemap: http://www.mywebsite.com/sitemap.xml

Remember the important rule for robots.txt: the file must be uploaded to the root directory of your web server.

What about a sitemap.xml file? This file also needs to be updated and uploaded to your server's root directory. Here are the important rules for selecting the URLs to list in the sitemap.xml file:

  • It should not contain session IDs.
  • It should list only canonical URLs; do not include URLs whose content duplicates or closely resembles the content at the canonical URLs.
  • It should list only the most important pages on your site, excluding low-value pages such as Contact Us pages.
  • No URL listed in the sitemap.xml file should be blocked in robots.txt.

The dynamic version of the sitemap, such as sitemap.php or sitemap.asp, should not list URLs containing session IDs.
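To make this concrete, here is a minimal sitemap.xml sketch that follows these rules. The two URLs are illustrative (the second path is invented for this example); substitute your own canonical URLs:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Canonical URLs only: no osCsid parameter anywhere -->
  <url>
    <loc>http://www.mywebsite.com/buymymusic.html</loc>
  </url>
  <url>
    <loc>http://www.mywebsite.com/osc/specials.php</loc>
  </url>
</urlset>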

The Prevent Spider Sessions Feature

The osCommerce admin configuration includes a very useful feature called “Prevent Spider Sessions.” This feature works like this:

  • Googlebot visits the website URL containing a session ID.
  • The server performs an Apache mod_rewrite and automatically 301 redirects the session ID URL to the canonical URL. So if Googlebot finds this one:

    http://www.yoursite.com/osc/specials.php?osCsid=cd5627128b63b13553aea5b6c2b3d65c

    The server will do a server-side 301 redirect to http://www.yoursite.com/osc/specials.php . Therefore, instead of Googlebot indexing URLs containing session IDs, it will crawl and index the canonical version (without a session ID).

This should be set up at the earliest stage of the website's development, ideally before you allow Googlebot to crawl the website's pages.
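In case you are curious what such a redirect looks like at the server level, here is a minimal, hypothetical Apache .htaccess sketch that 301 redirects a URL whose query string consists only of an osCsid parameter (as in the example above) to the same path without it. This only illustrates the redirect technique; osCommerce handles the actual logic internally once “Prevent Spider Sessions” is enabled:

# Hypothetical sketch: 301 redirect a URL whose query string is
# nothing but an osCsid parameter to the same path without it.
RewriteEngine On
RewriteCond %{QUERY_STRING} ^osCsid=[^&]+$ [NC]
# The trailing "?" in the substitution discards the query string.
RewriteRule ^(.*)$ /$1? [R=301,L]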

To implement this solution, do the following: 

  • Log into your website's osCommerce admin panel.
  • Under Administration, find “Configuration.”
  • Under “Configuration,” find “Sessions.”
  • In “Sessions,” find the “Prevent Spider Sessions” entry and click “Edit.”
  • In the edit options, select “True” and click “Update.”

After editing, the “Prevent Spider Sessions” entry in the Sessions list should read “True.”

To see the list of allowed spiders, navigate to /osc/includes and find the spiders.txt file. Be careful about editing this file, and always make a backup first.
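This file is a plain text list of user agent substrings, one per line; a visitor whose user agent matches an entry is treated as a spider and is not given a session. Its entries look similar to the following abbreviated, illustrative excerpt (not the full list that ships with osCommerce):

googlebot
msnbot
slurp
crawl
spider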

This is an excellent corrective action in the early stages of your site, when Googlebot has not yet indexed it. Indeed, it is better to take this action than the one discussed in the previous section.

However, if Googlebot has already started indexing your site, along with the ugly session IDs, this solution can create duplicate content issues, because Googlebot will now index the canonical URLs too. This creates content that duplicates what Google already found at the previously indexed session ID URLs.

To fix this issue permanently requires another corrective action, which I’ll discuss in the next section.

The Link Rel=Canonical Solution

After many years of duplicate content desperation, Google finally came up with a solution that allows webmasters to specify their preferred (canonical) URLs: the link rel="canonical" tag.

<link rel="canonical" href="http://www.yourwebsite.com/yourpreferredurl.php" />

How does this work? It’s very simple. By placing this link rel="canonical" tag in any of your affected website template files, you tell Google which URL you prefer to have indexed, without either you or Google being forced to deal with an in-depth technical solution.

For example:

Suppose http://www.yoursite.com/osc/products_new.php?osCsid=cf66b6d1ecc142348775790bef595556 is indexed by Google. When Googlebot visits this URL again after you have done the reconfiguration described in the previous section, it is now confused.

Since http://www.yoursite.com/osc/products_new.php is the canonical URL, we will specify the link rel=”canonical” tag in the products_new.php template.

Copy and paste this code to the affected template file:

<link rel="canonical" href="http://www.yoursite.com/osc/products_new.php" />

To do this, download the template file to your desktop, edit it (do not forget to make a backup!), and then upload it back to your server via FTP.

The link rel=”canonical” tag should be placed in the head section of the template file.
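If many templates are affected, hardcoding a URL into each one becomes tedious. As an alternative, here is a hypothetical PHP helper that builds the canonical tag dynamically by stripping the osCsid parameter from the current request. It is not part of osCommerce, and the function name and domain are assumptions for this sketch:

<?php
// Hypothetical helper (not part of osCommerce): print a canonical
// <link> tag for the current page with the osCsid parameter removed.
function emit_canonical_tag() {
    // Rebuild the query string without the session ID.
    $params = $_GET;
    unset($params['osCsid']);
    $query = http_build_query($params);

    // Path of the current request, without its query string.
    $path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

    // Assumed domain; replace it with your own.
    $canonical = 'http://www.yoursite.com' . $path;
    if ($query !== '') {
        $canonical .= '?' . $query;
    }

    echo '<link rel="canonical" href="'
       . htmlspecialchars($canonical) . '" />' . "\n";
}

// Call this inside the head section of the template:
emit_canonical_tag();
?>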

If this is done correctly, the canonical tag will appear in the head section of every affected page's HTML source.

Link rel=”canonical” is a highly important solution for the following reasons:

  • It transfers PageRank and link juice.
  • It also transfers the other URL signals used to establish relevance in the search engines. This ensures that you will not lose any of your earned relevance and will continue to rank well.

Conclusion

Duplicate content issues due to session IDs are serious, because they affect your rankings in the search engines. All of the corrective actions outlined here are feasible, but the most highly recommended are the following:

  • Create and use a sitemap listing your preferred canonical URLs, with no session IDs. The XML sitemap should be uploaded to the root directory and submitted in the Google Webmaster Tools Sitemaps section.
  • Enable “Prevent Spider Sessions” in the osCommerce admin setup.
  • Specify your canonical URLs using the link rel=”canonical” tag in all of your affected website templates.

With these recommendations in place, when Googlebot requests a URL containing a session ID, your server will return the canonical URL. Furthermore, when Googlebot re-crawls a session ID URL it indexed in the past, it will recognize the official URL from the rel=”canonical” tag and update its index to show your preferred version.
