Duplicate Content in SEO: Detection and Prevention Techniques

Duplicate content is an old SEO issue, yet many websites still suffer from it. It has implications for your Google ranking. This article will show you how to discover, solve, and prevent duplicate content issues both within and outside of your website.

While duplicate content issues will not directly cause a Google ranking penalty, they have other consequences.

First, if you have duplicated content, Google will rank whichever page its algorithm judges to be the original and authoritative one. This might not be the correct page, or the page you would expect to rank. If that happens, the wrong URLs will show up in the search results. You might have already observed this on one of your websites.

Second, having lots of duplicate content on a website can make it inefficient for crawlers such as Googlebot to find unique content. Unique content, remember, has a positive effect on your website’s ranking. Googlebot might not revisit your website often if you have lots of duplicate content.

Third, duplicate content can dramatically increase the number of crawlable URLs on your website. Since many bots (e.g. Googlebot) will crawl your content, this can eat up a significant portion of your website’s bandwidth and slow it down.

Fourth, your customers might have a problem understanding your website’s content if most of its pages are very similar to each other. This will decrease the value of your website in terms of uniqueness and clarity.

Fifth, cases of severe manipulation, such as building doorway pages (which is against Google’s quality guidelines), can lead to your site being banned in Google.

Sixth, many of the same issues can crop up with duplicate content outside of your website. Though most of the time you cannot prevent this from happening, you must take action if someone has copied your content without your permission, because that is against copyright law.

Finally, if you are just building several domains and copying/syndicating content from other websites, you will not get positive ranking results in Google. This approach, incidentally, is also against search engine guidelines.

This article will aim to establish techniques to detect duplicate content and also suggest some general ways to prevent it.

There are two types of duplicate content issues. Internal duplicate content can be found within your website. External duplicate content, on the other hand, can be found outside your domain. 

Let’s deal first with the type you can control, namely internal duplicate content.

You will not need to buy duplicate content checking software, as most of the great tools are free. For instance, you can simply use Google’s search engine to find it.

If you are checking for duplicate content inside your website, you need to start with your home page (the most important page), and then check your posts or content. Take the following steps:

Step 1. Go to your home page.

Step 2. Copy a short snippet, taken randomly, of 20 words found on your home page.

Here is an example snippet taken from the http://www.php-developer.org/ home page:

“we offer free codes with detailed explanation of how it works so that it can help you a lot in”

Step 3. Go to Google and enter a query in the search box that follows the format of the example below.

Google search query command:

site:php-developer.org "we offer free codes with detailed explanation of how it works so that it can help you a lot in"

Step 4. Press the “search” button.

The search query causes Google to search its index for that snippet within your own domain. For more complete results, the query uses site:php-developer.org instead of site:www.php-developer.org, so that Google checks all of the indexed pages of the domain, including the root domain, sub-domains, blogs, etc.

If only the www host is used (e.g. site:www.php-developer.org), Google will not check indexed pages on forums.php-developer.org or tools.php-developer.org. You might have content that duplicates your home page within these sub-domains.

If you only see one search result in Google after doing the query, then your home page does not have duplicate content issues with the other pages of your website.

Here is a screen shot to illustrate this point:

Now that you are done with the home page, you can use the same checking technique for the rest of your important website pages. Just change the search query, as shown below.

Google search query command:

site:php-developer.org "this is a content snippet of your other website pages"
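The query format above can also be assembled programmatically when you have many pages to check. The following is only a minimal sketch: the function names (pick_snippet, build_site_query, search_url) are hypothetical helpers, and the search URL assumes Google's usual https://www.google.com/search?q= form.

```python
from urllib.parse import quote_plus


def pick_snippet(text: str, start: int = 0, length: int = 20) -> str:
    """Return `length` consecutive words from the page text, beginning at word `start`."""
    words = text.split()
    return " ".join(words[start:start + length])


def build_site_query(domain: str, snippet: str) -> str:
    """Build a query of the form: site:domain "snippet".

    A leading "www." is dropped so that sub-domains are searched as well.
    """
    domain = domain.strip().lower()
    if domain.startswith("www."):
        domain = domain[4:]
    # Remove surrounding straight or curly quotes before re-quoting the snippet.
    snippet = snippet.strip().strip('"“”')
    return f'site:{domain} "{snippet}"'


def search_url(query: str) -> str:
    """URL-encode the query so it can be opened in a browser (assumed URL form)."""
    return "https://www.google.com/search?q=" + quote_plus(query)
```

Calling build_site_query("www.php-developer.org", snippet) with the example snippet from earlier produces the same query shown in Step 3.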

Here are some additional tips:

1. Most duplicate content pages use the same title tag. So you can also do a search query like this:
allintitle:  "This is the duplicate title tag you found" site:examplewebsite.org

Google returns all indexed pages that contain a duplicate title tag, which also counts as duplicate content. A good practice is to check the potential duplicate content of your home page — between / and /index.php, non-www version, and so forth. The search query example will be:

allintitle:  "This is your homepage title tag" site:yourexampledomain.org

Just replace the above home page title tag and domain with those for your own website.

2. If you have an ecommerce website, or any website that appends a session ID in the URL, you can also check to see if this session ID is causing duplicate content in your website by using the query below:

allinurl:  "session_id_used_by_your_website" site:example.org

Example:

allinurl:  "osCsid" site:bikefriday.com

The above reveals that there are indexed session IDs (osCsid) in this osCommerce-powered store: http://www.google.com/#hl=en&source=hp&q=allinurl%3A++%22osCsid%22+site%3Abikefriday.com+&btnG=Google+Search&aq=f&aqi=&aql=&oq=&gs_rfai=&fp=8631cdd35a4d476d

If you need details on preventing duplicate content because of session IDs, you can read this tutorial: http://www.seochat.com/c/a/Search-Engine-Optimization-Help/Preventing-Duplicate-Content-on-an-ECommerce-Site-from-Session-IDs/

The first method is only effective if Google has actually indexed the problematic URLs. If the duplicate content URLs are not yet indexed, you cannot detect them this way.

This is where you will use Xenu Sleuth, which you can download here for free: http://home.snafu.de/tilman/xenulink.html You can use this software to crawl your website and search for duplicate content issues.

If you are not familiar with Xenu Sleuth, it is recommended that you read the introductory tutorials listed below.

Basic Introduction to Xenu Sleuth  

Process of Crawling in Xenu Sleuth 

The overall duplicate content detection steps you’ll need to take when using Xenu Sleuth are as follows:

Step 1. Launch Xenu Sleuth

Step 2. Go to File ==> Check URL.

Step 3. Enter the canonical home page URL. If you are using WWW version, then enter:

http://www.thisisyourwebsite.org/

Other settings:

  • Do NOT check the box “Check External Links.” This will make crawling very slow.
  • Do not do anything under “Include/Exclude.”

Screen shot:

Step 4. Click “OK” and then Xenu will start crawling all of your website’s URLs, starting from the home page URL. This can take a very long time for big websites.

Step 5. Once it is done, you will get the message “Link Sleuth finished. Do you want a report?”. Just click NO.

Step 6. It is important, however, to save the Xenu crawl file. Go to File ==> Save As ==> and then type a file name and the location in which you want to save the file. You can re-open this file without re-crawling your site’s URLs. Keep in mind, however, that if you have updated your website’s URLs, the Xenu crawl file will not reflect them; in this case, you need to re-crawl for updated information.

Step 7. You can then export the crawl data to a tab-separated file, which spreadsheet applications can open, so that you can finally use it for analysis. Go to File ==> Export to tab-separated file. Type the name of the file.

Step 8. You can now open the exported file in a spreadsheet application like MS Excel or Open Office Calc.

Step 9. The export can contain a lot of URLs, depending on the size of your website. As a rule, filter out the URLs pointing to external domains, so that all of the URLs in the “Address” column belong to your own domain.

In Column C, you can use the Spreadsheet auto-filter function; set it “Not equal” to “Skip to External.” Copy and paste the filtered result to a new worksheet:
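If you prefer a script to the spreadsheet filter, the same result can be approximated by keeping only rows whose address belongs to your domain. This is a sketch under assumptions: it filters on the hostname in the “Address” column rather than on the status text, the column name is passed as a parameter in case your export differs, and the sample data and example.org domain are made up for illustration.

```python
import csv
import io
from urllib.parse import urlsplit


def is_internal(url: str, domain: str) -> bool:
    """True if the URL's host is the domain itself or one of its sub-domains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def internal_rows(tsv_file, domain: str, address_col: str = "Address"):
    """Read a tab-separated export and keep only rows on `domain`."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row for row in reader if is_internal(row.get(address_col) or "", domain)]


# Hypothetical sample standing in for an exported crawl file.
SAMPLE = (
    "Address\tSize\tTitle\n"
    "http://www.example.org/\t5120\tHome\n"
    "http://forums.example.org/a\t2048\tForum\n"
    "http://other-site.com/\t\t\n"
)

rows = internal_rows(io.StringIO(SAMPLE), "example.org")
```

For a real export, pass an open file instead of the in-memory sample, for example open("crawl.txt", newline="", encoding="utf-8"); the file name and encoding are assumptions to adjust.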

Step 10. You can now start analyzing the entire spreadsheet data for duplicate content. It will be very easy if you follow the guidelines below.

First, the quickest way to check is to sort and arrange the title column alphabetically. If you see a lot of URLs that use the same title tags or highly similar title tags, open them in a web browser and compare their percentage of similarity using this tool: http://www.webconfs.com/similar-page-checker.php

If they are more than 70% similar, you should take some action, such as adding unique content, blocking in robots.txt, using link rel canonical tags, and so forth. We’ll go over prevention techniques later in this article. 
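To get a rough feel for the 70% rule of thumb without the online tool, you can compare the visible text of two pages locally. The sketch below uses Python's difflib on word sequences; it is not the algorithm behind the similar page checker, so its percentages will not match that tool exactly, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_percent(text_a: str, text_b: str) -> float:
    """Word-level similarity of two texts, from 0.0 to 100.0."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return 100.0 * matcher.ratio()


def needs_action(text_a: str, text_b: str, threshold: float = 70.0) -> bool:
    """Flag page pairs whose similarity exceeds the threshold."""
    return similarity_percent(text_a, text_b) > threshold
```

Feed it the extracted body text of each page rather than raw HTML, since shared templates and markup would otherwise inflate the score.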

Second, look for session IDs (under the “Address” column) as well as URLs that have exactly the same file size (which can be seen under the “Size” column). These are signs of duplicate content URLs.
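Both checks in Step 10 can be sketched in a few lines once the exported rows are loaded as dictionaries. The column names “Address”, “Title”, and “Size” are assumed to match your export. Apart from osCsid, which appears in the example earlier, the session parameter names in the pattern are common guesses that you should replace with whatever your platform uses.

```python
import re
from collections import defaultdict

# osCsid comes from the example earlier; the other names are assumptions.
SESSION_ID_PATTERN = re.compile(r"[?&;](osCsid|PHPSESSID|sid|sessionid)=", re.IGNORECASE)


def group_by(rows, column):
    """Map each value of `column` to the URLs sharing it, keeping only duplicates."""
    groups = defaultdict(list)
    for row in rows:
        value = (row.get(column) or "").strip()
        if value:
            groups[value].append(row["Address"])
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


def has_session_id(url: str) -> bool:
    """True if the URL carries one of the assumed session ID parameters."""
    return SESSION_ID_PATTERN.search(url) is not None


# Hypothetical rows standing in for a filtered crawl export.
SAMPLE_ROWS = [
    {"Address": "http://www.example.org/", "Title": "Home", "Size": "5120"},
    {"Address": "http://www.example.org/index.php", "Title": "Home", "Size": "5120"},
    {"Address": "http://www.example.org/about", "Title": "About", "Size": "3000"},
    {"Address": "http://www.example.org/cart?osCsid=abc123", "Title": "Cart", "Size": "4100"},
]

duplicate_titles = group_by(SAMPLE_ROWS, "Title")
duplicate_sizes = group_by(SAMPLE_ROWS, "Size")
session_urls = [r["Address"] for r in SAMPLE_ROWS if has_session_id(r["Address"])]
```

Identical titles or sizes are only hints, so the flagged pairs still need to be opened and compared before you act on them.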
 

There are basically two useful approaches to detecting duplicate content outside your domain. The first involves using the Copyscape service, and the other employs Google.

Using Copyscape

To check for duplicate content of your website on other domains, go to Copyscape, enter your home page URL, and hit the “Go” button.

If you see the message “No results were found for this page. Click below to try some other pages on your site:” then Copyscape did not find duplicate content for that page. However, you may try entering other important URLs from your website (on a sampling basis only) to see if other sites have duplicated your content.

If Copyscape provides some potential duplicates, you need to reconfirm those, starting with the first result, using this tool:

http://www.webconfs.com/similar-page-checker.php

The purpose is to determine whether their content is substantially duplicated. Sometimes Copyscape will report matches whose similarity is less than 50%, which is not substantial.

Using Google

You can also use Google to search for sites that duplicate content from your website. Simply search for a content snippet enclosed in double quotes, without the site: operator. Using the example snippet given earlier:

“we offer free codes with detailed explanation of how it works so that it can help you a lot in”

Then hit the Search button. If you see results that are not part of your domain, then another site has duplicated your content. The expected result is only your own domain’s URL.

Preventing Duplicate Content

You can use any of the following prevention techniques depending on your website’s capability, platform and access:

1. Robots.txt = http://www.seochat.com/c/a/Search-Engine-Optimization-Help/Blocking-Complicated-URLs-with-Robotstxt/

2. .htaccess and php related 301 redirections = http://www.webconfs.com/how-to-redirect-a-webpage.php

3. Link rel canonical tag = http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

4. Meta noindex nofollow tag = http://www.robotstxt.org/meta.html

If you need to prevent a page from being indexed, while allowing Googlebot to crawl the page’s hyperlinks, you need to use meta noindex only:

<meta name="robots" content="noindex">

5. Parameter handling in Google webmaster tools = http://googlewebmastercentral.blogspot.com/2009/10/new-parameter-handling-tool-helps-with.html

6. For duplicates external to your domain, you need to contact the webmaster for removal of duplicate content pages in his/her domain.
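After applying techniques 3 and 4 from the list above, it can be worth confirming that a page really carries the tags you intended. The following sketch parses a page's HTML with Python's standard html.parser and reports the robots meta value and the canonical URL. The class and function names are hypothetical, and fetching the HTML is left to you.

```python
from html.parser import HTMLParser


class IndexingHints(HTMLParser):
    """Collect the robots meta content and the rel=canonical href from a page."""

    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = (attrs.get("content") or "").lower()
        elif tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")


def indexing_hints(html: str):
    """Return (robots_content, canonical_href); either may be None if absent."""
    parser = IndexingHints()
    parser.feed(html)
    return parser.robots, parser.canonical
```

A duplicate page that should stay out of the index would be expected to report "noindex" as its robots value, or a canonical URL pointing at the preferred version.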
