Handling Duplicate Content

Duplicate content is identical or nearly identical content found on the same website or on other websites. Ideally, Google wants to feature only one version of a given piece of content; it usually selects the oldest, most authoritative domain and drops the less authoritative copies from the results. Duplicate content can also hurt or prevent rankings for the original content, and it can hurt sales. In this article we discuss duplicate content issues in detail.

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin; but when content is deliberately duplicated in an attempt to manipulate rankings or win more traffic, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in Google’s search results.

Affiliate Sites

Google can spot affiliate footprints, such as Amazon and eBay feed and tracking code. Once spotted, Google filters those affiliate websites out of the search results, leaving the most authoritative domain. The goal is to show users only one version of a piece of content in the search results, so one website is enough.

Affiliate sites may also promote products without the feed, but with an affiliate link. Search engines are also aware of those links and may filter out websites with low trust scores.

Online Merchants

Online retailers are hit primarily because their descriptions are:

  • 100% scraped from manufacturers, and identical to dozens of competitors’ pages.

  • Too similar, or too short, with little difference from page to page.

If you run a retail site, you have to invest time or money in creating descriptions that are different from the ones provided by the manufacturer(s) and from competing websites that use the same manufacturer descriptions. About 150 – 300 words of original content is usually enough to differentiate yourself in the eyes of search robots.

Andy Jackson of StomperNet did a very detailed video on duplicate content, published in three parts.

Let’s discuss what he taught in his video.

Merchant sites primarily run into two problems. The first is that the content is a perfect copy of the manufacturer’s descriptions. This one is easy to fix: create your own content for each listing. This will cost you either money or a lot of time, but if you want to rank well in the SERPs, that’s the price.

The next problem is the similarity of internal pages to one another. For example, one page is selling green widgets and another is selling yellow widgets. The only difference between those pages is the words "green" and "yellow" and the pictures of the widgets. To search engines, those pages look almost identical, so they will drop them from their index.

This again must be fixed with original content. Though you must keep the manufacturer’s feature descriptions on the page, you can add a separate description to each widget to differentiate it from other pages in the eyes of search engines.
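To see for yourself how little separates such pages, here is a minimal sketch using Python’s difflib; the widget descriptions are made-up placeholder text.

  from difflib import SequenceMatcher

  # Two product pages that differ only in the color word (placeholder text).
  green_page = ("Acme Widget, 6-inch, stainless steel body, rubber grip, "
                "dishwasher safe. Available in green. Ships in 2 business days.")
  yellow_page = ("Acme Widget, 6-inch, stainless steel body, rubber grip, "
                 "dishwasher safe. Available in yellow. Ships in 2 business days.")

  # ratio() returns a similarity score between 0 and 1.
  similarity = SequenceMatcher(None, green_page, yellow_page).ratio()
  print(f"Similarity: {similarity:.2%}")  # prints roughly 96%

This is not how search engines actually measure similarity, but it illustrates the point: one changed word leaves the two pages nearly identical.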

Your website has:

  • Navigation menus.

  • Footers (stuff on the bottom).

  • Headers (stuff on the top, like the logo).

  • Right / left panels (or both).

  • Content section.

Those elements stay consistent throughout the website, except the content section. For example, if you navigate seochat.com, the navigation menu on the left will stay the same. The header on top with the SEO Chat logo will stay intact as well; so will the footer. The only thing that changes is the content, which is located in the middle.

We as users and webmasters treat this as normal. That’s just the way things are on the web. Navigation, footers and headers stay the same, while content changes from page to page. This is the correct approach to websites, but that’s not how search engines view and judge pages.

Search engine spiders can only see code. They don’t see the sites as we do. For search engines, the navigation, footer and header look identical on different pages. They look like duplicate "content." For example, if the Google bot navigates SEO Chat from page to page, it will see completely duplicate header code, navigation code and footer code on all of its pages. The only difference Google bot will see is the difference in content.

This is what search engine spiders are programmed to detect – differences in content. Search engineers realize that duplication of the navigation, footer and header is part of web design and web standards, so they program robots to look at those duplications in code as normal, and extract original content from pages.

That’s where the problem comes in with merchants. Imagine you’re selling one red knife and one blue knife. The descriptions have only a few words, and the only difference between the pages is the picture. To search engine spiders, the pages look almost 100 percent identical!

But that’s not the only problem. When differentiating pages, we also have to keep in mind the content-to-template word ratio. You can calculate this ratio using seochat.com as an example:

  • Open a Word document and copy SEO Chat’s left side navigation, top header, and footer.

  • Use the word count feature to see how many words they contain.

I calculated 217, but you might have something a little different.

Knowing that the footer, header and navigation contain 217 words, you now know how much unique content you need for EACH product page – 217 or more words. Each page on your site should contain at least as many words as the navigation, footer and header combined to appear different to search engine spiders! [In fact, on SEO Chat, an article page typically contains between 300 and 500 words of original content. --Ed.]
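If you would rather script this check than paste into Word, here is a rough sketch in Python using the requests and BeautifulSoup libraries. The URL and the CSS selectors are placeholders; substitute whatever matches your own template and content area.

  import requests
  from bs4 import BeautifulSoup

  # Placeholder URL and selectors -- adjust these to your own site's markup.
  PAGE_URL = "https://www.example.com/products/green-widget"
  TEMPLATE_SELECTOR = "nav, header, footer"        # boilerplate shared by every page
  CONTENT_SELECTOR = "div.product-description"     # the part unique to this page

  def count_words(elements):
      return sum(len(el.get_text(separator=" ").split()) for el in elements)

  soup = BeautifulSoup(requests.get(PAGE_URL, timeout=10).text, "html.parser")
  template_words = count_words(soup.select(TEMPLATE_SELECTOR))
  content_words = count_words(soup.select(CONTENT_SELECTOR))

  print(f"Template words: {template_words}")
  print(f"Content words:  {content_words}")
  if content_words < template_words:
      print("Unique content is shorter than the template -- add more original copy.")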

Your content may be stolen or scraped by spammers, especially as your site gets bigger. Be sure to include links within your content to your other pages; when the content is scraped, those links usually come along with it, so you still get credit and backlinks.

Spotting Stolen Content

To find out whether your content has been stolen, use Copyscape; it costs $0.05 per search and is in partnership with Google. What can you do when you spot a stolen article? In reality, not much. You can report the content theft to Google and wait for action, or contact the webmaster directly. If the article sits on a spam farm wrapped around AdSense, don’t expect a response.

If the site is decent, ask them to remove the article, with a mild threat of a lawsuit or exclusion from search results. Though going to court because of a few articles is not worth it, it’s a good scare tactic.

Printer Pages

A regular page and its printer-friendly version are considered duplicate content. Though search engines have gotten smarter and can often spot the difference, it’s better to block printer versions in robots.txt:

User-agent: *
Disallow: /page1/printer.html
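Listing every printer page by hand gets tedious on a large site. If your printer versions follow a predictable URL pattern (an assumption about your URL structure), Googlebot also understands wildcards in robots.txt, so a single rule can cover them all:

  User-agent: *
  Disallow: /*/printer.html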

User Feedback and User Reviews

It’s very expensive and time-consuming to write unique descriptions for thousands of products and keep up a flow of descriptions for new items.

You can use user feedback and reviews as unique content to avoid writing descriptions for all items. On the down side, you have to implement separate technology, which costs money, and once implemented, not all items will have reviews. Usually only the most popular items will receive reviews. It’s also a challenge to entice people to review something.

Another big duplicate content issue is the huge number of content variations based on user preference. For example, some users may filter content by topic, color, date, reviews, price, etc. Essentially the content stays the same while display results get shuffled, producing an infinite number of variations that look identical to search engine spiders.

The solution is to block the variations from search engine spiders and show them only one version of the content. For example, if your site features products, you can allow search spiders to see the default layout (sorted alphabetically, by price, or whatever you choose), but block them from following the links that filter results by other criteria. This way there are no duplicates, while users can still sort the information.
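For example, if your sorting and filtering links add query parameters to the URL (the parameter names below are assumptions; use whatever your platform actually generates), you can block the filtered variations in robots.txt while leaving the default listing crawlable:

  User-agent: *
  Disallow: /*?sort=
  Disallow: /*&sort=
  Disallow: /*?color=
  Disallow: /*&color=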

This also applies to blogs, where the same post often shows up on the home page, in category pages, and in date archives. Try putting content in only one section of the website instead of several, and block search engine spiders from accessing the archives by date.

Session IDs

Session IDs are evil.

Websites that use session IDs hand spiders a unique ID appended to every URL. As the spiders follow those URLs, each crawl produces a fresh set of "unique" URLs that all lead to exactly the same content. Search engine spiders usually ignore session IDs because of the duplication they produce and because there is no static URL to feature in the search results.

The solution is to change how sessions are handled or disable session IDs in URLs altogether.
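How you do that depends on your platform. On a PHP site, for example, two php.ini settings keep session IDs in cookies and out of URLs entirely; this is a sketch assuming PHP sessions, and other platforms have equivalent options:

  ; php.ini -- store session IDs in cookies only, never in URLs
  session.use_only_cookies = 1
  session.use_trans_sid = 0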

Quotes, Similar Pages

Quotes or a few copied paragraphs are okay. Internal pages that contain unique content alongside several duplicated paragraphs are okay as well.

Google is much more forgiving of duplicate content issues on authoritative domains. In fact, old sites can get away with much more of this than new ones.

Summary

Duplicate content hurts an entire site’s performance in the search results. Though only a single page may be filtered out from the results, the site’s overall trust score may go down with it.

Duplicate content issues are especially important if you’re a merchant selling items that are similar to those sold by other merchants. Keep your unique content at least as long as your template text, get rid of session IDs, and block access to sorting features so you feed search engine spiders better content. The better you feed the spiders, the more they like you.
