Duplicate Content Penalties: Problems with Google’s Filter

Google’s duplicate content filter is notorious. Countless articles have been written about it, forum threads dissect it, and speculation runs rampant. What seems to be lacking, however, is any discussion of whether other search engines apply duplicate content filters to their results the way Google so famously does. Why do you suppose that is?

Perhaps it’s because Google has it all wrong, at least in the eyes of those being penalized for duplicate content. Often it is the penalized site that created the original content, yet that site is nowhere to be found in the SERPs. To those webmasters it feels as if they invented a new product, and because of that product’s popularity, duplicates popped up everywhere. Flattering at first, but in the end it is a competitor who copied the idea that gets to patent it as their own. To add insult to injury, the original creator gets fined for claiming the invention was theirs. It makes them want to scream, “Liar, liar, pants on fire…” Okay, I’m being a bit facetious, but it certainly feels unfair. Is it cheating? Yes, in a way, it is. But this is how the game is played, at least for the moment, with Google.

While it does seem that most search engines apply some sort of filter, it appears that only Yahoo and MSN have the technology to work out where the content originated. And because their filters work the way they should, you don’t hear anyone complaining about them. People generally don’t complain about something that works right; they complain, often quite loudly, about things that don’t.

The Experiment

I decided to take a closer look at this myself and run a simple experiment on the duplicate content filters of the major search engines. Honestly, even I was a bit astonished by the results.
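Before getting into the numbers, a quick note on method. The measurement itself is nothing fancier than searching each engine for the article’s exact title and noting how many results it claims to have. If you wanted to automate that check, something like the Python sketch below would do it. Treat it as an illustration only: the query URLs and the “N results” wording are my assumptions, and every engine formats (and changes) its results pages differently.

```python
import re
import urllib.parse
import urllib.request

# Placeholder query endpoints -- the real URL formats and the wording of
# each engine's result count ("about N results") are assumptions here and
# change over time, so adjust them before relying on this.
ENGINES = {
    "Google": "https://www.google.com/search?q=",
    "Yahoo": "https://search.yahoo.com/search?p=",
    "MSN": "https://search.msn.com/results.aspx?q=",
}

COUNT_RE = re.compile(r"([\d,]+)\s+results")  # assumed phrasing on the page


def result_count(base_url, query):
    """Fetch one results page and pull out the reported result count."""
    url = base_url + urllib.parse.quote('"%s"' % query)  # exact-phrase search
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="ignore")
    match = COUNT_RE.search(html)
    return int(match.group(1).replace(",", "")) if match else None


if __name__ == "__main__":
    title = "Exact Title Of The Test Article"  # hypothetical
    for engine, base_url in ENGINES.items():
        print(engine, result_count(base_url, title))
```

Run on a schedule (a weekly cron job, say), a script like this would give you the same week-by-week counts I gathered by hand.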

I published an article on my website a while back and then promoted it through article submission sites. I watched the SERPs for about six weeks and took note of the results for all three major engines: Google, Yahoo, and MSN. I then republished the article through the same submission sites and watched those results as well. Here is what I found:

Google

Initially, 14 sites featured the article, including my own. Within a few weeks, that number grew to roughly 19,000 sites, a group that still included my website, the actual origin of the content. Then, after about five or six weeks, the number of sites featuring the article fell to 46. What’s aggravating is that my site, the source of the original content, was not included anywhere in the search results for this article title.

Thinking this was simply an oversight, I submitted the article a second time. The number rose again, this time to about 1,560 sites, and then fell back to the same figure: 46. Again, my site was nowhere in the final SERPs.

So it seems that this search engine only features the results it feels are relevant, without taking into account the actual origin of the content. This is very disturbing, considering how Google prides itself on objective results.

Yahoo

A similar content filter seems to be at work here as well. But while the results of the test looked similar at first, Yahoo does take the actual origin of the content into account, and it currently lists 114 results for the article title. My website appeared first for the article title, though it was bumped occasionally during this time, as if Yahoo were trying to decide whether my site really was the original source. Eventually, it came around.

MSN

Ultimately, MSN was the only engine that impressed me completely; it seems to have the duplicate content filter right. Initially, there were about 10 results for a search on the title of the test article. Within two weeks, that number rose to 1,244, and after four weeks it jumped to 11,000 or so. By week six, a search on the article title returned just 16 results in total, roughly 86 percent fewer than Yahoo’s 114 and about 65 percent fewer than Google’s 46.

Listed first, second, and third in MSN’s results is my website: the article page itself is ranked first, the articles index page with the title and summary second, and the archives page third. With only 16 results in total, my website, the origin of the article, was treated as the most relevant. In my opinion, this is exactly how it should be.

Instead of picking and choosing which sites seem relevant, as Google appears to do, MSN and Yahoo look as if they apply the filter while taking into account the actual origin of the content.

Are People Just Picking on Google?

So why do people write, talk, and argue almost exclusively about Google’s filter? Perhaps it’s because no one can make sense of how duplicate content is determined. And in all honesty, it doesn’t make sense. Why go to all the work of weeding out duplicate content when no effort is made on Google’s part to determine what duplicate content really is, and which copy is the original? Google’s algorithms have always been baffling, but when a filter designed to get rid of duplicate content fails to keep the ORIGINAL, it goes beyond baffling; it is infuriating. After all, why should someone else get the credit for content you worked so hard to create, simply because Google likes their site better? And while I’m sure Google hasn’t purposely played favorites, run popularity contests, or hand-picked the sites it wants in its results pages, it is starting to look that way to many hardworking webmasters. Is it fair? Not really.

Along the same lines, it appears that extra weight is given to sites running Google’s AdSense, which end up positioned better in the SERPs. If that is the case, we have commercially motivated results, something Google vehemently denies in its mission statement.

There is also some speculation as to whether Google treats duplicate content from cached links similarly. In a forum thread on the subject, one poster asks, “What would happen if another search engine that had duplicate content filtering were able to spider Google’s cached links from SERPs but didn’t obey the robots.txt file? Would the Google cached copy, which is technically zero-levels-deep on a site with enormous Link Popularity, cause your version to be filtered out as the lower ranked? Just playing devil’s advocate to inspire some thinking here.” He may have hit closer to home than anyone realizes, and it certainly makes my head spin to think about.
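For what it’s worth, the scenario he describes hinges on whether a crawler bothers to honor robots.txt before fetching Google’s cache URLs. A well-behaved spider checks first; the snippet below shows that check using Python’s standard urllib.robotparser (the cache-style URL is only a placeholder). The poster’s hypothetical crawler is one that skips this step entirely.

```python
import urllib.robotparser

# A well-behaved crawler asks robots.txt for permission before fetching.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://www.google.com/robots.txt")
robots.read()

# Placeholder for a Google cached-copy URL of someone's article.
cached_copy = "https://www.google.com/search?q=cache:example.com/article.html"

print(robots.can_fetch("SomeOtherBot", cached_copy))
# Google has long disallowed /search in its robots.txt, so an obedient
# crawler gets False here and never indexes the cached copy. The poster's
# hypothetical engine ignores that answer, indexes the cache anyway, and
# suddenly the "duplicate" lives on a domain with enormous link popularity.
```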

The New Spam

Google is still extremely vulnerable, whether it likes to admit it or not, to arbitrary influence from spam. It’s these spamming techniques that prompted the creation of these filters in the first place. But has a line been crossed? Most certainly. It is bad enough that advocates of white-hat techniques follow the rules while ruthless spammers follow none; when the spammers still end up winning, it’s beyond frustrating. In this case, however, it isn’t even about spammers and their intentions. If you have better PageRank than John Doe, then John Doe’s content will look better on your site than it does on his. Is Google’s notion of relevancy so dominated by link popularity that it simply cannot determine which content is original and which is the duplicate? The problem seems fairly simple to me.
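How simple? The textbook approach is to break each page into overlapping word “shingles,” measure how much two pages’ shingle sets overlap, and, when the overlap says “same content,” keep the copy the crawler saw first. The sketch below is my own rough illustration of that idea, not a description of any engine’s actual algorithm; the URLs and crawl dates are invented for the example.

```python
from datetime import datetime


def shingles(text, size=4):
    """Break text into overlapping word shingles (n-grams)."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Overlap between two shingle sets: 1.0 means identical text."""
    return len(a & b) / len(a | b) if a | b else 0.0


def pick_originals(pages, threshold=0.8):
    """Keep only the earliest-crawled copy of each near-duplicate cluster.

    `pages` is a list of (url, first_crawled, text) tuples. The crawl date
    is the key assumption: whichever copy the crawler saw first is treated
    as the source, and anything similar enough is filtered as a duplicate.
    """
    kept = []  # (url, shingle set) of pages accepted so far
    for url, crawled, text in sorted(pages, key=lambda p: p[1]):
        sig = shingles(text)
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((url, sig))
    return [url for url, _ in kept]


pages = [
    ("http://scraper.example/post123", datetime(2006, 3, 10), "my article text here word for word"),
    ("http://mysite.example/article", datetime(2006, 2, 1), "my article text here word for word"),
]
print(pick_originals(pages))  # only the earlier-crawled copy survives
```

The point is not that this toy code scales to the whole web; it is that “which copy did we see first?” is a question an index is perfectly capable of answering.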

302 hijacks, Google-bombing, and Google-washing are relatively new terms, some of which you may never have heard, but the concepts behind them are similar. They are techniques that black hats, whether scrapers or old-fashioned spammers, use to influence a search engine’s results by attacking the competition, deliberately or not; I will guarantee you that the spammers do it intentionally. Instead of trying to win by doing the right thing, they sabotage the results against the competition, even if only for a few weeks. Google denied early on that someone else could hurt your rankings, claiming it was completely impossible. We now know this has changed, and Google’s Webmaster FAQ pages have been updated to say: “There’s almost nothing a competitor can do to harm your ranking or have your site removed from our index.”

Many terms have been coined for this kind of behavior that exploits the duplicate content filters, probably because “duplicate content filter penalty” is a mouthful. Google-washing, Google-bombing, dupe-wash, source-wash and the rest all mean basically the same thing: the original source material gets washed out by all the duplicate content across the web. You post your article, or it gets picked up by article distributors, then countless blogs and scrapers repost your original content, and the source material gets wiped out of the SERPs.

302 hijacking is a completely different animal. It refers to another site getting its URL listed for your page, even though clicking the result takes the visitor to your domain. Why would they do this, you ask? Traffic. They don’t care whether it’s qualified or not.
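To see why this even works, remember that a 302 is just an HTTP status code meaning “temporarily moved, look over there.” The toy server below (standard-library Python, with an invented victim URL) answers every request with exactly such a redirect; that is the entire trick. The harm comes when a crawler keeps crediting the redirecting URL, rather than the destination, with the content it finds.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical target -- the page whose content the hijacker wants credit for.
TARGET = "http://victim-site.example/original-article.html"


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every request with a temporary (302) redirect to the target.
        # A browser follows it silently; a naive crawler may keep listing
        # this URL for the content it actually finds at the destination.
        self.send_response(302)
        self.send_header("Location", TARGET)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), RedirectHandler).serve_forever()
```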

Recently, this has happened to even one of Google’s own: Matt Cutts, known to many as the “GoogleGuy.” While what happened to Matt’s blog is not really “hijacking,” it bears similarities. The source of a story, in this case Matt’s blog, was “washed” out of the results pages by duplicates on other sites; it is part of a noticeable problem in the way search engines handle duplicate content and the trouble they have determining the original. Google in particular seems very prone to this, and as I have pointed out, no one is exempt. As one forum member put it:

“What you are seeing…is stealing contents in day light and Google isn’t able, as usual, to differentiate between the original contents and the duplicates. And as many fellow members, which sites either dropped totally from the index or just lost much of their rankings because of the same problem, have reported on this thread. It is a real disaster that both Google and the webmasters community are [facing].”

So where do I direct my criticism? Will it be enough for me to know that people are getting fed up with the increasingly irrelevant results in Google’s SERPs, and to hope they simply switch to Yahoo or MSN instead?

How do I ensure that my original content is what will be listed in the SERPs after six weeks? That’s a good question, and one I can’t answer at this time. The only advice I have is to keep submitting the content and hope that Google catches on. That advice feels lame, however. After all, what comfort or solace do the words “better luck next time” actually bring, especially when you know the competition has cheated?

Great, original content has always been one of the key things you are encouraged to have on your website. But we have seen a huge shift in the weight given to content, especially where Google is concerned, in favor of link popularity. That has hardly been a secret since Google first appeared on the search engine scene (in 1996, as a research project called “BackRub”), but some of us feel it has now gotten out of hand. Sites with far more link superiority get credit for content that may not even be their own, while sites that opt for content over link popularity get punished for duplicate content. MSN seems to be the first search engine to handle duplicate content accurately, with Yahoo coming in close behind. Google is still out in left field.

So how do you keep all three search engines happy while getting your original content ranked high in the SERPs? In short: concentrate on your link popularity, and keep that content coming. You could write Google letters asking it to remedy the situation, but who knows whether that would help. Given how search engines change, I’d imagine that enough uproar from the community will eventually get Google’s attention.
