Make Money Without Doing Evil – A Lesson in Content Scraping

Google regularly clears out scraper sites and directories built for the sole purpose of generating AdSense dollars. While doing so, Google has also knocked a few legitimate websites out of its index. Penalties aimed at the few who abuse the rules often hurt those who were behaving well, and the results are not pretty.

This penalty has its roots in duplicate content and in attempts to manipulate search engines with scripts that regenerate other people’s content into supposedly new pages. To Google, duplicate content is not a good thing. It is not good for the search engines. It is not good for the hosting resources of the various search engines. Most importantly, it is not good for users. As I am sure each one of us reading this article can agree, when we do a search we do not want 10 exact copies of one page that matches our search query.

Now, a great many will argue that Google could not possibly catch duplicate content that easily, and that trying to do so would strain its resources, but I have some news for you. I can assure you that Google does eliminate duplicated content from its general index very easily. Not only can it filter the content out, it can also choose to leave certain duplicate content in the index. This is actually a very important issue, one where Brin had the foresight to see problems coming and had the algorithm built to weed them out before they ever became a major concern.

There are duplicate pages on the Internet, and there always will be, because news sources gather information from the same feeds. In a patent related to “Detecting duplicate and near-duplicate files,” filed in January 2001, Google describes an invention to detect duplicate content. The patent explains how the search engine works to weed out duplicate content, as well as how it decides which copies to filter out of its general index.

{mospagebreak title=Google Knows When You’ve Been Naughty}

There are two ways that duplicate content is handled.
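Before getting to those two cases, it may help to see roughly how copied text can be spotted at all. The sketch below uses word "shingles" (short overlapping runs of words) compared with Jaccard similarity, a common textbook technique for near-duplicate detection. It is only an illustration under my own assumptions and is not a description of what the patent or Google's index actually does. The function names, the shingle size of 3 and the 0.5 threshold are all hypothetical choices made for the example.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word runs ("shingles") of length `size`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5, size=3):
    """Flag two texts as near-duplicates (threshold is an illustrative guess)."""
    return jaccard(shingles(text_a, size), shingles(text_b, size)) >= threshold


# Made-up sample sentences, purely for demonstration.
original = "Google regularly clears out scraper sites built to generate ad revenue."
copied = "Google regularly clears out scraper sites built to generate ad income."
unrelated = "Our bakery sells fresh sourdough bread every morning before nine."

if __name__ == "__main__":
    print(is_near_duplicate(original, copied))
    print(is_near_duplicate(original, unrelated))
```

Swapping one word leaves most shingles intact, so the copied sentence still scores as a near-duplicate, while the unrelated sentence shares no shingles with the original. A real search engine would need something far more scalable than pairwise comparison, but the idea of matching small pieces of text rather than whole pages is the same.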
Duplicate news stories are expected. If you look at the newspapers where you live, they all share some of the same syndicated content. This translates to the online world, and it is handled through document scoring: the site that originally posted the news is given original-content weight. The other copies do not have a score applied to them, which keeps them from ranking well.

If you are a commercially focused organization and try to use duplicate content to build multiple websites, the site that published the content first will get the credit. All subsequent copies will not be scored and will be filtered to the very bottom of the results listings. By the way, the bot works as far down as sentence structures. So if you think there is a way to circumvent the bot, you are wrong. Google makes sure they hire only the brightest minds.

You can read the patent by following this link: Google’s Duplicate Content Protection. Now, it has been said that people who spend their time reading patent filings are without a life and desperately need to get one. Myself, I have a life, one which happens to involve reading patent filings.

Some very smart and ambitious people had read about the value of unique content in helping to attain front page rankings in the search engines. They decided to boost their rankings with an easy cheat built on this concept. They built their own search spiders to go out across the World Wide Web and scrape pieces of content from a variety of websites related to a chosen keyword term. The scraper bots, which scrape articles, blogs, specialty sites and any site with a niche, return “unique” content to the user, which you could then supposedly use safely on your website to improve your site’s position in the search engines.

These software programs come under a variety of names, such as niche bot, article bot, and others. I purchased one myself when they first hit the market earlier this year. I paid for a subscription and gave the software a workout.
It had a fairly well done graphical user interface, and a friendly tutorial made it easy enough to get working. I worked my way through entering my chosen keyword term and let the bot loose.

{mospagebreak title=The Day I Tried to Scrape}

What I got back left me a little disheartened. Where I thought the document would read well and have a great flow, it instead went from one paragraph talking about a topic related to my keyword term to the next paragraph, which jumped to a different topic related to my keyword term. This went on through each of the paragraphs. Okay, I said, let me try again. I ran the software a second time with a completely unrelated keyword term and let it fly. Again I was presented with garbled paragraphs of text.

To say I was disappointed is an understatement. I mean, who couldn’t use a good program to write original content? Then I realized something besides the fact that I had been ripped off. If there were a program that could write original content, websites like SEOChat (which pay me to write) would no longer need me to write.

I will say that, since I was not happy, I applied to end the subscription and was removed promptly, and no additional charges were ever attempted on my credit card. I have not been pestered by the company to try the software again, nor have I received any other solicitations.

Now, one thing about me that probably makes me less of a blogger than others in my profession is that I will not review junk software on my blog. I do not feel that it deserves any advertising from me. My readers should not be sent to a junk site from my site, and I see no reason to build a link to such a site. Plus, you may have to put up with lawyers, and the less one needs to deal with them the better.
Soon the masses started finding out about the supposed ability of these scraper programs to provide original content for their websites, and they bought into the subscription services and standalone software programs offering this type of content. They started flooding their sites with what they thought was original content. Once some of these sites started to bump other websites out of their rankings, people began to investigate and more than likely started filing spam complaints with Google, thereby setting in motion another chance for the Google engineers to have their fun.

You see, for each new trick a webmaster finds that allows them to cheat the search engines, there is an even more determined Google engineer who wants to stop the brilliant webmaster, and all the brilliant webmaster’s friends, by writing a bit of code into the algorithm. A hand ban, which could be done easily, doesn’t allow Google to test its expertise nearly as much as using the algorithm does. By writing code into the algorithm to eliminate such problems, the engineers also learn what will work in certain situations.

{mospagebreak title=Google’s Algorithm Solution}

After these changes were made to the algorithm, a few websites were mistakenly dropped from Google’s index. In many forums, there were some very angry webmasters whose sites had been dropped inadvertently. Many thought Google was out to destroy their website before they had been able to ascertain the real reason they were dropped.

Google dropping websites from its index is nothing new, as has been noted several times in the past. It is apt to occur again in the future, as the reasons for such occurrences are many. The great thing about new technology is the unknown: some problems can be prevented and others cannot. That is the nature of the search engines, and like most things in life, it is a double-edged sword.
When a webmaster has spent years of their life working on a website that is a passion and an extension of themselves, it is a shock (to say the least) when it goes from drawing nice traffic through Google’s free results index to being suddenly dropped from the index. It is understandable that they would look to the forums for help, but they must also understand that in the forums it can be hard to tell speculation from fact-based advice. Being able to pull out the information needed to make the right changes takes a very sharp mind.

The best thing to do when your website has been penalized or dropped from Google’s index is to read the information at this URL: Google Advice. After reading this information and making any needed changes, get the attention of Google as soon as possible to resolve your situation: right here.

If you did use any of the content scraping software and want to be reincluded in Google’s index, you will need to rewrite the content and then write to Google, telling them that you made a mistake, that you will not do so again, and that you will abide by their guidelines. If the site has been cleaned up, you should receive a letter from Google stating that the problem has been brought to their engineers’ attention. From there, in the cases I have personally seen, it can take as little as two weeks or as much as two to three months.

Remember, as Google says, you can make money without doing evil.