Leakage of sensitive information has always existed, but with the advent of search engines it has become a much bigger problem. Sensitive information is now incredibly easy to retrieve, even for beginners. Many people blame the search engines for this, but in most cases the search engines are just doing their job: they index whatever they find on servers across the internet.
Sure, it is unpleasant at best to have your private data available to anyone, and there are many dangerous things malicious people can do with your personal information. It is little comfort that stealing credit card numbers through Google and the other search engines is not (yet) the primary way of getting them; direct attacks on the target databases still are. But once stolen data is posted on a web server that search engines index, it becomes easily available to everyone, and if it is not removed, it will stay in the search engines’ indices indefinitely.
I am afraid that most often it is not Google’s fault: Google’s spider will not break into an office and index files on a computer that is not connected to the internet or that is properly isolated from the web. Google’s task is only to index files on web servers, and the fact that it finds so much sensitive information means only one thing. Those who are responsible for keeping this information away from Google’s crawlers have not done their job.
Aside from stolen databases of credit card numbers and social security numbers, it is possible to perform simple queries on Google and get juicy info about people and organizations of interest to you. There are also combinations of search keywords and commands that reveal other types of sensitive information: passwords, your company’s schedule, your customers’ shopping details, and more. For instance, a query like “inurl:password” will quickly turn up URLs that contain a password string (most likely unencrypted), sometimes alongside the user name itself. When optimizing pages you probably do not have such keywords in mind; as a SEO expert you know that URLs are important for success with search engines, but you would hardly be happy to top a listing like that, right?
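You can run the same kind of keyword audit over your own URLs before anyone else does. Below is a minimal sketch in Python; the keyword list and the sample URLs are illustrative assumptions, not an exhaustive set:

```python
# Flag URLs that contain strings an attacker would search for -
# the same patterns a query like inurl:password would match.
SENSITIVE_KEYWORDS = ("password", "passwd", "pwd", "secret")

def flag_sensitive_urls(urls):
    """Return the URLs whose path or query contains a sensitive keyword."""
    flagged = []
    for url in urls:
        lowered = url.lower()
        if any(keyword in lowered for keyword in SENSITIVE_KEYWORDS):
            flagged.append(url)
    return flagged

urls = [
    "http://example.com/products.html",
    "http://example.com/admin/password.txt",
    "http://example.com/login?pwd=hunter2",
]
print(flag_sensitive_urls(urls))
# ['http://example.com/admin/password.txt', 'http://example.com/login?pwd=hunter2']
```

A real audit would of course also scan page contents, not just URLs, but even this trivial check catches the most embarrassing cases.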
If you are interested in more details about what can be found on an unprotected site, a good place to start is the web site of Johnny Long, the author of several books on Google hacking. His site is http://johnny.ihackstuff.com, and an especially informative section is the Google Hacking Database.
Even if security is not your focus, some precautions in dealing with people’s private data are always necessary. Even if Google itself does not index this information, or you manage to remove it from the index, there are enough hackers in the world who will take advantage of somebody else’s negligence and ignorance. It does not hurt to make some effort to secure your site and, above all, to make sure that the security actually works.

How Can I Prevent Sensitive Data From Appearing on Google?
On one hand, the measures to prevent sensitive data from appearing on Google depend on what kind of site you have and what applications are running on it. On the other, some measures are universal, no matter what kind of site it is.

Do Not Put Things on the Web You Wouldn’t Like Your Competitors To See
This is a really simple tactic, and it is flawless: if you do not put sensitive data on a server that is connected to the internet, there is no way for Google to find and index it. Your web server and online database are not hiding places for your sensitive data! Keeping the two apart might be difficult if you have only one computer or if your server is connected to an intranet, but the risks of exposing sensitive information are reason enough to buy a second computer or to physically separate your web server from the intranet. One piece of sensitive data that you should never put on an internet-connected server is a file where you store passwords and user names; as the previous section shows, this is a costly, yet common, mistake.

Configure the robots.txt File
A properly configured robots.txt file is a very important tool for protecting your server’s files and directories from being indexed by Google. Google respects the instructions in the robots.txt file, which cannot be said for some of the other search engines. There is a controversial side to the robots.txt file: by listing the files and directories you do not want indexed and putting this listing on your web server, you place information about what you do not want people to see right in the hands of hackers. Still, it is much better to have a properly configured robots.txt file that tells the search engines what to exclude than to have no such file and open the information up to anyone who can search for it.
Tip: Traditionally, the robots.txt file is configured to exclude files or directories, while bots that harvest e-mail addresses are forgotten. Since e-mail addresses are private information and harvested addresses are most often used for spam, you may want to include an instruction that tells mail-harvesting robots they are not welcome.
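Following that tip, a minimal robots.txt might look like this; the directory names and the harvester user-agent are illustrative examples, not a complete list:

```
# Keep crawlers out of directories that should never be indexed
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /cgi-bin/

# One known mail-harvesting bot; well-behaved crawlers honor this,
# but many harvesters simply ignore robots.txt
User-agent: EmailCollector
Disallow: /
```

Remember that this file is itself public, so do not let it become a directory of your secrets; always combine it with real access controls.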
There are many aspects to consider when configuring the robots.txt file, and a detailed explanation of how to do it is outside the scope of this article. Two good places to find more information about the robots protocol, the robots.txt file, and everything around them are http://www.robotstxt.org/wc/robots.html and http://www.google.com/remove.html.

Remove Content From Google’s Index
Although it is often no use to cry over spilled milk, contacting Google and following their instructions does help. Most of the necessary tasks are performed via the robots.txt file and http://www.google.com/remove.html. That is the place to see what you can exclude from their index: entire sites, parts of sites, snippets, cached pages, dead links, and images.

Secure Your Public Servers and Operating System
If your public servers and the operating systems they run on are not secured, the robots.txt file is of little help. Google will keep to the instructions in the robots.txt file and will not index the specified files, but there are other crawlers on the web that will happily take advantage of whatever they find on your machine, regardless of what the robots.txt file tells them. It is therefore a wise choice to use password-protected folders for especially sensitive data. This definitely helps preserve privacy, no matter whether Google or another search engine is involved as an intermediary or someone goes directly to your computer to hunt for stuff.
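On Apache, for example, a password-protected folder takes only a few lines in an .htaccess file; the paths and names below are placeholders for your own setup:

```
# .htaccess in the directory to protect
AuthType Basic
AuthName "Restricted area"
# Password file created with: htpasswd -c /var/private/.htpasswd username
# Keep it outside the web root so it can never be served or indexed
AuthUserFile /var/private/.htpasswd
Require valid-user
```

Other web servers have equivalent mechanisms; the point is that the protection happens on the server itself, not in robots.txt.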
If your site has a database as a backend, you need to protect it as well. There are many techniques that exploit vulnerabilities in databases and use SQL injection to get access to sensitive data. The exact measures and steps vary depending on which database you are running, but in any case applying the latest patches is a must. You may also want to ask your web developer whether a second database, one that is not accessible from the net, could hold the sensitive data so that authorized individuals can retrieve it when necessary. Again, your web developer is the guy or girl to ask about hiding columns with sensitive data to exclude them from possible searches.

Whose Job Is It?
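To show how small some of these fixes are, here is a minimal sketch of the SQL injection problem from the previous section and the parameterized-query fix. It uses Python’s built-in sqlite3 module; the table and data are invented purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, card_number TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', '4111-1111-1111-1111')")

user_input = "nobody' OR '1'='1"  # a classic injection payload

# VULNERABLE: string concatenation lets the payload rewrite the query,
# so the OR '1'='1' clause matches every row in the table
rows = conn.execute(
    "SELECT card_number FROM users WHERE name = '" + user_input + "'"
).fetchall()
print(len(rows))  # 1 - the attacker got the card number

# SAFE: a parameterized query treats the payload as a literal string
rows = conn.execute(
    "SELECT card_number FROM users WHERE name = ?", (user_input,)
).fetchall()
print(len(rows))  # 0 - no user is literally named "nobody' OR '1'='1"
```

The safe version is no longer or harder to write than the vulnerable one, which is exactly why there is little excuse for skipping it.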
At first sight, most of the tasks needed to secure a site from disclosing sensitive information look like a job for the system administrator rather than for the web marketer or the SEO expert. While it is true that they require some knowledge and skills in system administration, most of these tasks are not that difficult and can be performed by the SEO expert, alone or together with the system administrator. And if, as happens very often, you are both webmaster and SEO expert, or are optimizing your own site, then the exclamation “But it is not my job!” becomes absolutely pointless.
Many of the techniques used to check the security of sites (or to take advantage of any security oversight) are often called “Google hacking.” They can be used both by potential hackers and by you, and needless to say, it is much better that you use them first to discover any potential holes than that the hackers come first. What is more, the measures needed to secure your web server and the pages on it are very often neither difficult nor time-consuming, especially if you use automated tools to do the checks. There are several tools for performing automated Google hacking tests, among them SiteDigger, Gooscan, WebInspect, and AppDetective; you may want to try several of them to see if your site is vulnerable.
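Even without those tools, a few lines of script can start such an audit. The sketch below, a deliberately simplistic parser written for illustration, extracts the Disallow paths from a robots.txt body — exactly the paths an attacker will probe first — so you can verify that each one is guarded by real access controls:

```python
def disallowed_paths(robots_txt):
    """Extract the paths marked Disallow in a robots.txt body.

    These are the first places to verify are genuinely protected
    (authentication, permissions), not merely hidden from crawlers.
    """
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

sample = """
User-agent: *
Disallow: /admin/
Disallow: /backup/   # old database dumps
"""
print(disallowed_paths(sample))  # ['/admin/', '/backup/']
```

Feeding each extracted path to a script that requests it and checks for an authentication challenge is a natural next step for an automated self-check.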