Is Google Using Entropy To Combat Spam and Rank Documents? Answer: Probably.
Buried away in a Google patent application from 2006 entitled “DOCUMENT SCORING BASED ON DOCUMENT INCEPTION DATE,” there is a somewhat obscure reference to using the “entropy” of a document. “Entropy” in this sense is not quite as it’s defined in the field of physics, where your daughter’s room tends towards a maximum state of disorganization; instead, it refers to its definition in the field of Information Theory, which applies the concept to information rather than atoms.
Wikipedia has a lengthy entry on this, but you can think of Shannon entropy as essentially measuring how much information is in a document.
If you have a 20,000-word document that simply consists of “all work and no play makes Jack a dull boy” repeated 2,000 times, that document really doesn’t have a lot of information in it. In fact, it can be represented by “repeat ‘all work and no play makes Jack a dull boy’ 2,000 times,” so it really only has as much information as a 13-word document in which all the words are different.
If you have a 20,000 word document and every word is different, that document probably has a lot of information in it.
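To make this concrete, here is a minimal sketch that measures the Shannon entropy of a document’s unigram word distribution (it ignores word order, which is a simplification; the two sample documents are my own illustrations):

```python
import math
from collections import Counter

def shannon_entropy(words):
    """Shannon entropy, in bits per word, of a document's word distribution."""
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A 20,000-word document: the same 10 words repeated 2,000 times.
repetitive = "all work and no play makes Jack a dull boy".split() * 2000

# A 20,000-word document in which every word is different.
diverse = ["word%d" % i for i in range(20000)]

print(shannon_entropy(repetitive))  # ~3.32 bits/word (only 10 distinct words)
print(shannon_entropy(diverse))     # ~14.29 bits/word (20,000 distinct words)
```

Even though both documents are the same length, the repetitive one carries far fewer bits per word, which is exactly the intuition above.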
In his seminal paper, Claude Shannon supplemented the concept of absolute entropy with a concept of “relative entropy,” which can be thought of as essentially “how much information a document has versus how much it could have if every word were different.”
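That ratio — actual entropy divided by the maximum possible entropy — is easy to sketch as well (the sample inputs below are my own illustration):

```python
import math
from collections import Counter

def relative_entropy(words):
    """Unigram entropy divided by the maximum it could have if every
    word were different: a 0-to-1 score of how much of its potential
    information a document actually carries."""
    n = len(words)
    counts = Counter(words)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)

# The Shining document scores low; an all-distinct document scores 1.0.
print(relative_entropy("all work and no play makes Jack a dull boy".split() * 2000))  # ~0.23
print(relative_entropy(["word%d" % i for i in range(20000)]))                         # 1.0
```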
The Google patent application, in paragraph 61, says:
“the entropy of queries for one or more documents may be monitored and used as a basis for scoring. For example, if a particular document appears as a hit for a discordant set of queries, this may (though not necessarily) be considered a signal that the document is spam, in which case search engine 125 may score the document relatively lower.”
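A toy sketch of that signal might compute the entropy of the set of queries a document appears for — a document hitting on a scattered, discordant set of queries scores higher. The sample query logs below are my own invention, not from the patent:

```python
import math
from collections import Counter

def query_entropy(query_log):
    """Entropy, in bits, of the distribution of queries for which a
    document appeared as a hit. Higher means a more discordant set."""
    counts = Counter(query_log)
    n = len(query_log)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

focused = ["buy shoes", "shoe store", "buy shoes", "cheap shoes"]
discordant = ["buy shoes", "viagra", "mesothelioma lawyer", "free ringtones"]

print(query_entropy(focused))     # 1.5 bits
print(query_entropy(discordant))  # 2.0 bits (four distinct, unrelated queries)
```

Entropy alone can’t tell that the discordant queries are topically unrelated — a real system would presumably also weigh topical similarity — but it captures the “scattered hits” intuition.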
One can imagine that Google might go beyond the entropy of the queries a document ranks for and simply look at the entropy of the document itself. After all, the document is, in some larger sense, equivalent to all the queries it will rank for.
Well, some folks built a spam detector based on just that concept and concluded that it works pretty well. Since Google is snapping up Ph.D.s left and right in both Computer Science and Information Theory, entropy is certainly an available tool in its toolbox. My guess is that Google uses it for a myriad of purposes: document scoring, spam detection, near-duplicate detection, document categorization, and more.
What is actionable about this from an SEO standpoint?
- Make sure you include stemmed versions of your target keyword, as well as related keywords, on-page, to mix things up a bit.
- Don’t go crazy saying the same thing over and over again.
- If you have access to a good keyword density analysis tool, pay attention to what sort of keyword density the top ranking pages have for your term, and try not to deviate too far from them.
- Make sure you are saying neither TOO MUCH nor TOO LITTLE. If your content says “Justin Bieber” over and over again, you’re saying too little. On the other hand, if every word on the page is completely different from the others, you are saying too much, and will more closely resemble Bayesian-network generated spam.
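The “neither TOO MUCH nor TOO LITTLE” idea can be sketched as a two-sided check on relative entropy. The thresholds here are entirely hypothetical, chosen for illustration — nothing in the patent suggests specific cutoffs:

```python
import math
from collections import Counter

def relative_entropy(words):
    """Unigram entropy as a fraction of the maximum (all words distinct)."""
    n = len(words)
    counts = Counter(words)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)

def content_warning(words, low=0.3, high=0.95):
    """Hypothetical thresholds: too low suggests keyword-stuffed
    repetition ('too little' said); too high suggests machine-generated
    word salad ('too much' said)."""
    r = relative_entropy(words)
    if r < low:
        return "too little"
    if r > high:
        return "too much"
    return "ok"

print(content_warning(["Justin", "Bieber"] * 5000))       # too little
print(content_warning(["w%d" % i for i in range(5000)]))  # too much
```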