Stemming for SEO: The Complete Guide
It has long been known that on-page optimization, which typically focuses on document length and keyword density, can be aided by inclusion of words that are related to the target keyword. Related words aid ranking but can also improve the usefulness of the overall text, its readability, and the degree to which the document appears “natural”. Word Stems represent some of the *most* tightly related words you can pepper into a web page, and deserve close attention when creating content. This posting will explore word stems, how Google uses word stems, and will develop some best practices for utilizing word stems in web content.
What are Word Stems?
Word stems can be thought of as the root for a set of very-similar-meaning words that are in different form. For example: “bats”, “batting”, “batter”, “batted” – all of these share the same stem “bat”, which can be obtained by stripping the suffix characters off of each word (i.e. by stripping the “s” off of “bats”, and so on). However, a stem need not even necessarily be a valid word. For instance, the words “bicycle”, “bicyclist”, and “bicycling” all share the stem “bicycl”, which is clearly not a word. The great thing about stems though is, if you can strip two words down to their stems, and the stems are the same, then the two words must have almost the same meaning and are probably just different forms (plural, adverb, past participle, and so on). If that’s the case, then the words are about as close as you can get from a relevance standpoint; intuitively, the terms “bicycle” and “bicyclist” are more related than “bicycle” and “inner tube”.
The Porter Stemming Algorithm
Various algorithms have been developed for determining the stem of a word (including a surprisingly little-used form of cheating: looking the word up in a dictionary). The most popular stemming algorithm is the Porter stemming algorithm, which is about 85% accurate. In other words, two words that ought to share the same stem are identified by the algorithm to have the same stem about 85% of the time.
The original paper on it can be found here – essentially it is just a set of cascading rules. It’s actually a lot less sophisticated than you might think and its logic is sort of along the lines of “if the word ends in this then do this unless this exception exists”.
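To make the "cascading rules" idea concrete, here is a minimal sketch of a toy suffix stripper in Python. It is *not* the Porter algorithm: the rule list, the minimum stem length, and the doubled-consonant fix are all simplified assumptions invented for illustration, chosen only so that the example words from earlier in this post come out as described.

```python
# Toy suffix stripper: NOT the Porter algorithm, just the same general flavor
# of "if the word ends in this, do this, unless this exception exists".

# Hypothetical suffix rules, checked in order; the first one that applies wins.
SUFFIX_RULES = ["ing", "ed", "ist", "er", "s", "e"]

# Hypothetical exception: never strip a suffix if fewer letters than this remain.
MIN_STEM_LEN = 3


def toy_stem(word: str) -> str:
    """Return a rough stem for `word` using a few cascading suffix rules."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        # Exception: skip this rule if the remaining stem would be too short.
        if len(stem) < MIN_STEM_LEN:
            continue
        # Undo a doubled final consonant, e.g. "batt" -> "bat".
        if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
            stem = stem[:-1]
        return stem
    return word


for w in ["bats", "batting", "batter", "batted", "bicycle", "bicyclist", "bicycling"]:
    print(w, "->", toy_stem(w))
```

The real algorithm has many more rules and conditions than this, so treat the sketch as a way to see the shape of the logic rather than something to use on real content.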
Try the Porter Stemming Algorithm for Yourself
You can try it out for yourself below. Note however, after you hit the button you have to scroll *way* down to see the results, the box at the bottom of the fold is *not* the actual results:
The algorithm usually does pretty well, but one pair of words it fails on is “squeaking”, which stems to “squeak”, and “squeaky”, which, weirdly and frustratingly, stems to “squeaki” (if anything, you would have thought the other one would have done that!). There are a few other stemming algorithms around but they’re only a few percent more accurate at best.
Searching and Word Stems
It is *extremely* common to type one word and see results come back that include, or are focused on, a variation of that word. You may type “bicycling” and receive documents about “bicyclist”. So understanding how Google behaves with regard to word stems, and which variations will help you the most, is critical for content optimization purposes.
The Landmark Study on how Google Handles Stemming
Researchers in Turkey made extensive observations regarding Google’s stemming behavior and published a landmark study on it in 2009 titled “Google Stemming Mechanisms”. It’s not freely available but can be purchased here:
The study attempted to determine, by analyzing Google SERP results, which word forms were returned for 18,000 different words. They focused in many cases on documents that did *not* have the query term in them but were returned for the query term, and then recorded statistics about these relationships.
The Study’s Methodology
They would first take a page, analyze it and figure out what forms of a term were on it, then run queries against Google to see how it was indexed. For instance, to see if a document containing “cyclist” on “www.foo.com/page1.html” is indexed for the term “cycling” (which let’s assume it does not contain), it could be queried simply with [cycling www.foo.com/page1.html]. They also did some other fancy queries to include and exclude various word forms when investigating singulars and plurals, and multi-word phrases, but you get the general idea.
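As a rough illustration of that basic check, the sketch below builds the query strings for the word forms a page does *not* contain. The function name and its parameters are hypothetical, the URL is the placeholder from the example above, and actually running the queries against Google is left out entirely.

```python
def missing_form_queries(page_url, forms_on_page, all_forms):
    """Build one query per word form that is absent from the page.

    Each query has the shape "<form> <page url>", matching the
    [cycling www.foo.com/page1.html] example in the text.
    """
    present = {form.lower() for form in forms_on_page}
    return [f"{form} {page_url}" for form in all_forms if form.lower() not in present]


queries = missing_form_queries(
    page_url="www.foo.com/page1.html",
    forms_on_page=["cyclist"],
    all_forms=["cyclist", "cycling"],
)
print(queries)
```

If the page comes back for one of these queries, it is being returned for a form it does not contain, which is the relationship the study recorded.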
Is Google Intentionally Handling Stems Differently?
The study speculates about alternate stem-oriented indexes that Google may be maintaining. It’s not clear to me from the paper whether Google is really explicitly targeting word stems with special algorithms, or whether the results are simply a byproduct of the fact that different word forms for the same term are highly related, by definition.
Google has disclosed in papers on their Paid Search technology that they have access to a proprietary algorithm similar to Latent Semantic Analysis; these sorts of algorithms can identify related words based on how frequently words appear together in a corpus (i.e. a set of documents). I’ve seen material put out occasionally by SEOMoz speculating or implying that Latent Dirichlet Allocation may be what Google uses; I think that for the machine learning types in the academic community, LDA has been largely superseded in the last couple of years by Principal Components Analysis.
Regardless of the mechanism, it’s clear that Google looks at how related words are to each other when determining the results of a search. And whether or not Google is *intentionally* handling stems differently, the study found that stems consistently act differently than other terms.
Let’s Make a Key Assumption Before Proceeding
The study focused on documents that exclusively contained one form versus another – it did not appear to examine documents with mixed forms in them. Let’s assume that the study’s findings can be applied to mixed documents. So, if the study found that documents with [batgirl] were returned for the query term [bat girl], then it’s reasonable to assume that if you’re optimizing your document for [bat girl] you should also throw in the term [batgirl] a few times, as it will probably help. The interpretations and tables I present below are based on that assumption.
Interpretation #1: Singulars can Help Rank for Plurals and Vice-Versa
The study found that documents with singular forms of keywords came up for plural-form queries more often (about 85% of the time) than documents with plural forms of keywords came up for singular-form queries (about 59% of the time). For instance, a document with “coconut” would be returned for the query “coconuts” a higher percentage of the time than a document about “coconuts” would be returned for queries about “coconut”. In other words, singular phrases help you rank for plurals more than plurals help you rank for singular phrases. So if you are trying to rank for a plural phrase, including the singular term a few times probably helps. The opposite is also true, but less so according to the percentages. Either way, including the other form some number of times is probably wise.
Interpretation #2: Combined Words can Help Rank for Sub-words
The study also examined combined words; in other words, if your content contains [batgirl], will that help it to rank for “bat”, “girl”, “bats”, “girls”, “batsgirl”, “batgirls”, or “batsgirls” as well? What they found (in our interpretation here) was that content in the form [batgirl] should help you to rank for its direct break-up [bat girl], but [batgirl] will *not* help for inexact break-ups or other plural variations (for instance [bat girls], [bats girl], [bats girls], or [batgirls]).
Interpretation #3: Subwords can Help Rank for Combined Words
Is the converse true though, i.e. should content with [bat girl] help you rank for [batgirl]? Based on the study results – *yes* – and it will also help you rank for [batgirls], but surprisingly, not [batsgirl]. Individual sub-words aid in ranking for their exact combination, and also for the plural version of that combination, but only if the second word in the combined version is the plural one (i.e. [rat nest] will likely not help you rank for [ratsnest] but could help you with [ratnests]).
So, by way of corresponding examples we have Table 1, based on the study’s findings and our interpretation of those here. Of course, a term will not just help you rank for another term, it can obviously help rank for itself as well ;-).
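Interpretations #2 and #3 can be written down as two small Python functions. This is only a sketch of our reading of the study, not anything the study itself provides: the function names are made up, and plurals are formed by naively appending "s", which will be wrong for many real words.

```python
def helped_by_combined(first, second):
    """Queries that content containing the combined word should help with.

    Per Interpretation #2: the combined word itself, plus its direct
    break-up, but no plural or inexact variations.
    """
    combined = first + second
    return [combined, f"{first} {second}"]


def helped_by_subwords(first, second):
    """Queries that content containing the separate sub-words should help with.

    Per Interpretation #3: the phrase itself, the exact combination, and the
    combination with only the second word pluralized (naive "+s" plural).
    """
    combined = first + second
    return [f"{first} {second}", combined, combined + "s"]


print(helped_by_combined("bat", "girl"))
print(helped_by_subwords("rat", "nest"))
```

Note that [ratsnest] never shows up in the second list, which matches the finding that a plural first word is not helped.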
Prove it to Yourself
Try it for yourself; for a quick understanding of all of this, try doing queries on Google for [bat girl], [batgirl], [batgirls], and [batsgirl] and see what comes back. You’ll see that Table 1 makes a lot of sense. Table 1 is a little backwards though, and not very useful, so let’s flip it around and make it more useful in Table 2:
Use Additional Terms In Descending Order of Frequency
For the first additional version, use it 1/4 of the number of times you are using the term you want to rank for, then use ratios of 1/8, 1/16, and 1/32 for the others (my recommendations, based on experience).
Why is the first one X/4, where X is the number of times you use the main term? Well, X is too big – you’d then be smearing the relevance of the page out amongst *two* terms, and Google might think your document is not about the main term you’re targeting. So clearly a number smaller than X is the correct one to use. I like X/4 because presumably a natural-appearing distribution should be some sort of long tail geometric distribution, and X/4 is a reasonable guess in that case. Any better suggestions would be gratefully received.
For example, if you want to rank for [bat girl]…
…and keyword frequency analysis of the top ranking pages for that term tells you that you need the term [bat girl] 64 times…
…then also include [bat] 16 times…
…[girl] 16 times…
…and [batgirl] 8 times.
Don’t get hung up on hitting exact numbers though, these are all “ballpark” recommendations.
A *Major* Unanswered Question
However, for those combined-word situations, the study only examined *valid* combined words; it left unexplored the question of nonsense combined words. In other words, if you want to rank for [squeaky floor] should you include [squeakyfloor] in the document? This is a *great* question for our industry to explore – I’ve not seen anything on this but surely someone must have tried this! Please comment below if you have seen any evidence on this front.
Different Verb Forms
Table 10 of the paper, below, shows the study’s results for twelve different verb forms. Column 1 (on the left) represents documents with the particular verb form; Row 1 (at the top) shows the queries that those documents tended to rank for, and the numbers in the table show the % of the time that they ranked. So, for instance, documents containing “ing” terms (like “boxing”) were returned 38.5% of the time when the query ended in “ed” (like “boxed”):
*Reprinted Here by Permission of SAGE and Ahmet Uyar.
“Google Stemming Mechanisms”,
Journal of Information Science 35 (5) 2009, pp. 499–514 © Ahmet Uyar
When you look at Table 10, certain combinations really stand out. The top performers (if you look at the rightmost “Average” column) were the Plain Form, the “-ed” form, the “-tion” form, and the “-tive” form. Surprisingly the “-s” form didn’t perform that well (although it performed well in the individual cases “Plain”, “-ed”, and “-ing”, its performance for all the others was abysmal). Note that “-tive” should help you rank for “-tively”, but the converse is oddly not true.
So, the simple takeaway from this table is: pepper the forms (Plain, -ed, -tion, and -tive) into your content. Below is a table if you want to be more systematic about it. I used a value of around 20% in Table 10 as a filter to come up with the table of best practices for verbs below:
Use the Same Descending Frequency Percentages
For these alternate verb forms I recommend you use the same descending frequency ratios we presented for Table 2 above.
For example, if you want to rank for [creating]…
…and keyword frequency analysis of the top ranking pages for that term tells you that you need the term [creating] 64 times…
…then also include [create] 16 times…
…[creates] 8 times…
…[creation] 4 times…
…and [created] 2 times.
Again, don’t get hung up on exact numbers, these are rough guidelines.
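The halving ratios are easy to compute, and a short Python sketch makes the arithmetic explicit. The function name and parameters are hypothetical, and rounding to whole numbers with a floor of one use per variant is my own assumption rather than anything from the study.

```python
def variant_counts(target_count, variants, first_ratio=0.25):
    """Suggest ballpark usage counts for alternate forms of a target term.

    The first variant gets `first_ratio` of the target term's count, and each
    later variant gets half the ratio of the one before (1/4, 1/8, 1/16, ...).
    """
    counts = {}
    ratio = first_ratio
    for term in variants:
        # Assumption: round to a whole number and use every variant at least once.
        counts[term] = max(1, round(target_count * ratio))
        ratio /= 2
    return counts


print(variant_counts(64, ["create", "creates", "creation", "created"]))
# {'create': 16, 'creates': 8, 'creation': 4, 'created': 2}
```

List the variants in the order you want them weighted, most frequent first.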
Why Descending Order and Not Ascending?
An astute reader might ask why I recommend frequencies in descending order and not ascending order (i.e. interpreting from Table 10, the “-ing” version probably doesn’t help the “Plain” version as much as the “-ed” version does, so why not have “-ing” appear more frequently in your document, so it has the opportunity to help as much as the “-ed” forms you’re including?). The reason is, it looks to me that the researchers organized the columns in descending order of frequency in documents (i.e. you probably see the “Plain” version of a verb more often than the “-tively” version), and I believe that peppering in these other forms in descending order is the proper thing to do from the standpoint of making the content appear as *natural* as possible. The same logic applies to our Table 2 as well.
Another Stemming Use: Meta-Tags
Don’t forget to take advantage of word stems in meta-tags. For instance, if you have a page targeting keywords like “Bicycle”, you might use a title like “Bicycle – information on Bicycling”. This way you’re not overloading the title with the same keyword multiple times, but you’re getting a highly related keyword in there. This should hold for all meta-tags including the meta-description. Also, note that Google often highlights different stems or word combinations in the title and meta-description in the SERP (see figure 1):
Use Stems in Your Keyword Research
The AdWords Keyword Research Tool is absolutely *terrible* at returning alternate word stems. For instance, I did some research recently for a client on “cycling” and came up with thousands of keywords through AdWords – even re-pumping terms back into the tool to find more – but only when I used a third-party keyword tool did I notice the word “cyclist” appear.
I then put that, and a few variations, into the AdWords tool and – voila – hundreds more terms came up that it never suggested in the first place, all highly relevant to what I was researching.
For this reason I *strongly* encourage you to use alternate tools to augment your keyword research. Even Google Suggest itself is a good place to get ideas (in other words, type the stem very slowly and see what comes up). It still fails to bring up “cyclist” for “cycl” but it does suggest a few different stem versions, and correctly extends “squeak” to both “squeaking” and “squeaky”.
You might also try Ubersuggest. It’s an interesting new service that mines Google Suggest and presents it in list form; make sure you change it from the default language of “Catalan” into your language of choice first though. Hats off to Dan Shure over at EvolvingSEO for pointing this tool out to me:
Don’t Neglect Other Related Keywords!
Because Google is using this sort of technology, don’t forget to pepper in related keywords in addition to stem variations. There are a number of free tools you can use to analyze SERPs; one I like, and have written about before, is Textalyser. You can paste a whole bunch of pages into it and it will do frequency counts of all the words, making it very easy to spot good related-word candidates to pepper into your content.
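If you would rather run this kind of frequency count yourself, a few lines of Python will do a rough version. This is a minimal sketch, not a description of how any particular tool works: the stopword list is a tiny hypothetical sample, the sample page texts are made up, and short words are simply dropped.

```python
import re
from collections import Counter

# Tiny, hypothetical stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "you", "any"}


def word_frequencies(pages, top_n=10):
    """Count word occurrences across several page texts, most common first."""
    counts = Counter()
    for text in pages:
        words = re.findall(r"[a-z']+", text.lower())
        # Skip stopwords and very short words, which are rarely useful keywords.
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)


# Made-up sample text standing in for pasted copies of top-ranking pages.
pages = [
    "The cyclist rides a bicycle.",
    "Cycling on a bicycle is fun for any cyclist.",
]
for word, count in word_frequencies(pages, top_n=5):
    print(word, count)
```

Words that show up often across the top-ranking pages but not in your own draft are the candidates worth a closer look.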
Anyone creating content for the web should have a solid understanding of word stems and should be incorporating both word stems and related keywords into their content and meta-tags as an everyday practice. This should help your documents to rank better, be more interesting for end-users, and look a little more natural (thus better able to withstand “human review” by Google). Best of all, stems get the keyword in there a few more times, but in *stealthy* form, which helps keep your documents from looking keyword-stuffed.