2

Dec 11

Stemming for SEO: The Complete Guide

Your Content Optimization Class is Now in Session!

It has long been known that on-page optimization, which typically focuses on document length and keyword density, can be aided by inclusion of words that are related to the target keyword.  Related words aid ranking but can also improve the usefulness of the overall text, its readability, and the degree to which the document appears “natural”.  Word Stems represent some of the *most* tightly related words you can pepper into a web page, and deserve close attention when creating content.  This posting will explore word stems, how Google uses word stems, and will develop some best practices for utilizing word stems in web content.

What are Word Stems?

Word stems can be thought of as the root for a set of very-similar-meaning words that are in different form.  For example: “bats”, “batting”, “batter”, “batted” – all of these share the same stem “bat”, which can be obtained by stripping the suffix characters off of each word (i.e. by stripping the “s” off of “bats”, and so on).  However, a stem need not even necessarily be a valid word.  For instance, the words “bicycle”, “bicyclist”, and “bicycling” all share the stem “bicycl”, which is clearly not a word.  The great thing about stems though is, if you can strip two words down to their stems, and the stems are the same, then the two words must have almost the same meaning and are probably just different forms (plural, adverb, past participle, and so on).  If that’s the case, then the words are about as close as you can get from a relevance standpoint; intuitively, the terms “bicycle” and “bicyclist” are more related than “bicycle” and “inner tube”.

The Porter Stemming Algorithm

Various algorithms have been developed for determining the stem of a word (including a surprisingly little-used form of cheating: looking the word up in a dictionary).  The most popular stemming algorithm is the Porter stemming algorithm, which is about 85% accurate.  In other words, two words that ought to share the same stem are identified by the algorithm to have the same stem about 85% of the time.

The original paper on it can be found here – essentially it is just a set of cascading rules.  It’s actually a lot less sophisticated than you might think and its logic is sort of along the lines of “if the word ends in this then do this unless this exception exists”.
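
To give a feel for what "cascading rules" means, here is a minimal Python sketch. It is not the Porter algorithm: it covers only a handful of rules in the same style (plural endings, "-ed"/"-ing", and a trailing "y"). The function name `toy_stem` and the simplified vowel test are assumptions made purely for illustration.

```python
VOWELS = set("aeiou")  # simplification: the real algorithm sometimes treats "y" as a vowel


def has_vowel(stem):
    """Return True if the candidate stem contains at least one vowel."""
    return any(ch in VOWELS for ch in stem)


def toy_stem(word):
    """Apply a few Porter-style suffix rules in sequence (illustrative only)."""
    word = word.lower()

    # Rule group 1: plural endings; the first matching rule wins.
    if word.endswith("sses"):
        word = word[:-2]          # "caresses" -> "caress"
    elif word.endswith("ies"):
        word = word[:-2]          # "ponies" -> "poni"
    elif word.endswith("ss"):
        pass                      # exception: leave "-ss" words alone
    elif word.endswith("s"):
        word = word[:-1]          # "bats" -> "bat"

    # Rule group 2: strip "-ing" or "-ed", unless no vowel would be left.
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and has_vowel(stem):
            word = stem
            # Undo a doubled final consonant ("batt" -> "bat"), except l, s, z.
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
                word = word[:-1]
            break

    # Rule group 3: a trailing "y" becomes "i" when the rest contains a vowel.
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"

    return word


for w in ("bats", "batting", "batted", "squeaking", "squeaky"):
    print(w, "->", toy_stem(w))
```

Under these toy rules "squeaking" comes out as "squeak" while "squeaky" comes out as "squeaki", because the trailing-"y" rule fires on one word and not the other.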

Try the Porter Stemming Algorithm for Yourself

You can try it out for yourself below.  Note, however, that after you hit the button you have to scroll *way* down to see the results; the box at the bottom of the fold is *not* the actual results:

Porter Stemming Demo Online

The algorithm usually does pretty well, but one pair of words that it fails on is “squeaking”, which stems to “squeak”, and “squeaky”, which, weirdly and frustratingly, stems to “squeaki” (if anything, you would have thought the other one would have done that!).  There are a few other stemming algorithms around but they’re only a few percent more accurate at best.

Searching and Word Stems

It is *extremely* common to type one word, and see results come back that include, or are focused on, a variation of that word.  You may type “bicycling” and receive documents about “bicyclist”.  So an understanding of how Google behaves with regard to word stems, and of which variations will help you the most, is critical for content optimization purposes.

The Landmark Study on how Google Handles Stemming

Researchers in Turkey made extensive observations regarding Google’s stemming behavior and published a landmark study on it in 2009 titled “Google Stemming Mechanisms”.  It’s not freely available but can be purchased here:

Google Stemming Mechanisms

The study attempted to determine, by analyzing Google SERPs, which word forms were returned for 18,000 different words. The researchers focused in many cases on documents that did *not* have the query term in them but were returned for the query term, and then recorded statistics about these relationships.

The Study’s Methodology

They would first take a page, analyze it and figure out what forms of a term were on it, then run queries against Google to see how it was indexed.  For instance, to see if a document containing “cyclist” on “www.foo.com/page1.html” is indexed for the term “cycling” (which let’s assume it does not contain), it could be queried simply with [cycling  www.foo.com/page1.html].  They also did some other fancy queries to include and exclude various word forms when investigating singulars and plurals, and multi-word phrases, but you get the general idea.
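
The basic probing step described above can be sketched roughly as follows. This is only an illustration of the idea, not the researchers' actual tooling: the function names `forms_on_page` and `probe_queries` are hypothetical, and the query format simply mirrors the [cycling www.foo.com/page1.html] example.

```python
import re


def forms_on_page(page_text, word_forms):
    """Return the subset of word_forms that literally appear in the page text."""
    tokens = set(re.findall(r"[a-z]+", page_text.lower()))
    return {form for form in word_forms if form in tokens}


def probe_queries(page_url, page_text, word_forms):
    """Build one query per word form that is NOT on the page.

    If the page is still returned for such a query, the search engine
    matched it through a different form of the word.
    """
    present = forms_on_page(page_text, word_forms)
    absent = sorted(set(word_forms) - present)
    return [f"{form} {page_url}" for form in absent]


print(probe_queries(
    "www.foo.com/page1.html",
    "A cyclist rode past the shop.",
    ["cyclist", "cycling"],
))
```

Each query returned here would then be run by hand (or by whatever means the search engine permits) to see whether the page comes back.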

Is Google Intentionally Handling Stems Differently?

The study speculates about alternate stem-oriented indexes that Google may be maintaining.  It’s not clear to me from the paper whether Google is really explicitly targeting word stems with special algorithms, or whether the results are simply a byproduct of the fact that different word forms for the same term are highly related, by definition.

Google has disclosed in papers on their Paid Search technology that they have access to a proprietary algorithm similar to Latent Semantic Analysis; these sorts of algorithms can identify related words based on how frequently words appear together in a corpus (i.e. a set of documents).  I’ve seen material put out occasionally by SEOMoz speculating or implying that Latent Dirichlet Allocation may be what Google uses; I think that for the machine learning types in the academic community, LDA has been largely superseded in the last couple of years  by Principal Components Analysis.

Regardless of the mechanism, it’s clear that Google looks at how related words are to each other when determining results of a search.  Either way, the study found that regardless of whether Google is *intentionally* handling stems differently, stems seem to consistently act differently than other terms.

Let’s Make a Key Assumption Before Proceeding

The study focused on documents that exclusively contained one form versus another – it did not appear to examine  documents with mixed forms in them.  Let’s assume that the study’s findings can be applied to mixed documents. So, if the study found that documents with [batgirl] were returned for the query term [bat girl], then it’s reasonable to  assume that if you’re optimizing your document for [bat girl] you should also throw in the term [batgirl] a few times,  as it will probably help.  The interpretations and tables I present below are based on that assumption.

Interpretation #1: Singulars can Help Rank for Plurals and Vice-Versa

The study found that documents with singular forms of keywords came up for plural-form queries more often (about 85% of the time) than documents with plural forms of keywords came up for singular-form queries (about 59% of the time).  For instance, a document with “coconut” would be returned for the query “coconuts” a higher percentage of the time than a document about “coconuts” would be returned for queries about “coconut”.  In other words, singular phrases help you rank for plurals more than plurals help you rank for singular phrases.  So if you are trying to rank for a plural phrase, including the singular term a few times probably helps.  The opposite is also true, but less so according to the percentages.  Either way, including the other form some number of times is probably wise.

Interpretation #2: Combined Words can Help Rank for Sub-words

The study also examined combined words.  In other words, if your content contains [batgirl], will that help it to rank for “bat”, “girl”, “bats”, “girls”, “batsgirl”, “batgirls”, or “batsgirls” as well?  What they found (in our interpretation here) was that content in the form [batgirl] should help you to rank for its direct break-up [bat girl], but [batgirl] will *not* help for inexact break-ups or other plural variations (for instance [bat girls], [bats girl], [bats girls], or [batgirls]).

Interpretation #3: Subwords can Help Rank for Combined Words

Is the converse true though, i.e. should content with [bat girl] help you rank for [batgirl]?  Based on the study results – *yes* – and it will also help you rank for [batgirls], but surprisingly, not [batsgirl].  Individual sub-words aid in ranking for their exact combination, and also for the plural version of that combination, but only if the second word in the combined version is the plural one (i.e. [rat nest] will likely not help you rank for [ratsnest] but could help you with [ratnests]).

So, by way of corresponding examples we have Table 1, based on the study’s findings and our interpretation of those here.  Of course, a term will not just help you rank for another term, it can obviously help rank for itself as well ;-).
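
Interpretations #2 and #3 can be written down as a small Python sketch. The function name `queries_helped_by` is hypothetical, plurals are formed naively by appending "s", and the rules encode only this post's reading of the study, not anything Google has confirmed.

```python
def queries_helped_by(content_form, first, second):
    """List the queries a given content form is expected to help with.

    content_form is what appears on the page; first and second are the
    two sub-words (for example "bat" and "girl").
    """
    combined = first + second          # "batgirl"
    split = f"{first} {second}"        # "bat girl"

    if content_form == combined:
        # Interpretation #2: a combined word helps itself and its exact
        # break-up, but not plural or inexact variations.
        return [combined, split]
    if content_form == split:
        # Interpretation #3: sub-words help themselves, the exact
        # combination, and the combination pluralized on the second word.
        return [split, combined, combined + "s"]
    # Anything else is only assumed to help for itself.
    return [content_form]


print(queries_helped_by("batgirl", "bat", "girl"))
print(queries_helped_by("bat girl", "bat", "girl"))
```

Note that [batsgirl] appears in neither list, matching the study's finding that a plural on the first sub-word does not carry over.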

Table 1 – Effect of Plural/Singular Word Combinations


Prove it to Yourself

Try it for yourself; for a quick understanding of all of this, try doing queries on Google for [bat girl], [batgirl], [batgirls], and [batsgirl] and see what comes back.  You’ll see that Table 1 makes a lot of sense.  Table 1 is a little backwards though, and not very useful as it stands, so let’s flip it around and make it more useful in Table 2:

Table 2 – Best Practice for Singulars, Plurals, and Combination Terms


Use Additional Terms In Descending Order of Frequency

For the first additional version, use it 1/4 of the number of times you are using the term you want to rank for, then use ratios of 1/8, 1/16, and 1/32 for the others (my recommendations, based on experience).

Why is the first one X/4, where X is the number of times you use the main term?  Well, X is too big – you’d then be smearing the relevance of the page out amongst *two* terms, and Google might think your document is not about the main term you’re targeting.  So clearly a number smaller than X is the correct one to use.  I like X/4 because presumably a natural-appearing distribution should be some sort of long tail geometric distribution, and X/4 is a reasonable guess in that case.  Any better suggestions would be greatly appreciated.

For example, if you want to rank for [bat girl]…
…and keyword frequency analysis of the top ranking pages for that term tells you that you need the term [bat girl] 64 times…
…then also include [bat] 16 times…
…[girl] 16 times…
…and [batgirl] 8 times.

Don’t get hung up on hitting exact numbers though, these are all “ballpark” recommendations.
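
The arithmetic above is easy to script. The sketch below is only a rough planning aid: the function name `plan_frequencies`, the idea of grouping variants into "tiers" (so that [bat] and [girl] share the 1/4 tier, as in the example), and the floor of one occurrence are all assumptions made for illustration.

```python
def plan_frequencies(target_count, tiers):
    """Suggest a count for each variant.

    Tier 0 gets 1/4 of the target count, tier 1 gets 1/8, tier 2 gets
    1/16, and so on. Every variant in the same tier gets the same count.
    """
    plan = {}
    for tier_index, variants in enumerate(tiers):
        divisor = 4 * (2 ** tier_index)
        # Assumption: never suggest fewer than one occurrence.
        count = max(1, round(target_count / divisor))
        for variant in variants:
            plan[variant] = count
    return plan


print(plan_frequencies(64, [["bat", "girl"], ["batgirl"]]))
# {'bat': 16, 'girl': 16, 'batgirl': 8}
```

The same helper works for any target count; the outputs are ballpark figures, not quotas.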

A *Major* Unanswered Question

However, for those combined word situations, the study only examined *valid* combined words; it left unexplored the question of nonsense combined words. In other words, if you want to rank for [squeaky floor] should you include [squeakyfloor] in the document?  This is a *great* question for our industry to explore – I’ve not seen anything on this but surely someone must have tried this! Please comment below if you have seen any evidence on this front.

Different Verb Forms

Table 10 of the paper, below, shows the study’s results for twelve different verb forms.  Column 1 (on the left) represents documents with the particular verb form; Row 1 (at the top) shows the queries that those documents tended to rank for, and the numbers in the table show the % of the time that they ranked.  So, for instance, documents containing “ing” terms (like “boxing”) were returned 38.5% of the time when the query ended in “ed” (like “boxed”):

Stemming test results in percentages for 10 different verbs with 12 different postfixes*

*Reprinted here by permission of SAGE and Ahmet Uyar.
“Google Stemming Mechanisms”,
Journal of Information Science 35 (5) 2009, pp. 499–514 © Ahmet Uyar

When you look at Table 10, certain combinations really stand out.  The top performers (if you look at the rightmost “Average” column) were the Plain Form, the “-ed” form, the “-tion” form, and the “-tive” form. Surprisingly the “-s” form didn’t perform that well (although it performed well in the individual cases “Plain”, “-ed”, and “-ing”, its performance for all the others was abysmal).  Note that “-tive” should help you rank for “-tively”, but the converse is oddly not true.

So, the simple takeaway from this table is: pepper the forms (Plain, -ed, -tion, and -tive) into your content.  Below is a table if you want to be more systematic about it.  I used a value of around 20% in Table 10 as a filter to come up with the table of best practices for verbs below:

Table 3 – Best Practice for Verb Forms


Use the Same Descending Frequency Percentages

For these alternate verb forms I recommend you use the same descending frequency ratios we presented for Table 2 above.

For example, if you want to rank for [creating]…
…and keyword frequency analysis of the top ranking pages for that term tells you that you need the term [creating] 64 times…
…then also include [create] 16 times…
…[creates] 8 times…
…[creation] 4 times…
…and [created] 2 times.

Again, don’t get hung up on exact numbers, these are rough guidelines.
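
To see how a draft compares with targets like these, you can count how often each form actually appears. The following is a minimal sketch using only the Python standard library; the function name `count_forms` and the sample draft are made up for illustration, and it only handles single-word forms (a phrase like [bat girl] would need different matching).

```python
import re
from collections import Counter


def count_forms(text, forms):
    """Count exact, case-insensitive occurrences of each single-word form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur.
    return {form: counts[form] for form in forms}


draft = "Creating a site means creating pages. Create one page; she creates two."
print(count_forms(draft, ["creating", "create", "creates", "creation", "created"]))
# {'creating': 2, 'create': 1, 'creates': 1, 'creation': 0, 'created': 0}
```

Forms that come back with a count of zero are the obvious candidates to work into the next revision.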

Why Descending Order and Not Ascending?

An astute reader might ask why I recommend frequencies in descending order and not ascending order.  After all, interpreting from Table 10, the “-ing” version probably doesn’t help the “Plain” version as much as the “-ed” version does, so why not have “-ing” appear more frequently in your document, so that it has the opportunity to help as much as the “-ed” forms you’re including?  The reason is that it looks to me as though the researchers organized the columns in descending order of frequency in documents (i.e. you probably see the “Plain” version of a verb more often than the “-tively” version), and I believe that peppering in these other forms in descending order is the proper thing to do from the standpoint of making the content appear as *natural* as possible.  The same logic applies to our Table 2 as well.

Another Stemming Use: Meta-Tags

Don’t forget to take advantage of word stems in meta-tags.  For instance, if you have a page targeting keywords like “Bicycle”, you might use a title like “Bicycle – information on Bicycling”.  This way you’re not overloading the title with the same keyword multiple times, but you’re getting a highly related keyword in there.  This should hold for all meta-tags including the meta-description.  Also, note that Google often highlights different stems or word combinations in the title and meta-description in the SERP (see figure 1):

Compound Version of Search Term Bolded in Meta-Description


Use Stems in Your Keyword Research

The AdWords Keyword Research Tool is absolutely *terrible* at returning alternate word stems.  For instance, I did some research recently for a client on “cycling” and came up with thousands of keywords through AdWords – even re-pumping terms back into the tool to find more – but only when I used a third-party keyword tool did I notice the word “cyclist” appear.

I then put that, and a few variations, into the AdWords tool and – voila – hundreds more terms came up that it never suggested in the first place, all highly relevant to what I was researching.

For this reason I *strongly* encourage you to use alternate tools to augment your keyword research.  Even Google Suggest itself is a good place to get ideas (in other words, type the stem very slowly and see what comes up).  It still fails to bring up “cyclist” for “cycl” but it does suggest a few different stem versions, and correctly extends “squeak” to both “squeaking” and “squeaky”.

You might also try Ubersuggest.  It’s an interesting new service that mines Google Suggest and presents the results in list form; make sure you change it from the default language of “Catalan” to your language of choice first though.  Hats off to Dan Shure over at EvolvingSEO for pointing this tool out to me:

Ubersuggest

Don’t Neglect Other Related Keywords!

Because Google is using this sort of technology, don’t forget to pepper related keywords in addition to stem variations; there are a number of free tools available you can use to analyze SERPs; one I like that I’ve written about before is Textalyser – you can paste a whole bunch of pages into it and it will do frequency counts of all the words, making it very easy to spot good related-word candidates to pepper into your content.
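
If you would rather script this kind of analysis than paste pages into a tool, a rough Python sketch follows. It is not how Textalyser works internally; it swaps in a slightly different technique, counting how many pages each word appears on rather than raw frequency. The function name `related_word_candidates` and the tiny stopword list are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "on", "with"}


def related_word_candidates(pages, target, top_n=10):
    """Rank words by the number of pages they appear on.

    pages is a list of page texts (for example the top-ranking pages for
    the target term); the target itself and stopwords are excluded.
    """
    page_counts = Counter()
    for page in pages:
        words = set(re.findall(r"[a-z]+", page.lower()))
        page_counts.update(words - STOPWORDS - {target})
    return page_counts.most_common(top_n)


pages = [
    "Cycling gear for the road cyclist",
    "Road cycling routes and gear",
    "Cycling helmets",
]
print(related_word_candidates(pages, "cycling", top_n=2))
```

Words that show up on several of the pages ("gear" and "road" in this toy input) are reasonable related-word candidates to consider.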

Conclusion

Anyone creating content for the web should have a solid understanding of word stems and should be incorporating both word stems and related keywords into their content and meta-tags as an everyday practice.  This should help your documents to rank better, be more interesting for end-users, and look a little more natural (thus better able to withstand “human review” by Google).  Best of all, they will help keep your documents from looking a little too keyword-stuffed – by getting the keyword in there a few more times – but in *stealthy* form.

6 Comments

  1. RizMi says:

    Hi, This is very tricky to embed same keyword in so many forms. One really needs a good copy writer.

  2. Dan Shure says:

    Hey Ted

    Nice article! And thanks for the mention, although I credit Richard Baxter of SEOGadget as being the original source of how I found Ubersuggest 🙂

    Anyhow, the application of word stemming, and including stem variations seems to make a lot of sense when you have a page where your main target keyword is already known (you’ve done the research). It makes sense that it could only support the primary keyword and also add variety to the document.

    In terms of keyword research, that’s a very interesting observation, about the lack of stemmed results in the AdWords tool and Google Suggest. Perhaps since “bicycle” is a different real world object from “bicyclist” and neither are the action (verb) of bicycling the AdWords tool does not see the three as related.

    When I am doing keyword research, I typically start with a few “buckets”:
    – nouns (people, places, objects, etc)
    – verbs or actions
    – adjectives or descriptive words added onto the nouns

    That way I’m not relying on the tool to bring up a noun/verb relationship for example.

    **Also, I JUST tried entering some stems into the AdWords tool. VERY interesting. Try ‘bicyc’ in the AdWords tool. And THEN try searching ‘bicyc’ in Google… see any relationship to both results? 🙂

    -Dan

  3. Pavlos says:

    Great piece of information. I’ve been trying hard to convince my clients’ marketing departments to enrich their content with various combinations and variations of the target keywords.

    I agree AdWords is a great tool for SEO. In particular, running a campaign there before drawing any decisions can be crucial.

    The only problem with the Google AdWords and Google Suggest keyword data is that you need to work out a lot the combinations and variations/synonyms/etc to have the big picture of what is happening in your market.

  4. Rod says:

    Thanks for the great article.

    Related to keyword density:
    According to you, what’s the best tool (free if possible) to measure “real keyword density” when writing an article, real meaning not only number of occurrences but also where it’s placed (higher>lower), in which markup (h1, he, bold or not…), proximity between two occurrences,….?

    is there a tool checking it “live” while you modify the text ?

    Thanks
    Rod

  5. Ted Ives says:

    There are some other free tools similar to textalyser, if you Google [keyword density] you can find them.

    Proximity – not a lot of people have done much work on that from what I can tell, even though it’s a huge pre-Google technology from the 80’s that was a big deal back then, companies like Verity provided products for drug companies and the government, and so on (for looking up FDA filings, presumably for finding intelligence information, and so on).

    Verity is a real sad story actually, they could have been Google but giving their service away for free was just unthinkable for them – a great case study that fans of Clayton Christensen’s “The Innovator’s Solution” would appreciate.

    Today’s Search Engines are thought to use proximity as a signal perhaps but it’s really unexplored territory from an SEO perspective.

    Ranks.NL has a proximity tool here but I haven’t really played with it:
    http://www.ranks.nl/tools/proximity.html

    As far as measuring while you’re doing the work, there are a couple of WordPress plugins that support that, again, if you Google them I’m sure you’ll find them.
