Google’s Secret Ranking Algorithm Exposed
Last year, some people from the academic community who hadn’t been snatched up yet by Google or Bing did a really interesting study. Rather than simply researching correlations between factors and rankings, as SEOMoz does a great job of doing every so often, they used machine learning techniques to create their own search engine, and trained it to reproduce results similar to Google’s. After the training process, they extracted the ranking factors from their trained engine, published them, and presented them at an industry conference. They were able, for the queries they trained on, to correctly predict 8 of the top 10 Google results roughly 80% of the time. Not bad, considering Google’s algorithms use “over 200 variables” and the study only examined 17 of them – obviously they chose wisely. I’ve mentioned this in a previous posting, but I think a really thorough run-through of the study would be informative and interesting.
How the Study was Done
In their paper “How to Improve Your Google Ranking: Myths and Reality”, they detail what the actual weightings were for the various ranking factors. You could essentially take the values in Figure 4B on page 7 (look at the graph on the upper right, find the line marked with x’s, which is the third iteration they converged on, and read the values off the left axis) and construct a regression equation with the weights, i.e.
Rank score = .95 x PageRank + .80 x (# of keyword occurrences in hostname) + .58 x (# of keyword occurrences in meta-description tag) + ……..
If you were to pull all of the factors for the top 200 SERP results for a particular keyword, apply them to this equation to come up with a score for each result, and then sort the results by that score, you’d have a shot at reproducing the correct order of the top 10 results. Doing so would of course be a significant effort, and I am unaware of anyone publishing anything duplicating their results.
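To make the scoring-and-sorting idea concrete, here is a minimal Python sketch. It uses only the three weights quoted in the example equation above (the remaining terms are elided there, so they are left out here too). The result URLs and factor values are made up purely for illustration, and the names `WEIGHTS`, `rank_score`, and `rerank` are my own, not anything from the paper.

```python
# Weights taken from the example equation in this article. The other terms
# are elided there, so this sketch deliberately uses only these three.
WEIGHTS = {
    "pagerank": 0.95,
    "kw_in_hostname": 0.80,
    "kw_in_meta_description": 0.58,
}


def rank_score(features):
    """Weighted sum of factor values for one result (missing factors count as 0)."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def rerank(results, top_n=10):
    """Sort results by descending score and keep the first top_n."""
    ordered = sorted(results, key=lambda r: rank_score(r["features"]), reverse=True)
    return ordered[:top_n]


# Hypothetical factor values for three made-up results (illustration only).
results = [
    {"url": "a.example", "features": {"pagerank": 3, "kw_in_hostname": 0, "kw_in_meta_description": 1}},
    {"url": "b.example", "features": {"pagerank": 2, "kw_in_hostname": 1, "kw_in_meta_description": 1}},
    {"url": "c.example", "features": {"pagerank": 4, "kw_in_hostname": 0, "kw_in_meta_description": 0}},
]

for result in rerank(results):
    print(result["url"], round(rank_score(result["features"]), 2))
```

In a real attempt you would replace the made-up list with the factor values scraped for each of the 200 results and fill in the rest of the weights from the paper’s figure; whether the factors need to be scaled or normalized first is something you would have to check against the paper.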
This was a revolutionary study, because you can look at the SEOMoz calculated correlation variables all you want, but you can’t really construct a valid regression equation from them, as correlations don’t exactly add (there’s cross-correlation between them, and probably a myriad of other statistical issues with doing so).
Was or is the Study Valid?
There are arguments that this study is inconclusive or only partially useful, since only a particular set of keywords was studied, the study was done pre-Caffeine, and the only off-page factor studied was PageRank. Yes, SEOMoz recently found that the factor most highly correlated with ranking was Facebook likes, and certainly things have changed since Caffeine. However, think about all of this from Google’s perspective. How much can they really upset the entire apple cart by changing everything? The web has changed in the last couple of years, but I would argue not a lot. Even if there have been major changes to Google’s algorithms, and certainly there are many unaccounted-for variables, I am of the opinion that things cannot have changed that much.
Either way, examining the results of this study is very instructive, and an interesting thought exercise for understanding how and why SEO works the way it does.
The Ranking Factors the Study Confirmed
Below I’ve reproduced each ranking factor listed in the paper, and have eyeballed the values off of the graph for the weightings. What’s interesting is not the exact values, but the ordering and also the very nature of the factors they analyzed:
Bounded vs. Unbounded, Linear vs. Logarithmic
Almost all of these variables have bounds to them. For instance, you can only put the keyword in a title so many times before you “trip a search spam filter”. The age of a domain is ultimately bounded to whenever the domain name system started, and so on. There is one variable that is not bounded – PageRank. It is interesting to note however, that this one is logarithmic – each level requires, on average, 5 times as many links to reach (for more on this, see a previous article I wrote for SearchEngineLand on that topic here).
So, you can get all the PageRank you want, but it’s going to get harder and harder the more you do it, relative to the other variables. This explains why some of the cheapest things you can do (i.e. highest ROI) are to fix your title, meta-description, H1, and so on, and then get a few links to get the page’s PageRank up to a PR2 or PR3 level.
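A quick sketch of what that logarithmic scale implies, taking the “5 times as many links per level” figure above at face value. The multiplier is an average and the starting point of one “unit” of links at PR0 is an arbitrary assumption of mine, so only the ratios between levels mean anything here.

```python
LINKS_MULTIPLIER = 5  # the article's rough average: ~5x the links per PageRank level


def relative_links_needed(level, base_links=1.0):
    """Relative link volume needed to reach a PageRank level.

    base_links is an arbitrary unit for level 0 (an assumption), so the
    output is only meaningful as a ratio between levels.
    """
    return base_links * LINKS_MULTIPLIER ** level


for level in range(7):
    total = relative_links_needed(level)
    # Extra links needed to climb from the previous level to this one.
    step = total - relative_links_needed(level - 1) if level > 0 else total
    print(f"PR{level}: {total:>8.0f} units total, {step:>8.0f} more than the level below")
```

Each step costs five times what the previous step did, which is why the first few levels are cheap and the later ones are not.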
Surprisingly, this study found value in having outbound links on the page with anchor text that includes the keyword. I’ve marked this as “bounded” because again, if you have too many outbound links with targeted anchor text, you’re likely to be identified as search spam.
Incoming Anchor Text
The biggest missed opportunity in this study was not looking at keywords in inbound anchor text. I am postulating in the table that this is unbounded and linear. I think many of us have seen examples of situations in the SERPs where a PageRank 5 page is being outranked by a PageRank 2 page, the difference being something like 800 incoming links with targeted anchor text. My belief is that the weighting of this variable is very low (on the order of .05 to .1), but linear. This would explain why anchor text is the be-all and end-all of SEO: it may have a low weighting, but more just plain helps, and your ability to get it is virtually unlimited. You also get a sort of double value, in that the anchor text is probably one factor in ranking, and the link itself slightly increases your PageRank factor.
It’s important to note that SEOMoz’s correlation research shows a fairly low correlation of ranking to incoming links with exact anchor text. But if this corresponds to a weighting that is linear and unbounded, then even a weak correlation, when multiplied by a large enough number of links, could make a huge difference in ranking. I am of the opinion that a lot more research into this is warranted.
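Here is a back-of-the-envelope sketch of that argument. It rests on two loud assumptions: that my postulated anchor-text weighting (.05 to .1) is in the right ballpark, and that the raw PageRank level and the raw link count can be plugged straight into the equation without any scaling, which the study does not confirm. Treat it as a thought experiment, not a measurement.

```python
PAGERANK_WEIGHT = 0.95  # from the example equation earlier in this article


def break_even_links(pr_high, pr_low, anchor_weight):
    """Anchor-text links the lower-PR page needs to cancel the PageRank gap.

    Assumes a linear, unbounded anchor-text term with the given weight
    (a postulate, not something the study measured).
    """
    pagerank_gap = PAGERANK_WEIGHT * (pr_high - pr_low)
    return pagerank_gap / anchor_weight


# The PR5-versus-PR2 scenario, at both ends of the postulated weight range.
for weight in (0.05, 0.1):
    needed = break_even_links(5, 2, weight)
    print(f"weight {weight}: about {needed:.1f} targeted links to draw level")
```

Under these assumptions, the 800 targeted links in the scenario above are far more than the PR2 page would need, which fits the idea that a low but linear weighting can swamp a bounded or logarithmic one.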
Other unbounded variables that may be useful to Google for ranking purposes include, of course, Tweets, Facebook Likes, and (when enough data accumulates, but probably not yet) Google +1’s.
Other interesting takeaways from the study – there is such a thing as over-optimization (i.e. keywords in H4 and H5 tags can actually hurt you slightly), keyword density matters (so get on the cluetrain, anti-keyword-density people!), and keyword-rich domain names are extremely important.
The study of course isn’t valid for specialized portions of Google’s search algorithms, such as the order in which YouTube videos are sorted, or the Local Search component of Universal Search – many of these use other factors (of the 200+).
However, the study illustrates a few things about SEO overall. It makes the most sense, from the perspective of the weightings available for each factor, to take care of your easy on-page issues first, then work on building up links (actually, it makes the most sense of all to buy an exact-match domain name first!). This explains why most people in this field typically take care of issues in that order, and explains the natural logical flow of SEO efforts: getting the architecture right, then optimizing your content, and finally focusing on linking. Essentially SEO, like so many other fields, is all about identifying the work with the highest return for the effort and focusing on that first.