This is a pointless update.
I haven’t updated this blog for over a month. Google likes it when I update frequently. So I’m doing a pointless update to see what Google does. Last year I started updating the blog every two weeks and have noticed a huge increase in search traffic since then (anecdotal evidence I know, but controlled experiments on SEO are difficult).
Now that my updates are less frequent, my traffic has dipped a little again. So I’m hoping that this update will cause my traffic to increase again.
So it’s not really a pointless update; it’s part of an experiment to see how Google responds. In fact, I’ve managed to turn this into a vaguely interesting post about search engines after all, so the experiment is probably hopelessly flawed. But read on if you’re interested in some chatter about how search engines like Google actually work.
Talking of search signals, there are a few different philosophies when it comes to search. At the moment, the way search engines work is essentially the ‘bag of words’ approach: they treat the document simply as a collection of words with no order or meaning, so “I am a funny dog” is treated the same as “dog a am funny I”.
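To make the bag-of-words idea concrete, here is a minimal Python sketch. The function name `bag_of_words` is my own invention for illustration, and splitting on whitespace is a simplifying assumption; real search engines tokenise far more carefully.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split on whitespace and count each word.

    The Counter keeps how often each word occurs but throws away the order.
    """
    return Counter(text.lower().split())


original = bag_of_words("I am a funny dog")
shuffled = bag_of_words("dog a am funny I")

# Both sentences contain the same words, so the two bags are identical.
print(original == shuffled)  # True
```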
Finding relevant documents is then just a matter of counting the number of words which match between the search query and the document. Some simple steps are taken to help, such as ‘stemming’, which involves stripping the word ending. For example, ‘ending’ might be treated as ‘end’ so that ‘ends’, ‘end’ and ‘ending’ all match. You can see this in action on Google results pages, where words which are similar to the ones you typed are highlighted.
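As a rough illustration of matching with stemming, the sketch below strips a couple of suffixes and counts how many query words also appear in the document. The suffix list, the minimum stem length and the names `crude_stem` and `match_count` are all assumptions made up for this example; production systems use proper stemmers and much more besides.

```python
SUFFIXES = ("ing", "s")  # toy suffix list, purely an assumption
MIN_STEM_LENGTH = 3      # avoid mangling short words such as "is"


def crude_stem(word: str) -> str:
    """Lowercase a word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def match_count(query: str, document: str) -> int:
    """Count the query words whose stem also occurs in the document."""
    document_stems = {crude_stem(word) for word in document.split()}
    return sum(1 for word in query.split() if crude_stem(word) in document_stems)


# 'ends', 'end' and 'Ending' all reduce to the same stem.
print(crude_stem("ends"), crude_stem("end"), crude_stem("Ending"))  # end end end
print(match_count("funny dogs", "I am a funny dog"))  # 2
```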
However, the order of words in English is important, and search engines lose some information by dropping it. To get some of that information back, search engines also take n-gram sub-phrases from the document: rather than relying on the simplistic bag of words alone, they also enumerate the sub-phrases contained within the document to help with matching. For example, if your document consists of the text “I am a funny dog”, then the search engine might internally represent that document as “dog”, “am”, “fun”, “I am”, “am a”, “a funny”, “funny dog”.
This representation has stripped ‘stopwords’ (words which are so common they’re not very helpful, like “I” and “a”), it has stemmed “funny” to “fun”, and it has included bigram (two-word) phrases. That means the document will rank better for “funny dog” than for “a dog”. In both cases the query consists of two words which are present in the document, but with 2-gram phrases in the index we can write an algorithm which realises that the phrase “funny dog” is also in the document. So the “funny dog” query gets 3 hits in our index (“fun”, “dog” and “funny dog”), while “a dog” gets just 1 (“dog”), because “a” was dropped as a stopword and “a dog” never appears as a phrase.
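Here is a small Python sketch of that kind of index, built for the “I am a funny dog” example. The stopword set, the hand-written `STEMS` lookup (standing in for a real stemmer) and the function names are all assumptions for illustration, not how any real search engine does it.

```python
STOPWORDS = {"i", "a"}    # toy stopword list, an assumption
STEMS = {"funny": "fun"}  # hand-written lookup standing in for a real stemmer


def index_terms(text: str) -> set[str]:
    """Return stemmed non-stopword unigrams plus all adjacent word pairs."""
    words = text.lower().split()
    unigrams = {STEMS.get(word, word) for word in words if word not in STOPWORDS}
    bigrams = {" ".join(pair) for pair in zip(words, words[1:])}
    return unigrams | bigrams


def score(query: str, document: str) -> int:
    """Count the index terms shared by the query and the document."""
    return len(index_terms(query) & index_terms(document))


document = "I am a funny dog"
print(sorted(index_terms(document)))
# ['a funny', 'am', 'am a', 'dog', 'fun', 'funny dog', 'i am']
print(score("funny dog", document))  # 3
print(score("a dog", document))      # 1
```

Under these assumptions “funny dog” scores 3 (“fun”, “dog” and the phrase “funny dog”), while “a dog” scores only 1, because the stopword “a” is never indexed on its own and the phrase “a dog” does not occur in the document.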
I’ve been told that as of 2009 Google had collected 6-gram phrase data for the entire web (that’s quite a lot of data). These words may also be weighted according to where they appear inside the document (towards the top? in a heading? as link-text?).
Documents are also weighted according to their authority, as calculated from how many pages link to them and how authoritative those linking pages are themselves (this is the idea behind PageRank). And update frequency is another factor. Hence this update, just to see what effect it really has.
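For the curious, the link-based authority idea can be sketched in a few lines of Python. This is the simplified textbook form of PageRank, computed by repeated iteration over a tiny made-up link graph. The page names, the damping value of 0.85 and the iteration count are assumptions for illustration, and whatever Google runs in production is certainly far more involved.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Estimate each page's rank from a mapping of page -> pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: 'a' and 'b' both link to 'c', and 'c' links back to 'a'.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(max(ranks, key=ranks.get))  # c
```

Page ‘c’ comes out on top because two pages link to it, and ‘a’ beats ‘b’ because its single inbound link comes from the highly ranked ‘c’.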
I mentioned different search philosophies earlier, but I think that’s probably a matter for a more in-depth post with pictures and code examples and all sorts of genuine content.
If your appetite for search engine information has been whetted but not satisfied, I highly recommend my boss’s podcast about how SEO is a modern myth.