Archive for category machine learning

XKCD Colour Survey – a 3D visualisation

Randall Munroe (of XKCD) has been running a “name the colour” survey for the last few months and today released the data. The results are broken down by gender, and he makes some fascinating obvservations.

I’ve been working on a colour-related project on and off for about a year now, and part of that has involved creating visualisations of the RGB colour space – plotting red on the X axis, blue on Y, green on Z. So I thought it would be interesting to map XKCD’s 954 most common colour names in three dimensions – his charts are nice n’ all, but sexy javascript 3d is, well, sexier:

XKCD Colour Names in 3d

If you click that, you’ll be taken to the demo page, where you can interact with the cube. It’s a bit juddery, working entirely with divs and DOM manipulation, rather than canvas. Here’s a view from the side angle that shows off the “3d-ness” a bit more:

Of course, a canvas is really a better solution, so here’s a version that renders onto canvas, which is much much faster, but currently omits the names:

Read the rest of this entry »

, , , , , ,

7 Comments

The K-Means Clustering Machine Learning Algorithm

The k-means clustering algorithm is one of the simplest unsupervised machine learning algorithms, which can be used to automatically recognise groups of similar points in data without any human intervention or training.

The first step is to represent the data you want to group as points in an n-dimensional space, where n is the number of attributes the data have. For simplicity let’s assume we just want to group the ages of visitors to a website – a one-dimensional space. Let’s assume the set of ages is as follows:

{15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65}

Now there are a number of ways we could separate these points out; one obvious way that springs to mind is simply to iterate over the set and find the largest gap between adjacent ages. For some sets this can work quite nicely, but in our case it would give us quite unbalanced groups; 15-44 and 60-65, the latter containing just 3 points, and the former being far too broadly distributed.

Using k-means clustering we can obtain more tightly defined groups; consider the set {1,2,3,5,6,9} – using the simplistic greatest distance technique we gain two sets {1,2,3,5,6} and {9}, while the k-means algorithm produces a grouping of {1,2,3} and {5,6,9} – more cohesive clusters with less dispersion of points inside the groups – k-means tries to minimise the sum of squares within a cluster:

Notice how in the first set of clusters the outermost points of the first cluster are quite far away from the middle of the cluster, while in the second set, the points in both clusters are closer the center of their groups; making these clusters more well-defined, less sparsely populated.

Now let’s have a look at the algorithm and work through the example age data to see if we can get a tighter grouping than 15-44 and 60-65 using clustering:

Read the rest of this entry »

, , , ,

3 Comments