Fast PHP array_unique for removing duplicates

PHP’s native dedupe function, array_unique, can be slow for large amount of input. Here I’m going to talk about a simple function that performs the same task but which runs a lot faster.

Often, people spout PHP optimisation advice that is incredibly misguided, so I want to make it clear up-front that you should be benchmarking your scripts using xdebug before you start optimising, and very often the real bottlenecks cannot be avoided by using “micro-optimization”. Here’s a nice PDF on Common Optimization Mistakes.

Now I’ve got that disclaimer out of the way, I believe it’s OK to talk about making a faster array_unique because it’s often used on large amounts of data – for example removing duplicate email addresses, and as such there are genuine use cases for a fast array_unique in PHP. Also our fast_unique function can be several seconds faster than array_unique so we’re not necessarily talking about micro-optimisation here; on 200k duplicate entries, array_unique takes nearly 5 seconds on my windows dev box, fast_unique takes <1 second on the same data. On 2 million entires the results are staggering – performance graphs are given below (but then, why are you doing that kind of work in PHP?).

The function looks like this:

function fast_unique($input) {
        // Code documented at: https://puremango.co.uk/?p=1039
        return array_flip(array_flip(array_reverse($input,true)));
}

This will return a copy of the input array with duplicates removed and keys preserved.

When using this function, there are some important caveats you should be aware of:

The data is not returned in the same order as array_unique would return it. You can add a ksort afterwards, but this will reduce the speed savings. Key order preservation is often not required, so I haven’t included that by default. Key association is preserved, just not the ordering.
This only works with strings or integers for keys and values. You could hack together a __toString walk to get it to deal with objects too, and I may do that at some point.
I haven’t yet benchmarked memory consumption; memory_get_peak_usage doesn’t seem to work properly with the native array_unique. Help?
For smaller amounts of data (<50k entries) I would suggest keeping array_unique for code-clarity. fast_unique is still faster than array_unique, but at those levels, you’re talking less than 0.5 seconds difference.

To explain how it works, let’s go through an example: let’s say we start with this array:

0=>apple 1=>biscuit 2=>cabbage 3=>biscuit
When we array_flip it, we try to create an array like this:

apple=>0 biscuit=>1 cabbage=>2 biscuit=>3
but now that “biscuit” is a key it can’t have two values, so the “biscuit=>3″ assignment overwrites the first biscuit value, and we end up with this:

apple=>0 biscuit=>3 cabbage=>2
then we flip it again to get key=>value pairs again instead of value=>keys:

0=>apple 3=>biscuit 2=>cabbage
And there we go – duplicates removed! Now in this example, later keys (i.e. 3=>biscuit) overwrote earlier ones (1=>biscuit) – but with PHP’s native array_unique earlier keys are preserved, so in the fast_unique function above, we reverse the array before flipping to mimic that functionality – if you don’t care about keys then go ahead and strip it out.

I’ve benchmarked this code for speed against various input sizes: 2k,20k,200k,2m, and various levels of uniqueness: 0%, 50% and 100%. An “input size” of 2000 means we fed in an array with 2000 entries whose values are strings between 1 and 200 characters in length. 2 million duplicate entries took too long to generate, so I didn’t benchmark that size for the 0% unique test.

So there you have it; quite dramatic for large amounts of data. I feel I should mention that the “double flip” method is nothing new; there are notes on the PHP docs to the effect that “the array flip method is faster”, but I thought a rigorous benchmarking and documentation was worth doing, so please don’t hate me if you “invented” this years ago ;)

Also, if this is the first time you’ve read about PHP optimisation, do remember that most of the time it’s pointless trying to speed up PHP code. Array_unique is a special case because it’s often used on large arrays (and if your array is smaller than around 50k entries, it’s probably not worth using this function). Read that Common Optimization Mistakes PDF, and my other post on fast PHP for tips on how to correctly approach PHP optimisation.

array_flip, array_reverse, array_unique, code, developer, fast, optimisation, PHP, tech

This entry was posted on June 21, 2010, 12:23 am and is filed under PHP. You can follow any responses to this entry through RSS 2.0. Both comments and pings are currently closed.

#1 by steve on June 25, 2010 - 7:59 am

clever trick – thanks for the full explanation – and the notes on premature optimization – very helpful and informative

#2 by john on July 9, 2010 - 7:21 am

array_keys(array_flip()) is faster

#3 by Howard Yeend on July 9, 2010 - 9:06 am

true, but that doesn’t preserve the keys.

the fast_unique function preserves keys (though not the order of the keys).

And, how much faster? Aren’t we getting close to micro-optimisation here?

#4 by Stefan Frömken on August 17, 2012 - 2:16 pm

Following preserves the keys like array_unique:

$tempArray = array();
foreach($arr as $key => $val) {
if(isset($tempArray[$val])) continue;
$tempArray[$val] = $key;
}
$arrUnique = array_flip($tempArray);

Maybe it will help somebody.

Stefan

puremango.co.uk