Fork me on GitHub

Saturday, August 30, 2008

PHP and Arrays

I’ve been working on this bplustree library for the whole last month; I have to confess I’ve got scared more than a few times, and I was going to fear I couldn’t make it to the end; finally, looks like it’s almost there. Now I only have a talk with my tutor to ask a few things.

php-programming.jpg

Aside: despite all of my best intentions, although I’ve put all of my efforts to make it hopefully backward compatible, this library has been only tested under PHP5, and I can’t assure it will work with PHP4 as well (if at all).

By the way, let’s go straight to the subject of this post, which is meant as a little compendium about PHP performances and issues about arrays.

In PHP there’s no formal distinction between integer-indexed arrays and associative arrays, which are indexed using a string.

$arr = array();
$arr[0] = 42;

works as fine as

$arr = array();
$arr['answer-to-life-universe-and-all'] = 'fourty-two';

The bplustree library I’ve been working on is essentially a port of a python library (bplustree.py by Aaron Watters) I’ve read about on a paper about genome indexing in bio-informatics. As it sounded well tested enough I’ve finally decided to base on that my PHP class.

Python and PHP are of course two fairly different languages.

In PHP we have several functions and ways to deal with arrays; in this post I’ll list some and write a little summary about performance comparisons and “oddities” using what I’ve found on the net, and what I think I’ve found out myself.

Maybe nothing new, but useful as a remind for myself, and maybe for some of you :)

Walking arrays

Dealing with arrays is quite different in PHP and Python; while in PHP you just have arrays, in Python you can choose between lists, dictionaries, tuples… and, yes, even arrays, which - from what I’ve understood - are a completely different beast.

While the PHP approact makes everything much simpler for quick scripting, it might get confusing, and lead to unexpected results.+

For instance, PHP arrays can be naturally sparse.

So it’s perfectly legal defining an empty array and fill it with random numeric keys, as in

$my_arr = array();
$my_arr[11] = 'another value';
$my_arr[5] = 'some value';

Now, how many elements are there in the array? 2 or 11?
The correct and for someone even obvious answer is two. What’s less obvious is which key comes first: does the key 11 come before or after 5.

As I’m asking, you’ve probably figured it out already: 11 comes before 5.

The reason is that in PHP arrays order of insertion in preserved.

So, this code

$array1 = array();
$arr[0] = 0;
$arr[2] = 2;
$arr[1] = 1;

is not really the same as this one

$array2 = array();
$arr[0] = 0;
$arr[1] = 1;
$arr[2] = 2;

although you’ll end up with the same contents: if you print_r($array1) you’ll get them in the order “(0, 2, 1)” instead of the (probably) expected “(0, 2, 1)”.

Iterating over an array

The whole point of this is that there is a difference between iterating over your array using a foreach($arr as $k=>$v) { /* do something */ } (or while/each, see above), which will walk them in the order of insertion, and manually indexing using the C-style for($i=0; $i<$len; $i++){ $v = $arr[$i] } which will get the elements in the intended order.

In order to avoid this behaviour which might be not what you expect you can either

  • ksort() (sort by key) the array before iterating over it (slower)
  • always set the keys in order (not always viable)
  • initialize it before filling, using a fixed number of elements [1]

Initialization can be accomplished using range(0, $N), which will create an array of $N+1 elements, with values from 0 to $N, or use array_fill() (my choice).

Of course the last one will work only for non-associative arrays (non-indexed using strings).

Now, about looping performances: foreach($arr as $k=>$v) is the quicker solution, but while(list($k,$v)=each($arr)) has a smaller memory foot-print, because foreach usually makes a copy of $arr.

Walking a 100.000 elements array ($a=range(0,100000); shuffle($a);) :

  • while/each is an order of magnitude slower than foreach (~0.19 vs ~0.04 secs);
  • if you don’t care either about the order or the $key you can use while/next but
  • using for() (in non-assoc arrays) is an order of magnitude quicker (~0.080 vs. ~0.007) with the added bonus of walking in order, whichever the order of insertion is.

Finally, in our case, having to walk tuples of numerically-indexed values, for is the winner, but remember not to use $i<count($arr) in the ‘for()’ condition, since the count() function would be called at each iteration: remember to call it before entering the loop and using a temporary variable to store the length.

Inserting values

Depending on where the value has to be inserted, you might take different approaches, apart from the most obvious $a[’k']=’v’, which in fact append the key,value at the bottom (as seen in the ‘walking arrays’ paragraph).

First of all, let’s start with the numerically-indexed array.

  • If you have to put an element on top of an array, pushing the other values down, use array_unshift(&$array, $value); the new element will be at index 0, while the older index 0 will be now at 1.
  • if you have to put an element at the bottom of your array, do not use array_push(&$array, $value), and instead use $array[]=$value which is twice as fast (~0.144 vs. ~0.075); using array_push() seems to be as slow as using $arr[count($arr)]

Those will work for associative arrays as well, but they will return a new array where keys are integers.

The function to call for the more general case is array_splice(&$arr, $pos, 0, array($value)) or array_splice(&$arr, $pos, 0, $value) in case you’re sure that $value is a scalar, that is not an array and not an object, otherwise every element of the array or every property of the object would be inserted at the offset $pos (and counting) in $arr, shifting to the right the elements following $pos (if any). Wrapping $value into a singleton (an array()) as in the first example “fixes” this behaviour, if it’s not what you want (you can insert more than one element at once).

The third argument, which we’ll see later, targets elements to be deleted.

array_splice($arr, 0, 0, $val) is generally slower than array_unshift() and similarily for array_splice($arr, count($arr), 0, $val) vs. $arr[]=$val;

If you just have to insert a value into an associative array, you just index it, the usual way ($arr[’my-key’] = ‘my val’).

To add values to associative arrays you just set it $arr[’key’]=’val’;

If you want insert a value with a given string key, at a given position (remember that arrays in php have an internal order), you can’t use array_splice because

array_splice($arr, $n, 0, array($key=>$val))

won’t put $val at the $n-th offset with the key $key, but it would change $key into a 0 and putting it after $pos (despite what O’Reilly’s “Programming PHP” 2nd ed.” says :p).

You might want to use instead this function found in the comments of the online manual

function array_insert (&$array, $position, $insert_array) {
  // array_splice returns the slice of array 
  // deleted from the first array, see next section
  $first_array = array_splice ($array, 0, $position); 
  $array = array_merge ($first_array, $insert_array, $array);
}

If you need it in alphabetical order, you can also set the key,val couple and then sort the array by key with ksort; I didn’t check what would be quicker, though. I suppose that a good bisection algorithm coupled with this array_insert would still be quicker than the ksort() approach.

Deleting values

array_splice($arr, $pos, $count) deletes $count values in $arr starting from $pos and returns the deleted values. However, reading the comments of the online PHP manual about array_splice, unset($arr[$offset]) seems to be quicker than splice, but works in a different way: array_splice reindexes the array, while unset() just remove the couple (key, value), leaving a gap.

$arr = array(1,2,3,4);
array_splice($arr, 2, 1); // $arr == array(0=>1, 1=>3, 2=>4);
// but:
$arr = array(1,2,3,4);
unset($arr[3]); // $arr == array(0=>1, 2=>3, 3=>4);

If you do mind about this gap (i.e. you don’t have an associative array and/or you don’t walk it with foreach or while/each) you can still get some improvements in speed, re-indexing the array by using $arr = array_values($arr), which would return a numerically-indexed array containing only the actual values (so don’t use it with associative arrays).

Working with tuples

In Python there is this nifty thing called ‘tuples’, which is just what it would sound like in math; in PHP, again, we only have arrays; however quite surprisingly (for me), the Python code using pairs worked almost the same using two-element arrays; where python did

 
# create a pair
pair = (1, 2)
# shorthand for 
# a = pair[0]; b=pair[1]
a, b = (1, 2) 

PHP could do

$pair = array(1, 2);
list($a, $b) = $pair 

and comparisons worked, too!

 (1,2) > (0,1) # true
array(1,2) > array(0,1) # true

…even though PHP is obviously much longer to type.

I made sure that the code worked as intended first using this 1:1 translation.

Pairs were only used to define (string-key, value) relationships in a b-tree node and then pass a list of them along; the order relation between pairs didn’t need to be bi-dimensional.

In fact this is what you get when you do (a,b)>(c,d): the relation “>” has got an answer, whatever a,b,c,d are, even if a==c; supposing you’re mapping points in a 2d plan it is important to know that (for instance) (1,5)>(1,4) is true, but a b-tree is supposed NOT to contain duplicate key (key is pair[0]), so we don’t have to deal with the case

(K, value_1)>(K, value_2)

as value_2 would in turn replace value_1.

Moreover, the code

array(K_a,a)>array(K_b,b)

is significantly slower than doing only K_a>K_b, and since only the string keys were really relevant in the comparison, I’ve created a class wrapping around two distinct arrays, one for the keys and one for the values and the code was reduced into performing comparisons only in the indices array, and then reflecting changes/moves/deletions into the array of the values.

Conclusion

Well, I hope all of this has been useful to you.

Have fun :)

[footnotes]

[1] of course remember that if you’ll overflow beyond this number, then you’ll get the usual insertion-order behaviour