A Tale of Two Languages

Recently I was talking to a colleague who is still in school and he mentioned a new blog post he had written about string search. He had implemented a brute force search and so I told him about faster algorithms that he should consider, specifically mentioning the Knuth-Morris-Pratt algorithm. He read up on it at Wikipedia and wrote a Perl implementation, but when we ran it I was shocked to discover that KMP was significantly slower than brute force! This made no sense. Further experimentation showed that Perl’s index function was even faster still!

When I wrote up an implementation of KMP and brute force in C, I got results that were in line with my expectations: KMP in C was much faster than brute force.

So what is going on?

What’s happening is a result of the way Perl implements it’s data types. They do not, after all, correspond to the primitive types available in C/C++, Java, or other languages. In particular a Perl “array” is actually a complex data structure that is manipulated by a set of non-trivial functions, while a string is a less complex (though still complex in it’s own right) structure, so it really makes sense that string processing would be faster than array processing in Perl. Furthermore, Perl’s index function implements the Boyer-Moore string search algorithm, which is even faster than KMP, so the speed of the index function should come as no real surprise.

The take away lesson here is that even when you know of a good algorithm for solving a problem, you should not rush to write it up. A lot of very smart people have spent their time creating and fine tuning the languages and libraries that we use, so it is quite likely that someone has already done the work you need, and there may be a better alternative to your own solution built right in to your language. Check it out before wasting your time reinventing the wheel!

Update, August 18, 2015

There is actually a second lesson here, which may be even more important. Just because the language you are using gives you a data structure that has an array like interface, that doesn’t mean it is actually an array, and if it isn’t an actual array, then standard algorithmic analysis techniques may not apply to it! This is the case for Perl’s “arrays”, and also for lists in Python, Lisp, and other languages. This is why KMP is slower than brute force in Perl: the assumptions made in the standard analysis of the algorithms is that we are working with real arrays – a group of primitive data elements stored in contiguous locations in memory which guarantees us constant time lookup of any arbitrary data element with a very small time per operation. Perl arrays fail on all counts, here.

kmp.c – Test program in C.
kmp.pl – Test program in Perl.
A Tale of Two Cities – Some sample data for searching.