This article is a mess, and people need to stop voting it up. For example, it contains this perplexing passage:
"Let’s say we have the string “hello world”, and let’s assume that its hash is hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern “he” is contained in the text “hello world”"
What the heck is he talking about here?
His hash function is O(m), where m is the length of the pattern being searched for. This makes the implementation trivially O(mn), which he notes, writing in another passage:
"The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. [...] However it’s considered that Rabin-Karp’s complexity is O(n+m) in practice"
Next time I submit a code review, I'm going to "consider" my complexity this way.
This isn't a good guide to Rabin-Karp, which is effectively discussed on Wikipedia, in Cormen, etc. It is probably a very effective way of attracting ad dollars for the banners all over the page.
Hmm? Expected running time is a valid and important concept. In Quicksort, for example, picking the pivot element randomly results in O(n^2) runtime in the worst case, but an expected runtime of O(n log(n)). The O(n log(n)) bound can be enforced using the median-of-medians method for pivot selection, but that's a pain to program and has a much higher constant factor than generating a random number; the only benefit being that theoretical computer scientists sleep well at night.
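As a concrete sketch of that tradeoff (Python; a toy illustration, not a tuned implementation):

```python
import random

def quicksort(a):
    """Quicksort with a uniformly random pivot.

    Worst case is still O(n^2), but the expected running time over
    the random pivot choices is O(n log n) for *any* input -- no
    adversarial input can force the bad case reliably.
    """
    if len(a) <= 1:
        return a
    pivot = random.choice(a)  # random pivot => expected O(n log n)
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)
```

Swapping `random.choice` for a deterministic median-of-medians selection would pin down the worst case at the cost of a much larger constant.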
It lacks an important detail: with an arbitrary hash function, the complexity is O(n * m), just as in the naive implementation. You need a rolling hash function, where you can compute the hash of the next substring in constant time.
The implementation is also wrong. First of all, it gets that detail wrong, so it is O(n * m). But the truly broken part is the comparison: it only compares hashes, never the actual substrings, so a hash collision produces a false match.
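For reference, a minimal corrected sketch in Python -- the base and modulus here are arbitrary choices of mine -- that rolls the hash in O(1) and verifies each candidate match against the actual characters:

```python
def rabin_karp(text, pattern):
    """Return the start indices of all occurrences of pattern in text.

    Uses a polynomial rolling hash so each window's hash is updated in
    O(1), and verifies candidates character-by-character so collisions
    never cause false positives. Expected O(n + m); worst case O(n * m)
    under adversarial collisions.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    B, M = 256, 1_000_000_007        # base and modulus (arbitrary choices)
    high = pow(B, m - 1, M)          # weight of the character leaving the window

    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * B + ord(pattern[i])) % M
        w_hash = (w_hash * B + ord(text[i])) % M

    hits = []
    for i in range(n - m + 1):
        # Only compare actual characters when the hashes already agree.
        if w_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:  # roll the window forward by one character
            w_hash = ((w_hash - ord(text[i]) * high) * B + ord(text[i + m])) % M
    return hits
```

The slice comparison only runs when the hashes agree, so a collision costs O(m) but can never report a false match.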
For plain string search, it's definitely much slower than Boyer-Moore or KMP.
... but the cool thing about Karp-Rabin is that it generalizes well to different problems.
* 2D pattern matching (requires a little bit of thinking to implement sliding hash blocks quickly).
* matching against a set of equal-length strings instead of a single string, by replacing the hash equality check with a set containment check. Aho-Corasick, a generalization of KMP to this problem, will outperform this, though. Note that this composes with the previous modification: you can do 2D pattern matching against different sets of 2D patterns, as long as they all have the same dimensions.
* You can use it to build a data structure that quickly answers queries like whether substring(a, b) == substring(c, d).
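The structure in that last bullet can be sketched with prefix hashes (Python; the class name and constants are my own illustration -- and without verification, "equal" here means equal with high probability):

```python
class SubstringHasher:
    """Answer "is s[a:b] == s[c:d]?" queries in O(1) after O(n) prep.

    Stores prefix polynomial hashes: the hash of s[a:b] is
    pre[b] - pre[a] * B^(b-a) (mod M). Matching hashes are treated
    as matching substrings, which holds with high probability.
    """
    def __init__(self, s, B=256, M=1_000_000_007):
        self.B, self.M = B, M
        n = len(s)
        self.pre = [0] * (n + 1)   # pre[i] = hash of s[:i]
        self.pw = [1] * (n + 1)    # pw[i] = B^i mod M
        for i, ch in enumerate(s):
            self.pre[i + 1] = (self.pre[i] * B + ord(ch)) % M
            self.pw[i + 1] = (self.pw[i] * B) % M

    def hash(self, a, b):
        """Hash of the half-open slice s[a:b]."""
        return (self.pre[b] - self.pre[a] * self.pw[b - a]) % self.M

    def equal(self, a, b, c, d):
        return (b - a) == (d - c) and self.hash(a, b) == self.hash(c, d)
```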
My favorite use is picking content-aware variable block sizes for a block-deduping file system. Imagine that before you write a block, you compute a checksum and re-use an existing block if you've seen it already. You want to pick block boundaries so that you don't store redundant copies of content you already have. Imagine you have a 1GB file A, and a modified version A' that has a few characters inserted at the beginning. Ideally, you would reuse most of the blocks between the two files, since they are basically identical. The naive strategy of cutting a block every 8k, say, isn't going to re-use any of the contents, because the two files aren't lined up. Instead, imagine you keep a rolling hash over a window of length 16. Whenever the rolling hash mod 8k == 0, you start a new block. Now you'll miss some of the commonality early in the file, but eventually the same window will cause the blocks to restart at an aligned place in A and A', so you'll only store one copy of most of the data. See section 3.1.1 in this paper for more details: http://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf
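A toy sketch of that boundary-picking rule (Python; the window/average constants mirror the numbers above, everything else is my own illustration):

```python
def chunk_boundaries(data, window=16, avg=8192):
    """Content-defined chunk boundaries via a rolling hash.

    Maintains a polynomial hash of the last `window` bytes and cuts a
    new block whenever hash % avg == 0, so boundaries depend on local
    content rather than absolute file offsets (~`avg`-byte blocks).
    """
    B, M = 257, (1 << 61) - 1   # base and Mersenne-prime modulus
    Bw = pow(B, window, M)      # weight of the byte leaving the window
    h = 0
    cuts = []
    for i, byte in enumerate(data):
        h = (h * B + byte) % M
        if i >= window:                       # drop the byte that slid out
            h = (h - data[i - window] * Bw) % M
        if i + 1 >= window and h % avg == 0:
            cuts.append(i + 1)                # block boundary after byte i
    return cuts
```

Because a boundary depends only on the bytes in the window, inserting a prefix shifts the early cuts but leaves later cut points at the same content positions, which is exactly what lets A and A' share most blocks.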
Those are for matching a single pattern against a text. The intended usage of Rabin-Karp is to match a set of patterns against a text, more efficiently than running Boyer-Moore or KMP on each individual pattern. An algorithm with a similar use-case to Rabin-Karp is the Aho-Corasick algorithm, which is very simple to learn if you know KMP already - the preprocessing of the set of patterns is just a generalization of the pattern preprocessing in KMP.
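A compact sketch of that trie-plus-failure-links construction (Python; my own illustration):

```python
from collections import deque

def aho_corasick(text, patterns):
    """Match a set of patterns against text in one pass.

    Builds a trie of the patterns, adds failure links by BFS (the KMP
    prefix function, generalized to a trie), then scans the text once.
    Returns (start_index, pattern) pairs for every occurrence.
    """
    goto, fail, out = [{}], [0], [[]]   # one entry per trie node
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            node = goto[node][ch]
        out[node].append(pat)
    q = deque(goto[0].values())         # depth-1 nodes keep fail = 0
    while q:
        node = q.popleft()
        for ch, nxt in goto[node].items():
            q.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches ending here
    hits, node = [], 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]           # fall back on mismatch
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

The scan is O(n) plus one output per match, independent of how many patterns are loaded into the trie.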
If I want to match multiple patterns at the same time, wouldn't it be more efficient to build a (possibly tagged) DFA from them? Then the performance would be independent from the number of patterns to match (modulo some startup cost to construct the DFA in the first place).
Maybe you were thinking of something like Aho-Corasick or Knuth-Morris-Pratt or Boyer-Moore? We really need better names for these algorithms. For example, by the authority vested in me by my own hubris, I hereby rename Rabin-Karp to "rolling hash search". Aho-Corasick is now "substring-trie search". Much easier to remember!
There is a string search based on the BWT which is used for most DNA read mapping these days due to its favorable performance compared to existing hash based algorithms.
The company I work at (in HPC) was in negotiations to implement it for one of those big plant-biotech firms long before I was around, or so I've been told.
General alignment is definitely quite different; however, the vast majority of sequence data produced these days is short reads. Read-mapping is a much lower bar and more similar to string search than proper sequence alignment. The BWT-based search [1] is a bit more practical with the small alphabet size of DNA than with most human text, though.
"Let’s say we have the string “hello world”, and let’s assume that its hash is hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern “he” is contained in the text “hello world”"
What the heck is he talking about here?
His hash function is O(m), where m is the length of the pattern being searched for. This makes the implementation trivially O(mn), which he notes, writing in another passage:
"The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. [...] However it’s considered that Rabin-Karp’s complexity is O(n+m) in practice"
Next time I submit a code review, I'm going to "consider" my complexity this way.
This isn't a good guide to Rabin-Karp, which is effectively discussed on Wikipedia, in Cormen, etc. It is probably a very effective way of attracting ad dollars for the banners all over the page.
Avoid, avoid.