This article is a mess, and people need to stop voting it up. For example, it contains this perplexing passage:
"Let’s say we have the string “hello world”, and let’s assume that its hash is hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern “he” is contained in the text “hello world”"
What the heck is he talking about here?
His hash function is O(m), where m is the length of the pattern being searched for. This makes the implementation trivially O(mn), which he notes, writing in another passage:
"The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. [...] However it’s considered that Rabin-Karp’s complexity is O(n+m) in practice"
Next time I submit a code review, I'm going to "consider" my complexity this way.
This isn't a good guide to Rabin-Karp, which is effectively discussed on Wikipedia, in Cormen, etc. It is probably a very effective way of attracting ad dollars for the banners all over the page.
Hmm? Expected running time is a valid and important concept. In Quicksort, for example, picking the pivot element randomly results in O(n^2) runtime in the worst case, but an expected runtime of O(n log(n)). The O(n log(n)) bound can be enforced using the median-of-medians method for pivot selection, but that's a pain to program and has a much higher constant factor than generating a random number; the only benefit being that theoretical computer scientists sleep well at night.
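As a concrete sketch of that tradeoff (Python; a toy illustration, not a tuned implementation):

```python
import random

def quicksort(a):
    """Quicksort with a uniformly random pivot.

    Worst case is still O(n^2), but the expected running time over
    the random pivot choices is O(n log n) for *any* input -- no
    adversarial input can force the bad case reliably.
    """
    if len(a) <= 1:
        return a
    pivot = random.choice(a)  # random pivot => expected O(n log n)
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)
```

Swapping `random.choice` for a deterministic median-of-medians selection would pin down the worst case at the cost of a much larger constant.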
It lacks an important detail: with an arbitrary hash function, the complexity is O(n * m), just as in the naive implementation. You need a rolling hash function, where you can compute the hash of the next substring in constant time.
The implementation is also wrong. First of all, it gets that detail wrong, so it is O(n * m). But the truly broken part is the comparison: it only compares hashes, never the actual substrings, so a hash collision produces a false match.
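For reference, a minimal corrected sketch in Python -- the base and modulus here are arbitrary choices of mine -- that rolls the hash in O(1) and verifies each candidate match against the actual characters:

```python
def rabin_karp(text, pattern):
    """Return the start indices of all occurrences of pattern in text.

    Uses a polynomial rolling hash so each window's hash is updated in
    O(1), and verifies candidates character-by-character so collisions
    never cause false positives. Expected O(n + m); worst case O(n * m)
    under adversarial collisions.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    B, M = 256, 1_000_000_007        # base and modulus (arbitrary choices)
    high = pow(B, m - 1, M)          # weight of the character leaving the window

    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * B + ord(pattern[i])) % M
        w_hash = (w_hash * B + ord(text[i])) % M

    hits = []
    for i in range(n - m + 1):
        # Only compare actual characters when the hashes already agree.
        if w_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:  # roll the window forward by one character
            w_hash = ((w_hash - ord(text[i]) * high) * B + ord(text[i + m])) % M
    return hits
```

The slice comparison only runs when the hashes agree, so a collision costs O(m) but can never report a false match.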
For plain string search, it's definitely much slower than Boyer-Moore or KMP.
... but the cool thing about Karp-Rabin is that it generalizes well to different problems.
* 2D pattern matching (requires a little bit of thinking to implement sliding hash blocks quickly).
* matching against a set of equal-length strings instead of a single string, by replacing the hash equality check with a set containment check. Aho-Corasick, a generalization of KMP to this problem, will outperform this, though. Note that this composes with the previous modification: you can do 2D pattern matching against different sets of 2D patterns, as long as they all have the same dimensions.
* You can use it to build a data structure that quickly answers queries like whether substring(a, b) == substring(c, d).
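The structure in that last bullet can be sketched with prefix hashes (Python; the class name and constants are my own illustration -- and without verification, "equal" here means equal with high probability):

```python
class SubstringHasher:
    """Answer "is s[a:b] == s[c:d]?" queries in O(1) after O(n) prep.

    Stores prefix polynomial hashes: the hash of s[a:b] is
    pre[b] - pre[a] * B^(b-a) (mod M). Matching hashes are treated
    as matching substrings, which holds with high probability.
    """
    def __init__(self, s, B=256, M=1_000_000_007):
        self.B, self.M = B, M
        n = len(s)
        self.pre = [0] * (n + 1)   # pre[i] = hash of s[:i]
        self.pw = [1] * (n + 1)    # pw[i] = B^i mod M
        for i, ch in enumerate(s):
            self.pre[i + 1] = (self.pre[i] * B + ord(ch)) % M
            self.pw[i + 1] = (self.pw[i] * B) % M

    def hash(self, a, b):
        """Hash of the half-open slice s[a:b]."""
        return (self.pre[b] - self.pre[a] * self.pw[b - a]) % self.M

    def equal(self, a, b, c, d):
        return (b - a) == (d - c) and self.hash(a, b) == self.hash(c, d)
```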
My favorite use is picking content-aware variable block sizes for a block-deduping file system. Imagine that before you write a block, you compute a checksum and re-use an existing block if you've seen it already. You want to pick block boundaries so that you don't store redundant copies of content you already have. Imagine you have a 1GB file A, and a modified version A' that has a few characters inserted at the beginning. Ideally, you would reuse most of the blocks between the two files, since they are basically identical. The naive strategy of cutting a block every 8k, say, isn't going to re-use any of the contents, because the two files aren't lined up. Instead, imagine you keep a rolling hash over a window of length 16. Whenever the rolling hash mod 8k == 0, you start a new block. Now you'll miss some of the commonality early in the file, but eventually the same window will cause the blocks to restart at an aligned place in A and A', so you'll only store one copy of most of the data. See section 3.1.1 in this paper for more details: http://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf
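A toy sketch of that boundary-picking rule (Python; the window/average constants mirror the numbers above, everything else is my own illustration):

```python
def chunk_boundaries(data, window=16, avg=8192):
    """Content-defined chunk boundaries via a rolling hash.

    Maintains a polynomial hash of the last `window` bytes and cuts a
    new block whenever hash % avg == 0, so boundaries depend on local
    content rather than absolute file offsets (~`avg`-byte blocks).
    """
    B, M = 257, (1 << 61) - 1   # base and Mersenne-prime modulus
    Bw = pow(B, window, M)      # weight of the byte leaving the window
    h = 0
    cuts = []
    for i, byte in enumerate(data):
        h = (h * B + byte) % M
        if i >= window:                       # drop the byte that slid out
            h = (h - data[i - window] * Bw) % M
        if i + 1 >= window and h % avg == 0:
            cuts.append(i + 1)                # block boundary after byte i
    return cuts
```

Because a boundary depends only on the bytes in the window, inserting a prefix shifts the early cuts but leaves later cut points at the same content positions, which is exactly what lets A and A' share most blocks.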
Those are for matching a single pattern against a text. The intended usage of Rabin-Karp is to match a set of patterns against a text, more efficiently than running Boyer-Moore or KMP on each individual pattern. An algorithm with a similar use-case to Rabin-Karp is the Aho-Corasick algorithm, which is very simple to learn if you know KMP already - the preprocessing of the set of patterns is just a generalization of the pattern preprocessing in KMP.
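A compact sketch of that trie-plus-failure-links construction (Python; my own illustration):

```python
from collections import deque

def aho_corasick(text, patterns):
    """Match a set of patterns against text in one pass.

    Builds a trie of the patterns, adds failure links by BFS (the KMP
    prefix function, generalized to a trie), then scans the text once.
    Returns (start_index, pattern) pairs for every occurrence.
    """
    goto, fail, out = [{}], [0], [[]]   # one entry per trie node
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            node = goto[node][ch]
        out[node].append(pat)
    q = deque(goto[0].values())         # depth-1 nodes keep fail = 0
    while q:
        node = q.popleft()
        for ch, nxt in goto[node].items():
            q.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches ending here
    hits, node = [], 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]           # fall back on mismatch
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

The scan is O(n) plus one output per match, independent of how many patterns are loaded into the trie.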
If I want to match multiple patterns at the same time, wouldn't it be more efficient to build a (possibly tagged) DFA from them? Then the performance would be independent from the number of patterns to match (modulo some startup cost to construct the DFA in the first place).
Maybe you were thinking of something like Aho-Corasick or Knuth-Morris-Pratt or Boyer-Moore? We really need better names for these algorithms. For example, by the authority vested in me by my own hubris, I hereby rename Rabin-Karp to "rolling hash search". Aho-Corasick is now "substring-trie search". Much easier to remember!
There is a string search based on the BWT which is used for most DNA read mapping these days due to its favorable performance compared to existing hash based algorithms.
The company I work at (in HPC) was in negotiations to implement it for one of those big plant-biotech firms long before I was around, or so I've been told.
General alignment is definitely quite different; however, the vast majority of sequence data produced these days is short reads. Read-mapping is a much lower bar and more similar to string search than proper sequence alignment. The BWT-based search [1] is a bit more practical with the small alphabet size of DNA than with most human text, though.
"Let’s say we have the string “hello world”, and let’s assume that its hash is hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern “he” is contained in the text “hello world”"
What the heck is he talking about here?
His hash function is O(m), where m is the length of the pattern being searched for. This makes the implementation trivially O(mn), which he notes, writing in another passage:
"The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. [...] However it’s considered that Rabin-Karp’s complexity is O(n+m) in practice"
Next time I submit a code review, I'm going to "consider" my complexity this way.
This isn't a good guide to Rabin-Karp, which is effectively discussed on Wikipedia, in Cormen, etc. It is probably a very effective way of attracting ad dollars for the banners all over the page.
Avoid, avoid.