CS 1501
String Matching Summary
Basic Idea:
We are searching for a pattern
string, P, of length M within a text string,
A, of length N. Our search will be
successful if, starting at some location i and proceeding to location i+M-1 in
the text, the characters in the text match the corresponding characters from
location 0 to location M-1 in the pattern.
Uses:
Searching the Web, searching an online library
catalog, searching for words in documents, etc.
Brute Force (Naive) Algorithm:
Conceptually, this algorithm is quite simple. Line up the pattern under the text and see
if they match (character by character, starting at the beginning of both the
text and the pattern). If they do not
match, slide the pattern to the right one position and try again. The code (see correction mention on my Web
page announcements) is given in the Sedgewick text.
This algorithm performs quite well in normal
circumstances, especially with alphanumeric files, and usually will have a run
time proportional to M + N. However, in
the worst case it can have poor performance, with run time proportional to
MN. This occurs when the pattern and
text both contain a long repeated sequence of a single character. For example:
text string, A:     aaaaaaaaaaaaaaaaaaab
pattern string, P:  aaaab
Now four matched comparisons and one mismatched
comparison will be done each time the pattern is slid a single location down
the text. Noting that N = 20 and M = 5,
to find P would require:
5 comps (starting from i =
0) + 5 comps (from i = 1) + ... + 5 comps (from i = 15) =
5 * 16 = 80 comparisons.
We can generalize this to M(N-M+1) comparisons, which gives a
worst case proportional to MN (if, as is normally the case, N is much larger
than M).
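The slide-and-compare procedure above can be sketched directly; this is a minimal Python version (the function name and the convention of returning -1 on failure are illustrative choices, not taken from the Sedgewick code):

```python
def brute_force_search(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # each candidate alignment of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # all M characters matched
            return i
    return -1

# The worst case from the example above: ~M(N-M+1) = 80 comparisons
print(brute_force_search("a" * 19 + "b", "aaaab"))  # -> 15
```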
Improvements:
The problem in the worst case with the brute force
algorithm is that it rechecks many characters, since it backs up i (in the
text) and j (in the pattern) whenever a mismatch is detected. However, this backing up is necessary to be
sure that a potential match is not missed.
If we can in some way "remember" the characters we have
already compared, we may be able to avoid rechecking them and still not miss
any potential matches.
Knuth Morris Pratt (KMP):
The idea behind the KMP algorithm is to preprocess
the pattern, so that, upon a mismatch, we can determine exactly how far we can
move the pattern without missing a potential match. This preprocessing is done by comparing the pattern
against itself, finding which prefixes of the pattern reappear within the pattern. Let's look at two extreme cases:
1) The pattern "abcde" has no sub-patterns, and thus allows us to slide the
leading 'a' up to the mismatched character for any mismatch. We then start
comparing again at the beginning of the pattern.
A:    aababcabcdabcde
P:    abcde                  1 comp before mismatch
       abcde                 2 comps before mismatch
         abcde               3 comps before mismatch
            abcde            4 comps before mismatch
                abcde        5 comps to match
2) The pattern "aaaab" has 'a' repeated 4 times, so, assuming we have a
mismatch at 'b', we cannot slide down more than one space. However, since we
know the 'a's before the mismatched 'b' all match the text string, we can
resume comparing at the point of the mismatch, rather than going back to the
beginning of the pattern.
A:    aaaaaaaaaaaaaab
P:    aaaab                  4 comps before mismatch
       aaaab                 1 comp before mismatch
        aaaab                1 comp before mismatch
         ...                 ...
                aaaab        2 comps to match
The KMP algorithm knows where to resume comparing
within the pattern by using a preprocessed next
array. The next array contains the
index values on which to resume comparing (in the pattern) given a mismatch at
each location in the pattern. For
example, in example 1) above, next[4] = 0, meaning that after a mismatch at
index 4 of the pattern (P[4] != A[10] on the second last line of the table), we
next compare index 0 of the pattern.
The index in the text is not changed, so we simply recompare the same
character again (P[0] == A[10] on the last line of the table).
KMP has overhead associated with creating the next
array, and, except in extreme cases (example 2) above) does not perform much
better than the Brute Force algorithm.
However, it is Theta(M+N) in the worst case and it does avoid backing up
in the text string. This "forward
only" property is potentially valuable if a text string is being processed
in real time (e.g., read as it is transmitted).
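A sketch of KMP in Python. One caveat: rather than the next array described above, this version uses the equivalent standard failure (border-length) table, which records for each prefix the length of its longest proper prefix that is also a suffix; the effect is the same, and the text index i never moves backward:

```python
def kmp_search(text, pattern):
    """KMP search: return index of first match, or -1. Never backs up in text."""
    m = len(pattern)
    # fail[j] = length of the longest proper border of pattern[:j+1]
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    # Scan the text left to right; on a mismatch, slide the pattern
    # via the failure table instead of backing up i.
    j = 0
    for i, c in enumerate(text):
        while j > 0 and c != pattern[j]:
            j = fail[j - 1]          # i is unchanged: we recompare c
        if c == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1
    return -1
```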
Boyer Moore (BM)
The basic idea of the Boyer Moore algorithm is
similar to that of KMP: do preprocessing to allow more skipping in the event of
a mismatch. However, BM takes a
different approach: start comparing the pattern from the end rather than the
beginning. In this way, if the pattern
is being compared with a part of the text that it does not match at all, large
skips can be made with very few comparisons.
For example,
A:    wxyzzyxwssabcde
P:    abcde                  0 comps before mismatch
           abcde             0 comps before mismatch
                abcde        5 comps to match
In the example above, 'e' and 'z' mismatch at the
first comparison, so the pattern can be slid down its entire length;
A[0] through A[3] are never even examined.
This is valid because no part of the pattern can possibly match the 'z',
so we can resume just after the 'z'.
However, this is not always the case.
If the mismatched character in the text matches some other character in
the pattern, we cannot skip as far, since the pattern may match after a small
shift. For example,
A:    xyxyxxyxyyxyxyz
P:    xyxyz                  0 comps before mismatch
        xyxyz                0 comps before mismatch
         xyxyz               0 comps before mismatch
           xyxyz             0 comps before mismatch
            xyxyz            0 comps before mismatch
              xyxyz          0 comps before mismatch
                xyxyz        5 comps to match
In the example above, 'z' (P[4]) and 'x' (A[4])
mismatch at the first comparison.
However, the mismatched character in the text string, 'x', appears in
the pattern 2 places previous to the mismatch, so we can slide the pattern over
only 2 spaces. Similarly, in the next
mismatch, P[4] != A[6], A[6] is 'y', and 'y' appears 1 place previous to the
mismatch, so we can slide the pattern over only 1 space. The remaining "slides" are shown
in the rest of the table. Keep in mind
that we always start the comparisons with P[4].
The preprocessing done with the Boyer Moore
algorithm initializes each location in a skip
array (indexed on the entire alphabet used) to M. It then looks at the pattern and lowers the skip value of letters
appearing in the pattern to ensure that a potential match will not be
missed. Note that in the Boyer Moore
algorithm, the character used to determine the skip is the letter in the text,
while in the KMP algorithm, the character used is the letter in the
pattern. However, the "mismatched
character" heuristic here is just one of two parts to the actual Boyer
Moore algorithm. The second part (or
first part, if you go by the order that they are presented in the book)
calculates a next array in a manner
similar to that of KMP. The
"meta" Boyer Moore algorithm would choose whichever of the two
results that provides the largest net "skip". The "meta" Boyer Moore algorithm
is guaranteed to run in Theta(M+N) time.
In practice, Boyer Moore is often much faster than
KMP, since it can skip over large sections of text with very few comparisons
(KMP must look at each character in the text at least once; Boyer Moore skips
many characters in the text entirely).
In favorable conditions, the run time of Boyer Moore approaches
Theta(N/M), since up to M characters can be eliminated per
comparison. However, Boyer Moore
accesses the text from right to left, so it may not be practical in some
situations.
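A sketch of the mismatched-character heuristic alone, in Python. This is a Horspool-style simplification in which the skip is always taken from the text character aligned under the last pattern position; the full Boyer Moore algorithm would also consult the KMP-like next array and take the larger skip:

```python
def bm_search(text, pattern):
    """Boyer Moore search using only the mismatched-character heuristic."""
    n, m = len(text), len(pattern)
    # skip[c] = how far to slide when c sits under the last pattern position;
    # characters not in the pattern keep the default skip of M.
    skip = {}
    for j, c in enumerate(pattern[:-1]):     # later occurrences overwrite earlier
        skip[c] = m - 1 - j
    i = m - 1                                # text index under P[M-1]
    while i < n:
        j, k = m - 1, i
        while j >= 0 and text[k] == pattern[j]:   # compare right to left
            j -= 1
            k -= 1
        if j < 0:                            # matched all M characters
            return k + 1
        i += skip.get(text[i], m)            # slide by the skip for text[i]
    return -1
```

On the last example above (pattern "xyxyz"), this table gives skip['x'] = 2 and skip['y'] = 1, reproducing the slides of 2 and 1 shown in the trace.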
Rabin Karp (RK)
The Rabin Karp algorithm takes a completely
different approach to string matching.
RK uses an approach closely related to hashing, and has the following
basic idea:
Let h1 = a hash of the pattern, using a good hashing function for strings
Let h2 = a hash of the first M characters of the text, using the same hashing function
While (not found and not at the end of the text)
{
    if (h1 == h2)
        if (characters in pattern match characters in text)
            found = true
    if (not found)
        incrementally update h2 by removing the leftmost character and adding the next character
}
In this way, no actual character comparisons are
done unless the hash values match. From
our knowledge of hashing, we know that collisions can occur, so h1 could equal
h2 even if the substrings do not match.
However, if we pick a good hash function and a very large table size,
the chance of a collision will be very low.
Since we are not actually storing anything, we can make our "table
size" a very large prime number.
We can also calculate our hash values using the same technique discussed
previously (using Horner's method), which works in an incremental fashion. Thus, the Rabin Karp algorithm will be
Theta(M+N) in the average case (with the worst case of Theta(MN) occurring only
if there are a lot of collisions, which is highly unlikely).
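The loop above can be sketched as follows in Python; the radix 256 and the prime 2^31 - 1 are illustrative choices for the hash, and the explicit character comparison guards against the (unlikely) collisions mentioned above:

```python
def rk_search(text, pattern):
    """Rabin Karp search with a rolling (Horner's method) hash mod a large prime."""
    Q = 2_147_483_647        # large prime "table size"
    R = 256                  # radix: treat characters as base-256 digits
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    RM = pow(R, m - 1, Q)    # R^(M-1) mod Q, used to remove the leftmost char
    h1 = h2 = 0
    for j in range(m):       # Horner's method for both initial hashes
        h1 = (h1 * R + ord(pattern[j])) % Q
        h2 = (h2 * R + ord(text[j])) % Q
    for i in range(n - m + 1):
        # Characters are compared only when the hashes agree.
        if h1 == h2 and text[i:i + m] == pattern:
            return i
        if i < n - m:        # roll the hash: drop text[i], append text[i+m]
            h2 = ((h2 - ord(text[i]) * RM) * R + ord(text[i + m])) % Q
    return -1
```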
Conclusions
After seeing four different algorithms for the same task, it is natural to wonder which is the best. As is the case with many algorithms, the answer is not always the same. For very long text strings (e.g., a large document like a textbook) the Boyer Moore algorithm tends to be the fastest. However, the preprocessing required for BM negates its comparison advantages for shorter text strings. In those cases, the Rabin Karp algorithm (with a quickly calculated hash function) may be preferred. Also recall that the simple Brute Force algorithm tends to work well in normal circumstances, so if the text file is not too long, and a worst case scenario is not likely to occur, it can also be used with good results.