CS 1501

String Matching Summary

 

Basic Idea:

We are searching for a pattern string, P, of length M within a text string, A, of length N.  Our search will be successful if, starting at some location i and proceeding to location i+M-1 in the text, the characters in the text match the corresponding characters from location 0 to location M-1 in the pattern.

 

Uses:

Searching the Web, searching an online library catalog, searching for words in documents, etc.

 

Brute Force (Naive) Algorithm:

Conceptually, this algorithm is quite simple.  Line up the pattern under the text and see if they match (character by character, starting at the beginning of both the text and the pattern).  If they do not match, slide the pattern to the right one position and try again.  The code (see the correction mentioned in the announcements on my Web page) is given in the Sedgewick text.

 

This algorithm performs quite well in normal circumstances, especially with alphanumeric files, and usually will have a run time proportional to M + N.  However, in the worst case it can have poor performance, with run time proportional to MN.  This occurs when the pattern and text both contain a long repeated sequence of a single character.  For example:

 

text string, A

aaaaaaaaaaaaaaaaaaab

pattern string, P

aaaab

 

Now four matched comparisons and one mismatched comparison will be done each time the pattern is slid a single location down the text.  Noting that N = 20 and M = 5, to find P would require:

5 comps (starting from i = 0) + 5 comps (from i = 1) + ... + 5 comps (from i = 15) =

5 * 16 = 80 comparisons.

We can generalize this to M(N-M+1) comparisons, which gives a worst case proportional to MN (if, as is normally the case, N is much larger than M).
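The sliding comparison described above can be sketched as follows.  This is an illustrative Python sketch (not the Sedgewick code referenced above):

```python
def brute_force_search(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1.

    Brute force: try every starting position i in the text, comparing
    character by character, and slide right by one on a mismatch.
    """
    M, N = len(pattern), len(text)
    for i in range(N - M + 1):          # each possible alignment of P under A
        j = 0
        while j < M and text[i + j] == pattern[j]:
            j += 1
        if j == M:                      # matched all M characters
            return i
    return -1

# The worst-case example above: 16 alignments * 5 comparisons = 80 comps
# brute_force_search("aaaab", "aaaaaaaaaaaaaaaaaaab") returns 15
```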

 

Improvements:

The problem in the worst case with the brute force algorithm is that it rechecks many characters, since it backs up i (in the text) and j (in the pattern) whenever a mismatch is detected.  However, this backing up is necessary to be sure that a potential match is not missed.  If we can in some way "remember" the characters we have already compared, we may be able to avoid rechecking them and still not miss any potential matches.

 

Knuth Morris Pratt (KMP):

The idea behind the KMP algorithm is to preprocess the pattern, so that, upon a mismatch, we can determine exactly how far we can move the pattern without missing a potential match.  This preprocessing is done by comparing the pattern to itself, seeing which characters produce a pattern within the pattern.  Let's look at two extreme cases:

1)      The pattern "abcde" has no sub-patterns, and thus allows us, on any mismatch, to slide the pattern so that its leading 'a' lines up with the mismatched character in the text.   We then start comparing again at the beginning of the pattern.

A

aababcabcdabcde

 

P

abcde

1 comp before mismatch

 

 abcde

2 comps before mismatch

 

   abcde

3 comps before mismatch

 

      abcde

4 comps before mismatch

 

          abcde

5 comps to match

 

2)      The pattern "aaaab" has 'a' repeated 4 times, so, assuming we have a mismatch at 'b',  we cannot slide down more than one space.  However, since we know the 'a's before the mismatched 'b' all match the text string, we can resume comparing at the point of the mismatch, rather than going back to the beginning of the pattern.

A

aaaaaaaaaaaaaab

 

P

aaaab

4 comps before mismatch

 

 aaaab

1 comp before mismatch

 

  aaaab

1 comp before mismatch

 

      ...

 ...

 

          aaaab

2 comps to match

 

The KMP algorithm knows where to resume comparing within the pattern by using a preprocessed next array.  The next array contains the index values on which to resume comparing (in the pattern) given a mismatch at each location in the pattern.  For example, in example 1) above, next[4] = 0, meaning that after a mismatch at index 4 of the pattern (P[4] != A[10] on the second last line of the table), we next compare index 0 of the pattern.  The index in the text is not changed, so we simply recompare the same character again (P[0] == A[10] on the last line of the table).
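The preprocessing and the "never back up in the text" property can be sketched as below.  This sketch uses the standard failure-function formulation, whose indexing differs slightly from the next array convention above, but the idea is the same: on a mismatch, consult the precomputed table and resume within the pattern without moving backward in the text.

```python
def kmp_search(pattern, text):
    """Illustrative KMP sketch.  Returns the index of the first match, or -1."""
    M, N = len(pattern), len(text)
    # fail[j] = length of the longest proper prefix of pattern[:j+1] that is
    # also a suffix of it; built by comparing the pattern against itself
    fail = [0] * M
    k = 0
    for j in range(1, M):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    j = 0                               # current position in the pattern
    for i in range(N):                  # i only moves forward in the text
        while j > 0 and text[i] != pattern[j]:
            j = fail[j - 1]             # slide the pattern; do not back up i
        if text[i] == pattern[j]:
            j += 1
        if j == M:
            return i - M + 1
    return -1
```

On the pattern "aaaab" from example 2), a mismatch at the final 'b' resumes at index 3 of the pattern rather than index 0, exactly the savings described above.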

 

KMP has overhead associated with creating the next array, and, except in extreme cases (such as example 2 above), does not perform much better than the Brute Force algorithm.  However, it is Theta(M+N) in the worst case and it avoids backing up in the text string.  This "forward only" property is potentially valuable if a text string is being processed in real time (e.g., read as it is transmitted).

 

Boyer Moore (BM)

The basic idea of the Boyer Moore algorithm is similar to that of KMP: do preprocessing to allow more skipping in the event of a mismatch.  However, BM takes a different approach: start comparing the pattern from the end rather than the beginning.  In this way, if the pattern is being compared with a part of the text that it does not match at all, large skips can be made with very few comparisons.  For example,

A

wxyzzyxwssabcde

 

P

abcde

0 comps before mismatch

 

     abcde

0 comps before mismatch

 

          abcde

5 comps to match

 

In the example above, 'e' and 'z' mismatch at the first comparison, so the pattern can be slid down its entire length, and A[0] to A[3] are not even looked at.   This is valid because no part of the pattern can possibly match the 'z', so we can start again after 'z'.  However, this is not always the case.  If the mismatched character in the text matches some other character in the pattern, we cannot skip as far, since the pattern may match after a small shift.  For example,

A

xyxyxxyxyyxyxyz

 

P

xyxyz

0 comps before mismatch

 

  xyxyz

0 comps before mismatch

 

   xyxyz

0 comps before mismatch

 

     xyxyz

0 comps before mismatch

 

      xyxyz

0 comps before mismatch

 

        xyxyz

0 comps before mismatch

 

          xyxyz

5 comps to match

 

In the example above, 'z' (P[4]) and 'x' (A[4]) mismatch at the first comparison.  However, the mismatched character in the text string, 'x', appears in the pattern 2 places previous to the mismatch, so we can slide the pattern over only 2 spaces.  Similarly, in the next mismatch, P[4] != A[6], A[6] is 'y', and 'y' appears 1 place previous to the mismatch, so we can slide the pattern over only 1 space.  The remaining "slides" are shown in the rest of the table.  Keep in mind that we always start the comparisons with P[4].

 

The preprocessing done with the Boyer Moore algorithm initializes each location in a skip array (indexed on the entire alphabet used) to M.  It then looks at the pattern and lowers the skip value of letters appearing in the pattern to ensure that a potential match will not be missed.  Note that in the Boyer Moore algorithm, the character used to determine the skip is the letter in the text, while in the KMP algorithm, the character used is the letter in the pattern.  However, the "mismatched character" heuristic here is just one of two parts to the actual Boyer Moore algorithm.  The second part (or first part, if you go by the order that they are presented in the book) calculates a next array in a manner similar to that of KMP.   The "meta" Boyer Moore algorithm would choose whichever of the two results that provides the largest net "skip".  The "meta" Boyer Moore algorithm is guaranteed to run in Theta(M+N) time.
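The mismatched-character heuristic alone can be sketched as follows (an illustrative sketch of only this one heuristic, not the full "meta" algorithm with its KMP-like next array; a dictionary stands in for the alphabet-indexed skip array):

```python
def bm_mismatched_char_search(pattern, text):
    """Boyer Moore, mismatched-character heuristic only.  Returns the
    index of the first match, or -1."""
    M, N = len(pattern), len(text)
    # skip[c] = distance from the rightmost occurrence of c in the pattern
    # to the pattern's end; characters not in the pattern default to M
    skip = {}
    for j, c in enumerate(pattern):
        skip[c] = M - 1 - j
    i = M - 1                           # text index aligned with pattern's end
    while i < N:
        j, k = M - 1, i                 # compare right to left
        while j >= 0 and text[k] == pattern[j]:
            j -= 1
            k -= 1
        if j < 0:
            return k + 1                # matched all M characters
        # skip is determined by the mismatched character in the TEXT;
        # always slide at least one position
        i += max(1, skip.get(text[k], M) - (M - 1 - j))
    return -1
```

Tracing this on the "xyxyz" example above reproduces the alignments in the table: shifts of 2, 1, 2, 1, 2, 2, then a match at position 10.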

 

In practice, Boyer Moore is often much faster than KMP, since it can skip over large sections of text with very few comparisons (KMP must look at each character in the text at least once; Boyer Moore skips many characters in the text entirely).  In favorable conditions, the run-time of Boyer-Moore approaches Theta(N/M) time, since up to M characters can be eliminated per comparison.  However, Boyer Moore accesses the text from right to left, so it may not be practical in some situations.

 

Rabin Karp (RK)

The Rabin Karp algorithm takes a completely different approach to string matching.  RK uses an approach closely related to hashing, and has the following basic idea:

 

Let h1 = a hash of the pattern, using a good hashing function for strings
Let h2 = a hash of the first M characters of the text, using the same hashing function
While (not found and not at the end of the text)
{
    if (h1 == h2)
        if (characters in pattern match characters in text)
            found = true
    if (not found)
        incrementally change h2 by removing the leftmost character and adding the next character
}

 

In this way, no actual character comparisons are done unless the hash values match.  From our knowledge of hashing, we know that collisions can occur, so h1 could equal h2 even if the substrings do not match.  However, if we pick a good hash function and a very large table size, the chance of a collision will be very low.  Since we are not actually storing anything, we can make our "table size" a very large prime number.  We can also calculate our hash values using the same technique discussed previously (using Horner's method), which works in an incremental fashion.  Thus, the Rabin Karp algorithm will be Theta(M+N) in the average case (with the worst case of Theta(MN) occurring only if there are a lot of collisions, which is highly unlikely).
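The pseudocode above can be fleshed out as follows.  This sketch uses Horner's method for the initial hashes and the incremental (rolling) update; the modulus q and radix d are small illustrative choices, whereas a real implementation would use a very large prime, as discussed above:

```python
def rabin_karp_search(pattern, text, q=101, d=256):
    """Illustrative Rabin Karp sketch.  Returns the index of the first
    match, or -1.  q = "table size" (prime), d = alphabet radix."""
    M, N = len(pattern), len(text)
    if M > N:
        return -1
    h = pow(d, M - 1, q)                # d^(M-1) mod q, for removing the leftmost char
    h1 = h2 = 0
    for j in range(M):                  # Horner's method for both initial hashes
        h1 = (d * h1 + ord(pattern[j])) % q
        h2 = (d * h2 + ord(text[j])) % q
    for i in range(N - M + 1):
        # compare characters only when the hash values collide
        if h1 == h2 and text[i:i + M] == pattern:
            return i
        if i < N - M:                   # roll h2 one position to the right
            h2 = (d * (h2 - ord(text[i]) * h) + ord(text[i + M])) % q
    return -1
```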

 

Conclusions

After seeing four different algorithms for the same task, it is natural to wonder which is the best.  As is the case with many algorithms, the answer is not always the same.  For very long text strings (e.g., a large document such as a textbook) the Boyer Moore algorithm tends to be the fastest.  However, the preprocessing required for BM negates its comparison advantages for shorter text strings.  In these cases, Rabin Karp (with a quickly calculated hash function) may be preferred.  Also recall that the simple Brute Force algorithm tends to work well in normal circumstances, so if the text file is not too long, and a worst case scenario is not likely to occur, it can also be used with good results.