CS 1501
Data Structures and Algorithms
Programming Project 3
We have discussed in lecture how the Boyer-Moore
and Quicksearch algorithms can potentially give
improved performance over the simple (brute-force) algorithm in the normal
case. In addition, the Quicksearch algorithm uses only the bad character shift
table and is a simplified version of Boyer-Moore. In this assignment you will
implement and test all three algorithms to see for yourself whether any performance
gain is achieved with Quicksearch and Boyer-Moore.
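As a reminder of how the bad character shift drives Quicksearch, here is a minimal in-memory sketch in Python. The function names `build_shift_table` and `quicksearch` are illustrative assumptions, not the code from the text, and the sketch deliberately ignores the file-streaming and array-size requirements described below.

```python
def build_shift_table(pattern):
    """Quicksearch bad character table (hypothetical helper).

    The shift is looked up for the text character just PAST the current
    window. For a character whose rightmost occurrence in the pattern is
    at index i, the shift is m - i. Later occurrences overwrite earlier
    ones in the dict, so the rightmost occurrence wins.
    """
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}


def quicksearch(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    shift = build_shift_table(pattern)
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        if pos + m == n:
            break  # no character past the window, so no further shift
        # Characters absent from the pattern use the default shift m + 1.
        pos += shift.get(text[pos + m], m + 1)
    return -1
```

For the pattern `abcab`, this table maps `a` to 2, `b` to 1, and `c` to 3; any other character shifts the window by 6.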
Both of these algorithms are implemented (to varying degrees) in the Sedgewick text for finding a string (array of chars) within
another string (array of chars). Your
assignment is as follows:
1)
Modify the code in the text so that the
algorithms will work using a FILE specified by the user for the "text
string". In other words, the
algorithms should try to match a pattern string within a file of arbitrary
length. You may read in the data in the
file one character at a time or multiple characters at a time, but any arrays
you use MUST BE NO LONGER than 2M in length, where M is the length of the
pattern. In other words, you are NOT
allowed to read the entire file into a long array or string variable (or even
one line of the file, since you cannot predict the line length) – you must
process the file as a stream while you are doing the comparisons. This in fact can be done with an array of
size M, but if you find it easier to use an array of size 2M that is also
acceptable. Think carefully about how
you can do this and trace it on paper before you actually code it – it most
definitely requires some thought. Be
sure to document your code (more so than in a normal program) so that it
is very clear what you are doing and how you are doing it.
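One possible way to picture the size-M buffer is a circular array holding the last M bytes read. The sketch below is in Python and uses brute-force comparison only; the name `stream_search` is an assumption for illustration, and adapting the idea to the shift-based algorithms is the part that needs the careful thought mentioned above.

```python
import io  # used only for the in-memory demo at the bottom


def stream_search(stream, pattern):
    """Return the 0-based byte offset of the first match, or -1.

    stream  : a binary file-like object, read one byte at a time
    pattern : a bytes object of length M
    The only text storage is a circular buffer of exactly M bytes.
    """
    m = len(pattern)
    if m == 0:
        return 0
    window = bytearray(m)  # the byte at file offset k lives at index k % m
    count = 0              # number of bytes read so far
    while True:
        b = stream.read(1)
        if not b:
            return -1      # end of file reached without a match
        window[count % m] = b[0]
        count += 1
        if count >= m:
            start = count - m  # file offset of the oldest byte in the window
            if all(window[(start + j) % m] == pattern[j] for j in range(m)):
                return start


# Demo on an in-memory stream standing in for a real file.
print(stream_search(io.BytesIO(b"abracadabra"), b"cad"))
```

With a real file, the stream would come from something like `open(name, "rb")`; the buffer never grows beyond M bytes regardless of the file length.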
2)
Add code to your algorithms that counts
the number of character comparisons required to do the match (or to determine that
no match is found). Note that this code
should NOT count all instructions within the algorithms – just the character
comparisons.
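To illustrate what counts as a character comparison, here is a small in-memory brute-force sketch in Python. The function name `brute_force_count` is hypothetical; the point is only that the counter is incremented exactly once per text-versus-pattern character test and nowhere else.

```python
def brute_force_count(text, pattern):
    """Return (match position or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1  # one text-vs-pattern character comparison
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons
```

Searching `aaab` for `ab` with this sketch makes two comparisons at each of the three alignments, so it returns position 2 after 6 comparisons. Loop tests such as `j < m` are not counted.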
3)
Implement your main program to run in a way
analogous to the example run
(which shows a comparison of the Brute-force and Boyer-Moore algorithms;
yours will also include Quicksearch), and to
produce output as shown in the sample
output file. Some important
specifications are mentioned below:
a)
The Pattern File will contain a new
pattern on each line – your program should process all of the patterns in this
file (resetting and searching the Text File for each one).
b)
Each search will return either the byte
location at which the pattern was found in the file (starting at 0), or an
indicator that the pattern was not found.
In either case, keep track of the number of characters passed
during your search. Naturally, if the
pattern is not found, this should be approximately equal to the length of the
file, and if the pattern is found, it should be the location in the file of the
last character in the pattern.
c)
Sum the total comparisons required and
the total characters passed for all of the patterns in the pattern file, and
output the average comparisons per character passed for the entire pattern
file.
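The averaging in part (c) is a ratio of totals over the whole pattern file, not a mean of per-pattern ratios. A minimal Python sketch, assuming each search yields a hypothetical `(comparisons, chars_passed)` pair:

```python
def average_comparisons(results):
    """Average comparisons per character passed over all patterns.

    results: iterable of (comparisons, chars_passed) pairs, one pair per
    pattern in the pattern file (an assumed representation).
    """
    total_comparisons = 0
    total_passed = 0
    for comparisons, passed in results:
        total_comparisons += comparisons
        total_passed += passed
    if total_passed == 0:
        return 0.0  # nothing was searched; avoid dividing by zero
    return total_comparisons / total_passed
```

For example, two searches with results `(6, 4)` and `(3, 2)` give 9 total comparisons over 6 characters passed, an average of 1.5.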
4)
Run your program on the data files
supplied, using the file names indicated below:
Output File | Text File | Pattern File
sim1a.out | |
sim1b.out | medium.txt |
sim2a.out | | shortp.txt
sim2b.out | large.txt | longp.txt
sim3a.out | |
sim3b.out | worst.txt |
bm1a.out | medium.txt | shortp.txt
bm1b.out | medium.txt | longp.txt
bm2a.out | large.txt | shortp.txt
bm2b.out | large.txt | longp.txt
bm3a.out | worst.txt | wpata.txt
bm3b.out | worst.txt | wpatb.txt
5)
Write a short (~2 page) paper discussing
your results. At a minimum, be sure to
discuss each of the following issues:
a)
How did the average comparisons per
character differ for each algorithm between the 3 different text files and why?
b)
How did the average comparisons per
character differ for each algorithm between the 3 different pattern files and
why?
c)
Based on the searching algorithms and on
your main program, do you think the overall run-time (in seconds) of the
algorithms will differ significantly?
Give your best, supported opinion as to why or why not.
6)
Submit all of your source files and
executables as specified in the Submission Guidelines. Also submit all of your output data
files. However, do NOT submit any input
data files, since these are standard files that the TAs will also have. Don't forget your Assignment Information
Sheet.
7)
Extra Credit: If you want some extra credit (10 points), add
the Rabin-Karp and/or the KMP algorithms to your program, and get results for
these as well.
8)
W section:
Visit the website, http://www-igm.univ-mlv.fr/~lecroq/string/,
to see visualizations of all of the important string matching algorithms. You
must choose one of the algorithms not discussed in class. Learn how it works,
then write up a description that explains it with examples at each stage of the algorithm.
In other words, you are to carefully write up the description so
that someone who hasn't seen this algorithm before will understand how it works
and how it performs, and what resources it requires.