CS 1501

Data Structures and Algorithms

Programming Project 3

We have discussed in lecture how the Boyer-Moore and Quicksearch algorithms can potentially give improved performance over the simple (brute-force) algorithm in the normal case. Quicksearch is a simplified version of Boyer-Moore that uses only the bad character shift table. In this assignment you will implement and test all three algorithms to see for yourself whether Quicksearch and Boyer-Moore achieve any performance gain. Both of these algorithms are implemented (to varying degrees) in the Sedgewick text for finding a string (array of chars) within another string (array of chars). Your assignment is as follows:
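As a reminder of what the bad character shift table looks like for Quicksearch, here is a minimal sketch in Python. It is not the code from the text; the function names quicksearch_shift_table and shift_for are made up for illustration. It assumes the usual Quicksearch rule: after each attempt, look at the text character just past the current window and shift by M minus the index of that character's last occurrence in the pattern, or by M + 1 if the character does not occur in the pattern.

```python
def quicksearch_shift_table(pattern):
    """Map each character of the pattern to its Quicksearch shift.

    For a pattern of length M, the shift for character c is
    M - (index of the last occurrence of c in the pattern).
    """
    m = len(pattern)
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = m - i  # later occurrences overwrite earlier ones
    return table


def shift_for(table, pattern_length, ch):
    """Look up the shift; characters absent from the pattern shift by M + 1."""
    return table.get(ch, pattern_length + 1)


table = quicksearch_shift_table("abcab")
print(table)                     # {'a': 2, 'b': 1, 'c': 3}
print(shift_for(table, 5, "z"))  # 6
```

A dictionary is used here only for brevity; an array indexed by character code works just as well.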

1) Modify the code in the text so that the algorithms will work using a FILE specified by the user for the "text string". In other words, the algorithms should try to match a pattern string within a file of arbitrary length. You may read in the data in the file one character at a time or multiple characters at a time, but any arrays you use MUST BE NO LONGER than 2M in length, where M is the length of the pattern. In other words, you are NOT allowed to read the entire file into a long array or string variable (or even one line of the file, since you cannot predict the line length) – you must process the file as a stream while you are doing the comparisons. This in fact can be done with an array of size M, but if you find it easier to use an array of size 2M that is also acceptable. Think carefully about how you can do this and trace it on paper before you actually code it – it most definitely requires some thought. Be sure to document your code (more so than in a normal program) so that it is very clear what you are doing and how you are doing it.
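One possible way to honor the size-M limit is a circular buffer that always holds the most recent M characters read from the file. The Python sketch below shows the idea for the brute-force algorithm only; the function name stream_brute_force and the use of a text-mode stream are assumptions made for illustration. Boyer-Moore and Quicksearch need more thought, because they compare in a different order and skip ahead by more than one character.

```python
import io


def stream_brute_force(stream, pattern):
    """Return the 0-based offset of the first match in the stream, or -1.

    Only an M-element circular buffer is kept in memory, where M is the
    pattern length; the file is never read in as a whole.
    """
    m = len(pattern)
    if m == 0:
        return 0
    window = [""] * m   # window[k % m] holds the character at file offset k
    count = 0           # number of characters read so far
    while True:
        ch = stream.read(1)
        if ch == "":    # end of file (text-mode stream assumed)
            return -1
        window[count % m] = ch
        count += 1
        if count >= m:
            start = count - m  # file offset of the oldest buffered character
            if all(window[(start + j) % m] == pattern[j] for j in range(m)):
                return start


print(stream_brute_force(io.StringIO("abracadabra"), "cad"))  # 4
```

The key invariant is that the character at file offset k always lives in slot k mod M, so the window never has to be copied or shifted as the search advances.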

2) Add code to your algorithms that counts the number of character comparisons required to do the match (or determine that no match is found). Note that this code should NOT count all instructions within the algorithms – just the character comparisons.
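To make "character comparisons" concrete, the following Python sketch counts them for an in-memory brute-force search. The function name brute_force_count is hypothetical, and the sketch ignores the file-streaming requirement of item 1 so that the counting stands out. The counter is incremented exactly once per text-versus-pattern character test and nowhere else.

```python
def brute_force_count(text, pattern):
    """Return (match_offset, comparisons); match_offset is -1 if not found."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1  # one text-vs-pattern character comparison
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


print(brute_force_count("aaab", "ab"))  # (2, 6)
```

Loop tests such as j < m are not character comparisons and are deliberately left out of the count.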

3) Implement your main program to run in a manner analogous to the example run (which shows a comparison of the Brute-force and Boyer-Moore algorithms; yours will also include Quicksearch), and to produce output as shown in the sample output file. Some important specifications are mentioned below:

a) The Pattern File will contain a new pattern on each line – your program should process all of the patterns in this file (resetting and searching the Text File for each one).

b) Each search will return either the byte location at which the pattern was found in the file (starting at 0), or an indicator that the pattern was not found. In either case, keep track of the number of characters passed during your search. Naturally, if the pattern is not found, this should be approximately equal to the length of the file, and if the pattern is found, it should be the location in the file of the last character in the pattern.

c) Sum the total comparisons required and the total characters passed for all of the patterns in the pattern file, and output the average comparisons per character passed for the entire pattern file.
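The statistic in (c) is the ratio of the two totals, not the mean of the per-pattern ratios. A minimal Python sketch, assuming each search reports a (comparisons, characters passed) pair; the function name average_comparisons and the sample numbers are made up for illustration and are not real results:

```python
def average_comparisons(results):
    """results: iterable of (comparisons, chars_passed) pairs, one per pattern."""
    total_comparisons = 0
    total_passed = 0
    for comparisons, passed in results:
        total_comparisons += comparisons
        total_passed += passed
    if total_passed == 0:
        return 0.0  # avoid dividing by zero on an empty pattern file
    return total_comparisons / total_passed


# Two hypothetical searches: 120 comparisons over 100 characters passed,
# then 330 comparisons over 200 characters passed.
print(average_comparisons([(120, 100), (330, 200)]))  # 1.5
```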

4) Run your program on the data files supplied, using the file names indicated below:

Output File	Text File	Pattern File
sim1a.out	medium.txt	shortp.txt
sim1b.out	medium.txt	longp.txt
sim2a.out	large.txt	shortp.txt
sim2b.out	large.txt	longp.txt
sim3a.out	worst.txt	wpata.txt
sim3b.out	worst.txt	wpatb.txt
bm1a.out	medium.txt	shortp.txt
bm1b.out	medium.txt	longp.txt
bm2a.out	large.txt	shortp.txt
bm2b.out	large.txt	longp.txt
bm3a.out	worst.txt	wpata.txt
bm3b.out	worst.txt	wpatb.txt

5) Write a short (~2 page) paper discussing your results. Be sure to minimally discuss each of the following issues:

a) How did the average comparisons per character differ for each algorithm between the 3 different text files and why?

b) How did the average comparisons per character differ for each algorithm between the 3 different pattern files and why?

c) Based on the searching algorithms and on your main program, do you think the overall run-time (in seconds) of the algorithms will differ significantly? Give your best, supported opinion as to why or why not.

6) Submit all of your source files and executables as specified in the Submission Guidelines. Also submit all of your output data files. However, do NOT submit any input data files, since these are standard files that the TAs will also have. Don't forget your Assignment Information Sheet.

7) Extra Credit: If you want some extra credit (10 points), add the Rabin-Karp and/or the KMP algorithms to your program, and get results for these as well.
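For those attempting Rabin-Karp, the core idea is a rolling hash that is updated in constant time as the window slides one character to the right. The Python sketch below works on an in-memory string and ignores the streaming and comparison-counting requirements. The constants BASE and MOD and the function name rabin_karp are arbitrary choices for illustration, not values taken from the text.

```python
BASE = 256       # assumed radix for the polynomial hash
MOD = 1_000_003  # assumed prime modulus


def rabin_karp(text, pattern):
    """Return the 0-based offset of the first match, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(BASE, m - 1, MOD)  # weight of the window's leading character
    p_hash = t_hash = 0
    for j in range(m):
        p_hash = (p_hash * BASE + ord(pattern[j])) % MOD
        t_hash = (t_hash * BASE + ord(text[j])) % MOD
    for i in range(n - m + 1):
        # Equal hashes only suggest a match; verify to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Drop text[i] from the front, append text[i + m] at the back.
            t_hash = ((t_hash - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return -1


print(rabin_karp("abracadabra", "cad"))  # 4
```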

8) W section:
Visit the website http://www-igm.univ-mlv.fr/~lecroq/string/ to see visualizations of all of the important string matching algorithms. You must choose one of the algorithms not discussed in class. Learn how it works, then write up a description that uses examples at each stage of the algorithm to explain it. In other words, you are to write up the description carefully enough that someone who hasn't seen this algorithm before will understand how it works, how it performs, and what resources it requires.