CS 1501 Algorithm Implementation

Programming Project 2: Repeated Strings

 

 

Repeat Search – finding the longest repeated  substring

Your assignment is to implement and evaluate three different algorithms  for finding the longest substring that occurs at at least two different positions in the input text file.  For example, if the input file was:

 
this is a test to use to test the redundant string detector.
 

then the answer would be the substringstring: " test t" starting in positions 10 and 25 if the text index starts at 0

 

Each of your programs should read from standard input and write to standard output. Each of the algorithms reads a text file into a char array A. We will assume that the text is typical English-language text; not necessarily random, but also necessarily not degenerate. That is, you can't base your algorithm on the string being purely random, but, on the other hand, you needn't worry about degenerate cases (for example, strings consisting of all blanks). You should treat upper and lower case characters as being equal. So for example “A”=”a”. Treat all non-alphabetic characters as white space and collapse all white space to a single space between words. So for example a file:

 

I love my MTV !! :-)

 

<tab> Go!    DOG, go!

 

Would be stored in the array as:

 

i love my mtv go dog go

 

 

 

You can specify a solution by variables L, which is the starting position of the first occurrence of the substring, R, which is the starting position of the second occurrence of the substring, and Length, which is the length of the substring.

 

Algorithm 1: This is perhaps the most naïve algorithm:

 

MaxL=MaxR=maxLength=0

For L = 1 to n do

               For R = 1 to n do

                     Determine the longest k such A[L]A[L+1]…A[L+k-1] = A[R]A[R+1]…A[R+k-1]

                     If k>maxLength then maxL=L, maxR=R and maxLength=Length

 

Algorithm 2:

 

For L = 1 to n do

               Use the Boyer-Moore string matching algorithm to determine whether there is a better solution for this particular L

                than the previous L’s

 

So after 1000 times through the outer loop, this algorithm will have computed the best solution where the leftmost occurrence of the substring started in one of the first 1000 positions.

 

Algorithm 3: Consider the n suffixes S1 to Sn of the input. So if the input string was:

 

abbaabc

 

then the suffixes would be

 

S1=abbaabc

S2=bbaabc

S3=baabc

S4=aabc

S5=abc

S6=bc

S7=c

 

Now sort the suffixes alphabetically. Then the longest repeated string must be the longest common prefix between two consecutive suffixes. In the above example, the sorted order of the suffixes is:

 

S4=aabc

S1=abbaabc

S5=abc

S3=baabc

S2=bbaabc

S6=bc

S7=c

 

The pair of consecutive suffixes with the longest common prefix are S1 and S5, which share a prefix of length 2, namely “ab”. Note the “ab” is the answer for this input and the two strings start in position 1 and position 5. This suggests the following algorithm:

 

Sort the suffixes

Find the two consecutive suffixes in the sorted order with the longest common prefix.

 

Take some care to implement this algorithm as efficiently as possible. In particular, your algorithm must run in O(n log n) on normal English text. This means for example, that you can not explicitly represent the suffixes since the size of the suffixes is Omega(n2) . You will have to represent the suffixes by indexes into the original string. You may use a reasonable amount of extra space to help speed up your algorithm, but, naturally, don't waste space either.

So you will end up with three programs. Each program should read the name of a file from standard input and print seven results on standard output  in this order (and labeled clearly):

1)      algorithm used

2)      length of longest match

3)      the longest matching string

4)      start position of first occurrence (starting from 0)

5)      start position of second occurrence (starting from 0)

6)      total number of character accesses (counting repeats)

7)      average character accesses per character in the file (simply the previous result divided by the number of characters in the file)

 

Here is 1K, 25K, 200K, 1M, 3M, 6M, 10M  test data. (Sizes are approximate)

 

Along with your source and executable files (or .jar file), also turn in a short report listing the results that you obtained on runs of each of the test files, and explaining why the results differed or did not differ significantly between the three algorithms.  If some file is too large for one of the algorithms to finish, use the run times from the smaller inputs to give an estimate what the run time would have been for that file. Your report should explain how you obtained these estimtes. Your report should contain the filled in version of the following table:

Running time in seconds/minutes/hours/days/years as appropriate

 

Algorithm 1

Algorithm 2

Algorithm 3

1K

 

 

 

25K

 

 

 

200K

 

 

 

1M

 

 

 

3M

 

 

 

6M

 

 

 

10M

 

 

 

 

For non-W students, the report can be one page. For W students, the report should be 2-4 pages. W students should comment on any implementation problems that they faced, and how they overcame these problems.

Your grade for this assignment will be based upon getting both algorithms to work properly, the quality of your implementations, and on the quality of your write-up.  Also remember to turn in your Assignment Information Sheet.

 

If you are using string variables to compare your substrings, BE CAREFUL to do this in an efficient way.  For example, in the Java String class, for a String S, the operation:      S = S + newchar;      would not be one character access but actually Theta(S.length()) accesses, since a new String is made in this operation.  The StringBuffer class is a better choice here – given StringBuffer S:    S.append(newchar);    only resizes the StringBuffer if necessary, and you can size it originally so that resizing would never occur.  In this way, the append would be only a single character access.   A similar logic holds for C++ -- if you are using the string class, use the append() function rather than the + operator.