CS 1501 Programming Project 5

The goal of this programming project is to implement a modification of the Unix diff program. The Unix diff program is implementing a version of the edit distance algorithm discussed in class and at http://www.merriampark.com/ld.htm (local backup copy). Review as needed to make sure that you understand the underlying algorithm. The Unix diff program is called by a command line of the form:

where n1 and n2 represent lines file1 and n3 and n4 represent lines in file2. These lines resemble Unix ed commands to convert file1 to file2. "a" stands for append, "d" for delete and "c" for change. By exchanging a for d and reading backward, file2 can be converted to file1. Identical pairs, where n1=n2 or n3=n4, are abbreviated as a single number. Following each of these lines come all the lines that are affected in the first file flagged by `<', then all the lines that are affected in the second file flagged by `>'. Play with the Unix diff command on some short files until you understand the output convention. Note that Unix diff command is not implemented exactly as we describe below. But what we describe is close enough for our purposes.

We now describe the version of diff that you will implement. You have write two programs. The second program will be a refinement of the first program that uses less space, perhaps at the cost of not giving the exact minimum edit distance for non-typical instances.

Program 1: Use the code found at http://www.merriampark.com/ld.htm as a starting point. We now give some implementation requirements. You must use hashing to compute a representative 64 bit integer for each line. File1 is represented by an array F1, and File2 is represented by an array F2. The value of position i in each array should be the hash value of line i in the file. Use the hash function used in the Rabin-Karp algorithm on page 289 of the text. You must use Horner's method to compute the hash value.You will have to use d=2⁷=128 since we have an ASCII file. You must use the Mersenne prime q=2³¹-1 as your prime to mod by (see the code on page 290 of the text). Note that this prime is sufficiently small that you will not have to worry about overflow (see the discussion after the code on page 290). You may assume that if the hash value of two lines are equal, then the corresponding lines are equal. Thus there is some chance that the output of your program will not be correct. Each line of the output should have one of the three forms:

a, n1, n2
- This means that a line is inserted after line n1 in file 1 and this inserted line is line n2 from file 2
d, n1, n2
- This means that line n1 is deleted from file 1, and if this line was left in, it would have been line n2 in file 2
c, n1, n2
- This means that line n1 in file 1 is replaced by line n2 in file 2

So the output of your program will not exactly match the unix diff command in either style or content. You need to have your output exactly of this form, because we will use a program to check the correctness of your answers.

Program 2: One problem is that as described in class and in http://www.merriampark.com/ld.htm, diff would use quadratic space in the size of the files. This will be unacceptable if the files have many lines. The command line will be of the following form:

diff k file1 file2

where k is a non-negative integer. Intuitively, this enforces the requirement that lines i and j can not be matched if they are more than k lines apart from each other.

To define this formally, let n be the size of file 1, and m be the size of file 2. Let the function floor(x), where x is a rational number, be the biggest integer ≤ x. So floor(2.2)=2, floor(2.8)=2, floor(3)=3.

Assume that m ≤ n, that is, there are more lines in file 1 than file 2. Then d[i, j] is defined only if
- j is at least floor(i*m/n) - k,
- and j is at most floor(i*m/n) + k
Assume that n ≤ m that is, there are more lines in file 2 than file 1. Then d[i, j] is defined only if
- i is at least floor(j*n/m) - k,
- and i is at most floor(j*n/m) + k

Entries (i, j) where j = i*m/n are on the main "diagonal" of the array. You will probably have to put some thought into understanding the above conditions. This will reduce the space to O(k*max(n, m)). Probably the easiest way to do this is to use a one dimensional array to store these defined values. The array entries are computed as before. So we can change the description of the algorithm to:

Construct a matrix d containing 0..n rows and 0..m columns.

Initialize the first row to 0..m
Initialize the first column to 0..n.

Examine each line of file 1 (i from 1 to n)

Examine each line of file 2 (j from 1 to m)

If d[i, j] is valid then

If file1[i] equals file2[j], d[i, j] = d[i-1, j-1]

else set d[i, j] to the minimum of

d[i-1,j] + 1 if d[ i-1, j] is valid

d[i,j-1] + 1 if d[i, j-1] is valid

d[i-1,j-1] + 1 if d[i-1, j-1] is valid

After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m]

By tracing back from d[n, m], output the minimum edit sequence

Your program must use space at most O(k*max(file1 size, file2 size)).

You should turn in your two programs. You should turn the input of your first program on the first two instances here (ex1_1.txt should be paired with ex1_2.txt, ex2_1.txt should be paired with ex2_2.txt), and the output of your second program with k=500 on these 7 pairs of input files (ex1_1.txt should be paired with ex1_2.txt, ex2_1.txt should be paired with ex2_2.txt, etc.) Do not turn in the original test files. No write-up is required for this project.