CS 1501 Programming Project 4
Implementing a Unix-like Diff program
The goal of this programming project is to implement a modification of the Unix diff program. The Unix diff program is implementing a version of the edit distance algorithm discussed in class and at http://www.merriampark.com/ld.htm (local backup copy). Review as needed to make sure that you understand the underlying algorithm. The Unix diff program is called by a command line of the form:
diff file1 file2
We now describe the version of diff that you will implement. You have write two programs. The second program will be a refinement of the first program that uses less space, perhaps at the cost of not giving the exact minimum edit distance for non-typical instances.
Program 1: Use the code found at http://www.merriampark.com/ld.htm as a starting point. We now give some implementation requirements. You must use hashing to compute a representative 64 bit integer for each line. File1 is represented by an array F1, and File2 is represented by an array F2. The value of position i in each array should be the hash value of line i in the file. Use the hash function used in the Rabin-Karp algorithm on page 289 of the text. You must use Horner's method to compute the hash value.You will have to use d=27=128 since we have an ASCII file. You must use the Mersenne prime q=231-1 as your prime to mod by (see the code on page 290 of the text). Note that this prime is sufficiently small that you will not have to worry about overflow (see the discussion after the code on page 290). You may assume that if the hash value of two lines are equal, then the corresponding lines are equal. Thus there is some chance that the output of your program will not be correct. Each line of the output should have one of the three forms:
So the output of your program will not exactly match the unix diff command in either style or content. You need to have your output exactly of this form, because we will use a program to check the correctness of your answers.
Program 2: One problem is that as described in class and in http://www.merriampark.com/ld.htm, diff would use quadratic space in the size of the files. This will be unacceptable if the files have many lines. The command line will be of the following form:
diff k file1 file2
where k is a non-negative integer. Intuitively, this enforces the requirement that lines i and j can not be matched if they are more than k lines apart from each other.
To define this formally, let n be the size of file 1, and m be the size of file 2. Let the function floor(x), where x is a rational number, be the biggest integer ≤ x. So floor(2.2)=2, floor(2.8)=2, floor(3)=3.
Entries (i, j) where j = i*m/n are on the main "diagonal" of the array. You will probably have to put some thought into understanding the above conditions. This will reduce the space to O(k*max(n, m)). Probably the easiest way to do this is to use a one dimensional array to store these defined values. The array entries are computed as before. So we can change the description of the algorithm to:
Construct a matrix d containing 0..n rows and 0..m columns.Initialize the first row to 0..m
Initialize the first column to 0..n.
Examine each line of file 1 (i from 1 to n)
Examine each line of file 2 (j from 1 to m)
If d[i, j] is valid then
If file1[i] equals file2[j], d[i, j] = d[i-1, j-1]
else set d[i, j] to the minimum of
- d[i-1,j] + 1 if d[ i-1, j] is valid
- d[i,j-1] + 1 if d[i, j-1] is valid
- d[i-1,j-1] + 1 if d[i-1, j-1] is valid
After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m]
By tracing back from d[n, m], output the minimum edit sequence
Your program must use space at most O(k*max(file1 size, file2 size)).
You should turn in your two programs. You should turn the input of your first program on the first two instances here (ex1_1.txt should be paired with ex1_2.txt, ex2_1.txt should be paired with ex2_2.txt), and the output of your second program with k=500 on these 7 pairs of input files (ex1_1.txt should be paired with ex1_2.txt, ex2_1.txt should be paired with ex2_2.txt, etc.) Do not turn in the original test files. No write-up is required for this project.