CS 1501 Programming Project 4

Implementing a Unix-like Diff program

 

 

 

The goal of this programming project is to implement a modification of the Unix diff program. The Unix diff program is implementing a version of the edit distance algorithm discussed in class and at http://www.merriampark.com/ld.htm (local backup copy). Review as needed to make sure that you understand the underlying algorithm. The Unix diff program is called by a command line of the form:

diff file1 file2

Here file1 and file2 are ASCII files. The Unix diff utility will compare  the  contents  of  file1  and file2  and write to standard output a list of changes necessary to convert file1  into  file2.   This  list  should  be minimal.  No output will be produced if the files are identical.
The normal output contains lines of these forms:


 

          n1 a n3,n4
          n1,n2 d n3
          n1,n2 c n3,n4


 

where n1  and  n2  represent  lines  file1  and  n3  and  n4 represent lines in file2. These lines resemble Unix ed commands to convert file1 to file2.  "a" stands for append, "d" for delete and "c" for change. By exchanging a for d and  reading  backward,  file2  can be converted to file1.  Identical pairs, where n1=n2 or n3=n4, are abbreviated as  a single number. Following each of these lines come all the  lines  that  are affected  in  the  first  file  flagged by `<', then all the lines that are affected in the second file flagged by `>'. Play with the Unix diff command on some short files until you understand the output convention. Note that Unix diff command is not implemented exactly as we describe below. But what we describe is close enough for our purposes.

 

We now describe the version of diff that you will implement. You have write two programs. The second program will be a refinement of the first program that uses less space, perhaps at the cost of not giving the exact minimum edit distance for non-typical instances.

 

Program 1: Use the code found at http://www.merriampark.com/ld.htm as a starting point. We now give some implementation requirements.   You must use hashing to compute a representative 64 bit integer for each line.   File1 is represented by an array F1, and File2 is represented by an array F2. The value of position i in each array should be the hash value of line i in the file. Use the hash function used in the Rabin-Karp algorithm on page 289 of the text. You must use Horner's method to compute the hash value.You will have to use d=27=128 since we have an ASCII file. You  must use the Mersenne prime q=231-1 as your prime to mod by (see the code on page 290 of the text). Note that this prime is sufficiently small that you will not  have to worry about overflow (see the discussion after the code on page 290).  You may assume that if the hash value of two lines are equal, then the corresponding lines are equal. Thus there is some chance that the output of your program will not be correct. Each line of the output should have one of the three forms:

So the output of your program will not exactly match the unix diff command in either style or content. You need to have your output exactly of this form, because we will use a program to check the correctness of your answers.

 

 

Program 2: One problem is that as described in class and in http://www.merriampark.com/ld.htm, diff would use quadratic space in the size of the files. This will be unacceptable if the files have many lines. The command line will be of the following form:

diff k file1 file2

where k is a non-negative integer. Intuitively, this enforces the requirement that lines i and j can not be matched if they are more than k lines apart from each other.

To define this formally, let n be the size of file 1, and m be the size of file 2. Let the function floor(x), where x is a rational number, be the biggest integer ≤ x. So floor(2.2)=2, floor(2.8)=2, floor(3)=3.

Entries (i, j) where j = i*m/n are on the main "diagonal" of the array. You will probably have to put some thought into understanding the above conditions. This will reduce the space to O(k*max(n, m)).  Probably the easiest way to do this is to use a one dimensional array to store these defined values.  The array entries are computed as before. So we can change the description of the algorithm to:

Construct a matrix d containing 0..n rows and 0..m columns.

Initialize the first row to 0..m
Initialize the first column to 0..n.

Examine each line of file 1 (i from 1 to n)

    Examine each line of file 2 (j from 1 to m)

        If d[i, j] is valid then

            If file1[i] equals file2[j], d[i, j] = d[i-1, j-1]

            else set d[i, j] to the minimum of

  • d[i-1,j] + 1 if d[ i-1, j] is valid
  • d[i,j-1] + 1 if d[i, j-1] is valid
  • d[i-1,j-1] + 1 if d[i-1, j-1] is valid

After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m]

By tracing back from d[n, m], output the minimum edit sequence

Your program must use space at most O(k*max(file1 size, file2 size)).

You should turn in your two programs. You should turn the input of your first program on the first two instances here (ex1_1.txt should be paired with ex1_2.txt, ex2_1.txt should be paired with ex2_2.txt),  and the output of your second program with k=500 on these 7 pairs of input files (ex1_1.txt should be paired with ex1_2.txt, ex2_1.txt should be paired with ex2_2.txt, etc.) Do not turn in the original test files. No write-up is required for this project.