CS 1501 Programming Project 3

Lossless Compression Schemes

Purpose: The purpose of this assignment is for you to understand and compare some lossless compression schemes.  In particular, you will compare compression using the self-organizing search heuristics (Move-To-Front and Transpose) discussed in lecture against two other compression algorithms.

General Details: In lecture we discussed (or will discuss) a number of lossless compression algorithms, including Huffman, LZW, and the self-organizing search compression algorithm.  In this assignment you will implement the self-organizing search algorithm and compare its compression quality to that of two other compression algorithms.  You will test each algorithm on a variety of files to see how each performs under different circumstances.  You will then tabulate your results and write a brief report comparing the algorithms.

Specific Procedure:

1)      Thoroughly read over your notes from class and the on-line notes on the self-organizing search compression algorithm. The following web sites also give brief introductions to Move-To-Front (one particular self-organizing strategy) compression: http://www.arturocampos.com/ac_mtf.html and http://www.rebolforces.com/articles/compression/2. The Transpose heuristic, by contrast, only moves the accessed item one position closer to the front. Try a few practice examples by hand until you are completely familiar with the algorithms and how and why they work; a small demonstration follows below.
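To make the Move-To-Front update concrete, here is a minimal Java demonstration on single characters (the assignment itself uses 16-bit codewords; the class and variable names are illustrative only, not a required design):

import java.util.LinkedList;
import java.util.List;

public class MtfDemo {
    public static void main(String[] args) {
        // Start with the alphabet in its natural order.
        List<Character> list = new LinkedList<>();
        for (char c = 'a'; c <= 'z'; c++) list.add(c);

        String input = "banana";
        StringBuilder output = new StringBuilder();
        for (char c : input.toCharArray()) {
            int pos = list.indexOf(c);   // the emitted code is the position
            output.append(pos).append(' ');
            list.remove(pos);            // move the symbol to the front
            list.add(0, c);
        }
        System.out.println(output);      // prints: 1 1 13 1 1 1
    }
}

Note how the repeated symbols of "banana" quickly collapse to small indices; this skew toward small values is exactly what a variable-length code can exploit.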

2)      Write a program to implement the algorithms so that it works on an arbitrary file.  You should be able to specify the algorithm, and whether to compress or decompress, from the command line (a parsing sketch follows the list below). That is:

·        java ProgramName -c -m filename should do Move-To-Front compression.

·        java ProgramName -d -m filename should do Move-To-Front decompression.

·        java ProgramName -c -t filename should do Transpose compression.

·        java ProgramName -d -t filename should do Transpose decompression.
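As a sketch of the command-line handling described above (the class name and the final dispatch are placeholders, not a required design):

public class ProgramName {
    public static void main(String[] args) {
        // Expected usage: java ProgramName (-c | -d) (-m | -t) filename
        boolean decompress  = args[0].equals("-d");
        boolean moveToFront = args[1].equals("-m");
        String  filename    = args[2];
        // ... open filename as a binary file and dispatch to the chosen
        // compress/decompress routine and update heuristic ...
        System.out.println((decompress ? "decompressing " : "compressing ")
                + filename + " with "
                + (moveToFront ? "Move-To-Front" : "Transpose"));
    }
}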

In your implementation the original codewords should be 16-bit sequences (i.e., you will process the input file 16 bits at a time, and your arrays should have 2^16 = 65,536 locations). Since your input file could contain arbitrary bytes, and since your output data can end on a fraction of a byte, both your input and output files must be BINARY files.  If you are unfamiliar with using binary files in Java, see the handout binaryio.java for some help.  Also look at the lzw.c handout for help with I/O of byte fractions; a bit-packing sketch appears after the variation list below.  Your implementation should allow for the following variations:

1.      Update the array using the Move-To-Front heuristic. The running time of both your compress and uncompress procedures may be linear in the size of your array for each codeword processed. Note that such a run-time would be far too inefficient in practice; as extra credit (see step 6), you may implement this algorithm in a reasonably efficient manner.

2.      Update the array using the Transpose heuristic. The running time of your compress and uncompress procedures should be constant per codeword; keeping a second array that maps each codeword to its current position lets you find the accessed codeword and swap it with its immediate predecessor in constant time.
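As promised above, here is one possible way (an illustrative sketch, not the only acceptable design) to pack codes whose widths are not multiples of 8 into a binary output file, in the spirit of the lzw.c handout:

import java.io.*;

class BitWriter implements Closeable {
    private final OutputStream out;
    private int buffer = 0;   // pending bits, filled most-significant first
    private int count  = 0;   // number of pending bits in the buffer

    BitWriter(OutputStream out) { this.out = out; }

    // Write the low 'width' bits of 'code', most-significant bit first.
    void write(int code, int width) throws IOException {
        for (int i = width - 1; i >= 0; i--) {
            buffer = (buffer << 1) | ((code >> i) & 1);
            if (++count == 8) { out.write(buffer); buffer = 0; count = 0; }
        }
    }

    // Pad any final partial byte with zero bits and flush it.
    public void close() throws IOException {
        if (count > 0) out.write(buffer << (8 - count));
        out.close();
    }
}

On the input side, java.io.DataInputStream.readUnsignedShort can deliver the 16-bit original codewords during compression, and a symmetric BitReader can unpack the variable-width codes during decompression; remember that the last byte of a compressed file may contain padding bits.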

3)      Once your program is working correctly, you will compare its performance with that of two other compression algorithms.  We will provide you with a number of files to use for testing – see the Assignments page for the link (it should be up by next week).  Specifically, for each test file you will compare the performance of 4 different executions:

1.      Your program using the Move-To-Front heuristic

2.      Your program using the Transpose heuristic

3.      The lzw.c program that was discussed in lecture.

4.      One additional compression algorithm.  You may use any compression algorithm here (e.g., compress, gzip, pkzip, or others), but you must briefly explain it in the write-up as specified below.

Run all four programs on each of the test files, and for each file record the original size, the compressed size, and the compression ratio (compressed size / original size) for each of the above algorithms.  For example, a 100,000-byte file that compresses to 40,000 bytes has a compression ratio of 0.40.

4)      Write a short (~2 pages) paper that, at a minimum, discusses / elaborates on each of the following:

a)      The theory behind the self-organizing search heuristics – why they work for compression and what their limitations are.

b)      Any implementation issues / problems you encountered and how you resolved them.

c)      How the compression ratios compare for the various test files.  Where there were differences between the algorithms, be sure to explain why.  If an algorithm performed differently on different test files, explain (or speculate about) why.  Finally, speculate as to which algorithm (if any) was the best overall.

d)      For all algorithms, indicate which of the test files gave the best and worst compression ratios, and speculate as to why this was the case.  If any files did not compress at all or compressed very poorly, speculate as to why.

e)      Briefly explain the 4th algorithm that you used in your test runs.  State the algorithm and give a paragraph or two about the theory behind it.  You should be able to find information on most compression algorithms on the Web.

f)      Include in your paper all of the compression ratio results that you recorded in step 3) above.

g)      If you are in the W section, your paper should be at least 4 pages, with at least one full page explaining your choice of 4th algorithm.  As with the other projects, the paper will count more toward your overall score in the W section than in the regular sections.

5)      Submit your source code, runnable .jar file (or all .class files), write-up paper, and your Assignment Information Sheet to the appropriate submission directory.  DO NOT SUBMIT THE TEST FILES – they would take up too much space on the submission site.  As always, make sure that any executables can be run by the TAs without modification.

6)      Extra Credit:

1.      Modify your self-organizing search compression program to make the MTF algorithm more run-time efficient.  As specified, the MTF algorithm may require shifting that is linear in the size of your arrays for each codeword processed; if a large file is being compressed, the overall run-time could be prohibitive.  To reduce this overhead, a data structure such as a binary search tree can be used instead of an array to store your codewords (with a codeword's position being its position in an inorder traversal of the tree).  This requires some thought and care to implement, since you will need direct access into the tree via codeword values and you will need to carefully modify the tree with each "move to front" operation.  A sketch of the key idea follows.
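Here is a minimal Java sketch (all names are illustrative): if every node stores its subtree size, a node's inorder position, which serves as its list position for coding purposes, can be computed in time proportional to the tree height instead of by linear scanning.  A complete solution would also need balanced insertion and deletion for the move-to-front restructuring, plus a table giving direct access from each codeword to its node.

class OrderStatNode {
    OrderStatNode left, right, parent;
    int  size = 1;   // number of nodes in this subtree, including this one
    char label;      // stands in for a 16-bit codeword

    OrderStatNode(char label) { this.label = label; }

    static int size(OrderStatNode n) { return n == null ? 0 : n.size; }

    // Inorder position of this node (0 = front of the conceptual list).
    int rank() {
        int r = size(left);
        OrderStatNode child = this, p = parent;
        while (p != null) {
            if (child == p.right) r += size(p.left) + 1;
            child = p;
            p = p.parent;
        }
        return r;
    }

    public static void main(String[] args) {
        // Hand-built tree whose inorder traversal is A, B, C.
        OrderStatNode b = new OrderStatNode('B');
        OrderStatNode a = new OrderStatNode('A');
        OrderStatNode c = new OrderStatNode('C');
        b.left = a; b.right = c; a.parent = b; c.parent = b;
        b.size = 3;
        System.out.println(c.rank());   // prints 2
    }
}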

2.      Modify your self-organizing search compression algorithm to give slightly better compression by using all of the possible bit combinations for each codeword width.  As mentioned in the online notes, for bit widths of 2 and up, only half of the possible codewords of that width are used (for example, if positions are written as ordinary binary numbers, every 3-bit codeword begins with a 1, so the patterns 000-011 go unused).  To use all possible bit combinations, however, you must process the original codewords in a more complicated way (i.e., you cannot simply shift them).
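As one concrete possibility (an assumption for illustration only; follow the scheme in the online notes), letting a width-k code use all 2^k bit patterns means width k can cover the 2^k positions beginning at position 2^k - 1, twice as many as the plain binary representation allows:

public class PositionCode {
    // Hypothetical remapping: return {bits, width} for 1-based position p,
    // using ALL 2^k patterns of each width k.  Not necessarily the exact
    // scheme from the online notes.
    static int[] encodePosition(int p) {
        int k = 1;
        while (p >= (1 << (k + 1)) - 1) k++;   // 2^k - 1 <= p < 2^(k+1) - 1
        return new int[]{ p - ((1 << k) - 1), k };
    }

    public static void main(String[] args) {
        for (int p = 1; p <= 6; p++) {
            int[] c = encodePosition(p);
            System.out.println("position " + p + " -> width " + c[1]
                    + ", value " + c[0]);
        }
        // Positions 1-2 use both 1-bit patterns; positions 3-6 use all
        // four 2-bit patterns (00, 01, 10, 11).
    }
}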