CS 1501

Algorithm Implementation

Programming Project 2

Important Note: Read the footnotes carefully – they contain a lot of important information!

 

Online: Tuesday, June 06, 2006

Due: All assignment materials: 1) all source files of the program, 2) all .class files (or a .jar file containing them), 3) the well-written/formatted paper as explained below, and 4) the Assignment Information Sheet, in the appropriate directory of the submission site by 11:59 PM on Tuesday, June 20, 2006.  Note 1: Do NOT submit the dictionary file or the input text files.  Note 2: If you are in the W section, be sure to see the note about the second paper below.  This paper is due on the submission site one week after the on-time due date.

Late Due Date: 11:59PM on Thursday, June 22, 2006.

 

Motivation and Idea:

You should now be familiar with several different searching techniques, among them:

·        Linear (sequential) search of an unsorted array (or linked list), with an asymptotic runtime of Θ(N)

·        Binary search of a sorted array with an asymptotic runtime of Θ(lgN)

·        Searching through hashing with collision resolution, with an expected (average-case) runtime of Θ(1)

 

Sometimes physical timing is the best way to see real differences between algorithms, so in this project you will compare the above algorithms through actual timing.  You will then write a short paper comparing your results and commenting on your findings.

 

Specification and Implementation Details:

You will set up your program so that you can easily evaluate the run-time differences between the algorithms, in an object-oriented way.  To do this you will use the following interface:

public interface searchTest

{

     public void insert(String s);

     public boolean find(String s);

}

You will then write three classes that implement searchTest:

     public class SeqDict extends ArrayList<String> implements searchTest

     public class SortDict extends ArrayList<String> implements searchTest

     public class HashDict implements searchTest

For class SeqDict you can largely use the functionality of the underlying ArrayList class.  In other words, the class will add the insert() and find() methods but it will do so using the functionality that is already part of the ArrayList[1].  Consequently, SeqDict will be almost trivial to implement.  SortDict will also use the underlying ArrayList to a large extent, but it will require some additional work: the insert() method must make sure the data is sorted (don't rely on the dictionary file being sorted) and the find() method must utilize binary search.  Think about how you will implement these and test them thoroughly before timing them – refer to your CS 0445 textbook for help with binary search, interfaces and object-oriented programming.
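To make the binary-search requirement concrete, here is one possible sketch of SortDict.  The demo class name and sample words are illustrative, and the library call Collections.binarySearch simply stands in for a binary search you may be asked to write yourself:

```java
import java.util.ArrayList;
import java.util.Collections;

// The interface is repeated here only so the sketch compiles on its own.
interface searchTest {
    void insert(String s);
    boolean find(String s);
}

// Collections.binarySearch returns the index of a match, or
// (-(insertion point) - 1) when the key is absent -- which tells insert()
// exactly where to place a new word so the list stays sorted.
class SortDict extends ArrayList<String> implements searchTest {
    public void insert(String s) {
        int pos = Collections.binarySearch(this, s);
        if (pos < 0) {
            add(-(pos + 1), s);   // insert at the computed insertion point
        }
    }
    public boolean find(String s) {
        return Collections.binarySearch(this, s) >= 0;
    }
}

public class SortDictDemo {
    public static void main(String[] args) {
        searchTest d = new SortDict();   // access only through the interface
        d.insert("cherry");
        d.insert("apple");
        d.insert("banana");
        System.out.println(d.find("banana"));   // true
        System.out.println(d.find("durian"));   // false
    }
}
```

Note how the demo touches the object only through the searchTest reference, which is exactly the style the main program must use.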

 

Class HashDict, on the other hand, must be built from scratch using double hashing, as we discussed in lecture.  You may NOT use any predefined Java hashing classes such as HashMap, HashSet or Hashtable.  You may assume that your HashDict class will store only String objects (i.e. you do not have to generalize it for other objects), so the underlying data structure for your hash table should be an array of String.  Regarding the hash function itself, you should start with the predefined hashCode() method for Strings.  However, be careful with the result – it returns a negative number for some Strings – you'll need to convert it to a non-negative number for it to map correctly.  Once you have a non-negative value, you can then simply take it modulo the table size for your final hash index.  Regarding the 2nd hash function (h2(x), used for the increment), start with the non-negative value that you determined for the primary hash function.  Call this value hcode.  Your h2(x) should then be

h2(x) = (hcode % dhval) + 1

where dhval is the greatest prime number less than the table size.
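A minimal sketch of the two hash computations above (the method names and the sample word are illustrative; in your HashDict, M and dhval would be instance fields):

```java
// Sketch of the primary hash and the double-hash increment described above.
public class HashSketch {
    // String.hashCode() may be negative; masking off the sign bit is a safe
    // way to get a non-negative value (Math.abs fails for Integer.MIN_VALUE).
    static int hcode(String s) {
        return s.hashCode() & 0x7fffffff;
    }
    static int h1(String s, int M) {
        return hcode(s) % M;              // primary table index, 0..M-1
    }
    static int h2(String s, int dhval) {
        return (hcode(s) % dhval) + 1;    // probe increment, always >= 1
    }
    public static void main(String[] args) {
        int M = 19, dhval = 17;           // the default constructor values
        System.out.println(h1("hello", M) + " " + h2("hello", dhval));
    }
}
```

The "+ 1" in h2 matters: an increment of 0 would make the probe sequence loop forever on one slot.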

 

You should keep your hash table at most 75% full, so you must incorporate a resizing ability into your insert() method.   Specifically, if the table gets over 75% full, you must create a new array that is approximately twice the size of the previous array, and rehash all of the data into the new table.  Since (as we discussed in lecture) for double hashing the table size should be a prime number, your new table size should be the smallest prime number greater than twice the previous table size.  You will also need to modify your 2nd hash function when you rehash so that it remains effective: again, make it the greatest prime number less than the new table size.  For example, if your previous table size was M = 83 and dhval = 79, you would need to resize after inserting the 63rd item into the table.  At that time you would:

1)      allocate a new table of size M = 167 and a new dhval = 163

2)      hash all of the old data into the new table (be sure to take the new size and dhval into account when hashing)

3)      assign your table variable to your new table (the old array will be garbage collected).

To efficiently test for primality, see the isProbablePrime() method of the BigInteger class[2],[3].  For consistency, in your default constructor, initialize your table size, M = 19 and your double hash mod value, dhval = 17.
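The prime-hunting part of the resize step might be sketched as follows (the method names are illustrative); it reproduces the M = 83 example above:

```java
import java.math.BigInteger;

// Sketch of the resize bookkeeping: find the smallest prime greater than
// twice the old table size, and the greatest prime below the new size
// to serve as the new dhval.
public class ResizeSketch {
    static boolean isPrime(int n) {
        // isProbablePrime(30) has a negligible error chance for table sizes
        return BigInteger.valueOf(n).isProbablePrime(30);
    }
    static int nextPrimeAbove(int n) {
        int p = n + 1;
        while (!isPrime(p)) p++;
        return p;
    }
    static int greatestPrimeBelow(int n) {
        int p = n - 1;
        while (!isPrime(p)) p--;
        return p;
    }
    public static void main(String[] args) {
        int oldM = 83;
        int newM = nextPrimeAbove(2 * oldM);      // 167
        int newDhval = greatestPrimeBelow(newM);  // 163
        System.out.println(newM + " " + newDhval);
    }
}
```

If you use the pre-computed array of primes suggested in the footnote instead, these searches disappear entirely; the sketch just shows what those table entries would contain.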

 

Once you have written and tested all of your classes, you will compare their efficiencies with a simple "spell check" procedure in the following way:

·        Create an array of searchTest of size three.  Store a SeqDict object in location 0, a SortDict object in location 1 and a HashDict object in location 2.

·        Read in the words from a dictionary file whose name was input from the command line, inserting a copy of each word into each of the dictionary objects (in an object-oriented way – accessing the objects only through the searchTest interface – specifically the insert() method).

·        Read in the text file in order to calculate the time required for inputting and parsing the file.  This is needed in order to determine the actual time required for the searches later on.  To do this, follow the procedure in parts 1), 2), 3a) and 4) below.  Call the result fileTime.

·        Read in the text file that you will "spell check" using each of the dictionary objects.  For each dictionary object, proceed in the following way:

1)      Open the text file, using the file name input from the command line[4].

2)      Record the start time (see System.nanoTime())

3)      Read the file in a line at a time.   For each line in the file:

a)      Parse the line into words using the StringTokenizer[5] class

b)      Check for each of the words in the current dictionary using the find() method.  Keep track of the number of words found (i.e. spelled correctly) and the number not found (i.e. misspelled).

4)      Record the stop time (see System.nanoTime())

5)      Calculate the time required per search in the following way:

a)      Calculate the overall time for the searches by: [stop time] – [start time] – [fileTime].  Note that this will be in nanoseconds by default.  For readability you may want to convert it into a larger unit such as microseconds or milliseconds (if possible).

b)      Calculate the average time per search by: [overall time]/[number of searches].  Use a double value to maximize the precision[6].

Note that if the searches are very fast (ex: hashing) the result from a) may be very small, or even negative, due to run-time issues (ex: computer may be doing some other task during part of the execution).  To minimize this likelihood, it is best to close all other applications when running your simulation.  Even so, if you get a negative value for a) you should handle this situation – treating it as a minimal run-time per search.

6)      Output the following results:

a)      Total words checked

b)      Number of words found

c)      Number of words not found

d)      Total time required for the searches

e)      Average time required per search

Note that a single run of your program must test all of the algorithms for a given dictionary file and input file.  The number of words found and not found should be the same for each of your dictionary implementations – you should use this fact to test the correctness of each implementation[7].  Test your program using each of the following dictionary files: words1.txt, words2.txt, and words3.txt, and with each of the following text files: medium.txt and large.txt.  Note that, depending on the speed of your computer, the run-times for these input files will vary, and they could be fairly long, especially for the largest dictionary file and the worst dictionary implementation.  Keep in mind that since we are dealing with search of the dictionary, the file size of interest here is that of the dictionary file (number of words), not the text file.  The text file only determines how many searches we need to do, while the dictionary file determines how long each search will take.  The larger text file (large.txt) should give better results, since the more searches that are done, the more reliable the timing results will be.  However, I have provided a smaller text file (medium.txt) to use for debugging purposes.
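The timing loop for a single dictionary (steps 2 through 5 above) might be condensed as in the sketch below.  To keep it self-contained and runnable, a HashSet stands in for the dictionary object and a StringReader stands in for the text file; the real program would go through the searchTest interface and a FileReader, and would subtract a measured fileTime rather than zero:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.StringTokenizer;

public class TimeSketch {
    // The delimiter string from the footnote
    static final String DELIMS = " \t\n\r\f.,!?\";:()[]";

    public static void main(String[] args) throws Exception {
        HashSet<String> dict = new HashSet<String>();     // stand-in dictionary
        dict.add("the"); dict.add("cat"); dict.add("sat");
        String text = "The cat sat, on the matt!";        // stand-in "file"
        long fileTime = 0;          // measured separately in the real program

        long start = System.nanoTime();                   // step 2
        long found = 0, missed = 0, searches = 0;
        BufferedReader in = new BufferedReader(new StringReader(text));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            StringTokenizer st = new StringTokenizer(line, DELIMS);  // step 3a
            while (st.hasMoreTokens()) {                             // step 3b
                if (dict.contains(st.nextToken())) found++; else missed++;
                searches++;
            }
        }
        in.close();
        long total = System.nanoTime() - start - fileTime;  // steps 4, 5a
        if (total < 0) total = 0;       // clamp a negative reading (see note)
        double avg = (double) total / searches;  // cast BEFORE dividing (5b)

        System.out.println(found + " " + missed + " " + searches);
        System.out.println(avg >= 0.0);
    }
}
```

Note that the comparison is case-sensitive here, so "The" counts as not found; decide (and document) how your own program treats capitalization so all three dictionaries are judged the same way.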

 

Once you have completed your test runs, write a short (~2 page, double-spaced) paper summarizing your results.  In your paper, be sure to include:

·        A table showing the results for all of your runs.  Indicate average time per search vs. number of words in the dictionary.

·        Comparison of the total search time and average time per search for each of the different search algorithms.  Discuss how they compare as the dictionary file size increases and which algorithm is preferable overall (in real terms – i.e. if an actual "spell check" of a file were being done – would all implementations even be feasible for a real spell checker?).

·        Discussion for each algorithm of how the asymptotic run-times and the actual measured run-times compare.  To do this you will need to determine the growth rates of the actual run-times and rectify them with the asymptotic run-times.  A good way to informally see the growth rates of the actual run-times is by using a graph.

 

Important Notes:

·        Be sure to include all of your source files, your executable .class files (or .jar file) and any write-up files in your submission.  However, DO NOT submit your data files – the TA already has these and submitting them will just waste space on the server.

·        Note that there are different parts to this project that can be demonstrated even if you don't complete the entire project.  For example, if you cannot get the hash table class to work you can extend the predefined Hashtable class and still run your main program.  You will lose credit for the hash table but you can still get credit for the main program and the write-up.

·        If you want to do some extra credit, here is an idea:  Add one additional dictionary implementation (still using the searchTest interface).  The 4th implementation will be the DLB that you implemented in Assignment 1 (with some minor modifications).  For full extra credit be sure to include all results and analysis for the DLB implemented dictionary.

·        Additional Work for W Section: In addition to all of the requirements above, write a short (~5 pages, double spaced) research paper on hashing.  Include a discussion on the origins of hashing, how various hashing techniques developed, and how various hashing techniques compare.  Be sure to discuss both hash functions and collision resolution techniques.  Use and cite MULTIPLE references.  Since this paper may require some time to research and write, it will be due one week after the project due date.


[1] Note that the parameter <String> allows the superclass ArrayList to be specified for Strings only.  This should not affect your subclass implementations.  For more information on class parameters and Java generics, see http://java.sun.com/j2se/1.5/pdf/generics-tutorial.pdf.

[2] We will discuss primality in more detail soon, when we look at RSA encryption.

[3] To save on computation during resizing, you can easily pre-calculate (off-line, prior to running the program) a list of prime numbers to be used for the table sizes and put them into an array.  Since the table size doubles each time, even 20 values will allow very large tables to be generated (much larger than you will need).  You can use a similar approach for the dhval values.

[4] So the execution of the program might be >java assig2 words1.txt large.txt

[5] Use the following delimiter string for your StringTokenizer: " \t\n\r\f.,!?\";:()[]"

[6] To avoid the truncation associated with integer division, be sure to cast one of your operands to double PRIOR to doing the actual division.

[7] I will put the expected number of found and not found words (assuming the StringTokenizer indicated) online sometime next week.