CS 1501
Algorithm Implementation
Program
Important Note: Read the footnotes carefully – they contain a lot of
important information!
Due: All assignment materials must be submitted online: 1) all source files of the program, 2) all .class files (or a .jar file containing them), 3) a well-written/formatted paper as explained below, and 4) the Assignment Information Sheet, to the appropriate directory of the submission site by 11:59 PM on Tuesday, June 20, 2006.
Note 1: Do NOT submit the dictionary file or the input text files.
Note 2: If you are in the W section, be sure to see the note about the second paper below. That paper is due on the submission site one week after the on-time due date.
Late Due Date: 11:59 PM on Thursday, June 22, 2006.
Motivation and Idea:
You should now be familiar with several different searching techniques, among them:
· Linear (sequential) search of an unsorted array (or linked list), with an asymptotic runtime of Θ(N)
· Binary search of a sorted array, with an asymptotic runtime of Θ(lg N)
· Searching via hashing with collision resolution, with an expected (average-case) runtime of Θ(1)
Sometimes physical timing tests are needed to see how these asymptotic run-times translate into actual measured run-times, and that is what you will do in this assignment.
Specification and Implementation Details:
You will set up your program so that you can easily evaluate the run-time differences between the algorithms, in an object-oriented way. To do this you will use the following interface:
public interface searchTest
{
    public void insert(String s);
    public boolean find(String s);
}
You will then write three classes that implement searchTest:

public class SeqDict extends ArrayList<String> implements searchTest
public class SortDict extends ArrayList<String> implements searchTest
public class HashDict implements searchTest
For class SeqDict you can largely use the functionality of the underlying ArrayList class. In other words, the class will add the insert() and find() methods, but it will do so using the functionality that is already part of the ArrayList[1]. Consequently, SeqDict will be almost trivial to implement. SortDict will also use the underlying ArrayList to a large extent, but it will require some additional work: the insert() method must make sure the data is sorted (don't rely on the dictionary file being sorted), and the find() method must utilize binary search. Think about how you will implement these and test them thoroughly before timing them.
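As a starting point, the paragraph above can be sketched as follows. This is a minimal, illustrative sketch (not the required implementation); it repeats the searchTest interface so it compiles on its own, and it uses Collections.binarySearch as one way to meet the binary-search requirement — a hand-written binary search is equally acceptable.

```java
import java.util.ArrayList;
import java.util.Collections;

// The interface from the assignment, repeated so this sketch compiles alone.
interface searchTest {
    void insert(String s);
    boolean find(String s);
}

// A minimal sketch of SortDict. The list is kept sorted on every insert,
// so find() can use binary search.
class SortDict extends ArrayList<String> implements searchTest {
    public void insert(String s) {
        int pos = Collections.binarySearch(this, s);
        if (pos < 0) {
            pos = -(pos + 1); // binarySearch returns -(insertionPoint) - 1
        }
        add(pos, s); // keep the list sorted even if the file is not
    }

    public boolean find(String s) {
        return Collections.binarySearch(this, s) >= 0;
    }
}

public class SortDictDemo {
    public static void main(String[] args) {
        searchTest d = new SortDict();
        d.insert("banana");
        d.insert("apple");
        d.insert("cherry");
        System.out.println(d.find("apple"));  // true
        System.out.println(d.find("durian")); // false
    }
}
```

SeqDict follows the same pattern but is even simpler: insert() can just call add(), and find() can scan the list sequentially (or call contains()).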
Class HashDict,
on the other hand, must be built from scratch using double hashing, as we discussed in lecture. You
may NOT use any predefined Java hashing classes such as HashMap,
HashSet or Hashtable. You may assume that your HashDict
class will store only String objects (i.e. you do not have to generalize it for
other objects), so the underlying data structure for your hash table should be
an array of String. Regarding the hash function itself, you should
start with the predefined hashCode() method for
Strings. However, be careful with the
result – it returns a negative number for some Strings – you'll need to convert
it to a non-negative number for it to map correctly. Once you have a non-negative value, you can
then simply take it modulo the table size for your final hash index. Regarding the 2nd hash function (h2(x), used for the increment), start with the non-negative value that you determined above, hcode, and compute:
h2(x) = (hcode % dhval) + 1
where dhval is the greatest prime number less than the table size.
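The two hash functions and the double-hashing probe described above can be sketched as follows. This is an illustrative sketch only — field and method names are not prescribed, and the 75%-full resizing described below is omitted to keep it short.

```java
// A minimal sketch of HashDict's hash functions and double-hashing probe.
public class HashDictSketch {
    private String[] table = new String[19]; // initial table size M = 19
    private int dhval = 17;                  // greatest prime < 19

    // Primary hash: hashCode() can be negative, so mask off the sign bit.
    // (Math.abs is unsafe here: it fails for Integer.MIN_VALUE.)
    private int h1(String s) {
        return (s.hashCode() & 0x7fffffff) % table.length;
    }

    // Secondary hash: the probe increment, always in the range 1..dhval.
    private int h2(String s) {
        return ((s.hashCode() & 0x7fffffff) % dhval) + 1;
    }

    public void insert(String s) {
        int i = h1(s), step = h2(s);
        while (table[i] != null) {
            if (table[i].equals(s)) return; // already present
            i = (i + step) % table.length;  // double-hashing probe
        }
        table[i] = s;
    }

    public boolean find(String s) {
        int i = h1(s), step = h2(s);
        while (table[i] != null) {
            if (table[i].equals(s)) return true;
            i = (i + step) % table.length;
        }
        return false; // hit an empty slot, so s is not in the table
    }

    public static void main(String[] args) {
        HashDictSketch d = new HashDictSketch();
        d.insert("hello");
        d.insert("world");
        System.out.println(d.find("hello")); // true
        System.out.println(d.find("there")); // false
    }
}
```

Because the table size is prime and the increment is between 1 and dhval (both less than the table size), the probe sequence visits every slot, which is why double hashing requires a prime table size.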
You should keep your hash table at most 75% full, so you must incorporate a resizing ability into your insert() method. Specifically, if the table gets over 75% full, you must create a new array that is approximately twice the size of the previous array, and rehash all of the data into the new table. Since (as we discussed in lecture) for double hashing the table size should be a prime number, your new table size should be the smallest prime number greater than twice the previous table size. You will also need to modify your 2nd hash function when you rehash, so that it remains effective. Again make it the greatest prime number less than the new table size. For example, if your previous table size was M = 83 and dhval = 79, you would need to resize after inserting the 63rd item into the table. At that time you would:
1) allocate a new table of size M = 167 and a new dhval = 163
2) hash all of the old data into the new table (be sure to take the new size and dhval into account when hashing)
3) assign your table variable to your new table (the old array will be garbage collected).
To efficiently test for primality, see the isProbablePrime() method of the BigInteger class[2],[3]. For consistency, in your default constructor, initialize your table size, M = 19 and your double hash mod value, dhval = 17.
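The prime calculations for resizing can be sketched with BigInteger.isProbablePrime() as suggested above. The helper names here are illustrative; the worked example from the text (M = 83 growing to 167, dhval 163) is reproduced in main().

```java
import java.math.BigInteger;

// A sketch of the prime calculations needed when resizing the hash table.
public class PrimeSketch {
    // Smallest prime strictly greater than n (the new table size M).
    static int smallestPrimeAbove(int n) {
        int c = n + 1;
        while (!BigInteger.valueOf(c).isProbablePrime(30)) c++;
        return c;
    }

    // Greatest prime strictly less than n (the new dhval).
    static int greatestPrimeBelow(int n) {
        int c = n - 1;
        while (!BigInteger.valueOf(c).isProbablePrime(30)) c--;
        return c;
    }

    public static void main(String[] args) {
        int oldM = 83;
        int newM = smallestPrimeAbove(2 * oldM);   // 167, as in the example
        int newDhval = greatestPrimeBelow(newM);   // 163
        System.out.println(newM + " " + newDhval); // prints "167 163"
    }
}
```

Note that these helpers also produce the default constructor values: greatestPrimeBelow(19) is 17, matching the required initial dhval.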
Once you have written and tested all of your classes, you will compare their efficiencies with a simple "spell check" procedure in the following way:
· Create an array of searchTest of size three. Store a SeqDict object in location 0, a SortDict object in location 1 and a HashDict object in location 2.
· Read in the words from a dictionary file whose name was input from the command line, inserting a copy of each word into each of the dictionary objects (in an object-oriented way – accessing the objects only through the searchTest interface – specifically the insert() method).
· Read in the text file once without doing any searches, in order to calculate the time required for inputting and parsing the file (this is the fileTime used below). This is needed in order to separate the search time from the file I/O and parsing time.
· Read in the text file that you will "spell check" using each of the dictionary objects. For each dictionary object, proceed in the following way:
1) Open the text file, using the file name input from the command line[4].
2) Record the start time (see System.nanoTime())
3) Read the file in a line at a time. For each line in the file:
a) Parse the line into words using the StringTokenizer[5] class
b) Check for each of the words in the current dictionary using the find() method. Keep track of the number of words found (i.e. spelled correctly) and the number not found (i.e. misspelled).
4) Record the stop time (see System.nanoTime())
5) Calculate the time required per search in the following way:
a) Calculate the overall time for the searches by: [stop time] – [start time] – [fileTime]. Note that this will be in nanoseconds by default. For readability you may want to convert it into a larger value such as microseconds or milliseconds (if possible).
b) Calculate the average time per search by: [overall time]/[number of searches]. Use a double value to maximize the precision[6].
Note that if the searches are very fast (ex: hashing) the result from a) may be very small, or even negative, due to run-time issues (ex: the computer may be doing some other task during part of the execution). To account for this, you may want to run the program several times and report representative results.
6) Output the following results:
a) Total words checked
b) Number of words found
c) Number of words not found
d) Total time required for the searches
e) Average time required per search
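The timing procedure in steps 2)-5) can be sketched as follows. This is an illustrative sketch under stated assumptions: the file has already been read into a list of lines, fileTime has been measured separately, and the stand-in dictionary here is a HashSet-backed stub used only so the sketch runs on its own (it is NOT a valid HashDict for the assignment).

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// The interface from the assignment, repeated so this sketch compiles alone.
interface searchTest {
    void insert(String s);
    boolean find(String s);
}

public class TimingSketch {
    static final String DELIMS = " \t\n\r\f.,!?\";:()[]"; // footnote 5

    // Steps 3a/3b: tokenize each line, look every word up in dict,
    // and return {found, notFound}.
    static long[] spellCheck(searchTest dict, List<String> lines) {
        long found = 0, notFound = 0;
        for (String line : lines) {
            StringTokenizer tok = new StringTokenizer(line, DELIMS);
            while (tok.hasMoreTokens()) {
                if (dict.find(tok.nextToken())) found++;
                else notFound++;
            }
        }
        return new long[] { found, notFound };
    }

    public static void main(String[] args) {
        // A stand-in dictionary just for this sketch (NOT a valid HashDict).
        final java.util.Set<String> words =
            new java.util.HashSet<>(Arrays.asList("the", "cat", "sat", "mat"));
        searchTest dict = new searchTest() {
            public void insert(String s) { words.add(s); }
            public boolean find(String s) { return words.contains(s); }
        };
        List<String> lines = Arrays.asList("the quick cat sat, on [the] mat!");

        long fileTime = 0L; // baseline read/parse time, measured separately
        long start = System.nanoTime();
        long[] counts = spellCheck(dict, lines);
        long stop = System.nanoTime();

        long overall = stop - start - fileTime; // nanoseconds by default
        // Cast BEFORE dividing to avoid integer truncation (footnote 6).
        double avgPerSearch = (double) overall / (counts[0] + counts[1]);

        System.out.println("found=" + counts[0] + " notFound=" + counts[1]);
        System.out.println("avg ns/search=" + avgPerSearch);
    }
}
```

Note that the word counts returned by spellCheck() depend only on the dictionary contents, not on which implementation is used, which is exactly the consistency check described below.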
Note that a single run of your
program must test all of the algorithms for a given dictionary file and input
file. The number of words found and not
found should be the same for each of your dictionary implementations – you
should use this fact to test the correctness of each implementation[7]. Test your program using each of the following
dictionary files: words1.txt
words2.txt
words3.txt
and with the following text files: medium.txt
large.txt. Note that, depending on the speed of your
computer, the run-times for these input files will vary, and they could be
fairly long, especially for the largest dictionary file and the worst
dictionary implementation. Keep this in mind and allow enough time to complete all of your test runs.
Once you have completed your test runs, write a short (~2 page, double-spaced) paper summarizing your results. In your paper, be sure to include:
· A table showing the results for all of your runs. Indicate average time per search vs. number of words in the dictionary.
· Comparison of the total search time and average time per search for each of the different search algorithms. Discuss how they compare as the dictionary file size increases and which algorithm is preferable overall (in real terms – i.e. if an actual "spell check" of a file were being done – would all implementations even be feasible for a real spell checker?).
· Discussion, for each algorithm, of how the asymptotic run-times and the actual measured run-times compare. To do this you will need to determine how the measured run-times grow as the dictionary size N increases.
Important Notes:
[1] Note that the parameter <String> allows the superclass ArrayList to be specified for Strings only. This should not affect your subclass implementations. For more information on class parameters and Java generics, see http://java.sun.com/j2se/1.5/pdf/generics-tutorial.pdf.
[2] We will discuss primality in more detail soon when we look at RSA encryption.
[3] To save on computation during resizing, you can easily pre-calculate (off-line, prior to running the program) a list of prime numbers to be used for the table sizes and put them into an array. Since the table size doubles each time, even 20 values will allow very large tables to be generated (much larger than you will need). You can use a similar approach for the dhval values.
[4] So the execution of the program might be >java assig2 words1.txt large.txt
[5] Use the following delimiter string for your StringTokenizer: " \t\n\r\f.,!?\";:()[]"
[6] To avoid the truncation associated with integer division, be sure to cast one of your operands to double PRIOR to doing the actual division.
[7] I will put the expected number of found and not found words (assuming the delimiters specified in footnote 5) on the course website so that you can check your results.