CS 1501

Hashing Summary

Basic Idea:

Use a hash function to map keys into the index range of a hash table. Call the hash function h(x), let the index range be 0..(M-1), where M is the size of the hash table, and let the table be an array, T, of some item type. Note: A more proper object-oriented implementation would have the table be an object with the various operations as member functions. However, we will keep it simple in this discussion.

Simplistically, now an Insert(T, item) function will be as follows:

i = h(item);

T[i] = item;

And a Find(T, item) function will be as follows:

i = h(item);

if (T[i] == item) found

else not_found

Based on these simplistic functions, it is clear that the hash table operations will have O(1) run-times, assuming h(x) can be calculated in constant time. We will see that the actual run-times are still O(1) in the average case, but can degrade to O(N) in the worst case.
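
To make this concrete, below is a minimal sketch in Java of the simplistic table just described. It assumes non-negative integer keys and (unrealistically) that no collisions ever occur; the class and method names are just for illustration.

// Minimal sketch of the simplistic hash table (Java).
// Assumes non-negative integer keys and, unrealistically, no collisions.
public class SimpleHashTable {
    private static final int EMPTY = -1;    // sentinel marking an unused slot
    private final int M;                    // table size
    private final int[] T;                  // the table itself

    public SimpleHashTable(int m) {
        M = m;
        T = new int[M];
        java.util.Arrays.fill(T, EMPTY);
    }

    private int h(int x) { return x % M; }  // simple hash function

    public void insert(int item) { T[h(item)] = item; }

    public boolean find(int item) { return T[h(item)] == item; }
}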

Complicating Factors:

The key space (possible values of all potential keys), K, is usually larger (often much larger) than the table size, M. Note that this does not mean that the actual number of keys, N, is larger than M (which cannot be true for the Open Addressing schemes described below). For example, if an employer wants to use Social Security Numbers as keys, even if he/she has only 100 employees (N = 100), the key space is still the number of possible Social Security Numbers (K = 10^9, since an SSN has nine digits), since the employer may not know in advance the SSN of each employee he has (or may have in the future).

If (K > M), by the Pigeonhole Principle, we know that we cannot have a mapping of each potential key into a unique index in the array. Thus we have the possibility of a collision.

Define a collision to be the situation in which

h(x1) == h(x2), where x1 != x2

With the possibility of collisions, our simplistic hash table functions above fail. We must modify them to handle, or to resolve collisions.

Reducing Collisions:

Although most often we cannot eliminate collisions, we can try to reduce the probability of their occurrence. The best way to do this is to choose the hash table size, M, to be a prime number and to choose h(x) in an intelligent way.

If keys are hashed to "randomly" selected locations in the table, collisions will only occur due to random chance, based on the number of possible keys and the degree to which the table is already filled. If, however, patterns develop in the hashing of keys, the potential for many collisions occurs and the performance of the hash table will degrade quickly. An example of a poor hash function would be using the first three digits of a telephone number for Pitt students. Since many students have the same leading phone number digits (683, 624, 648, etc.) and since some exchanges are not allowed (ex. 911, 411), there will be many collisions at some table locations, while others will have zero probability of being hashed to. Using the last three digits is just as simple, yet much more effective, since the last three digits of a phone number tend to be more "random" than the first three.

A simple, effective hash function for random integers is

h(x) = x mod M
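
In actual code there is one small wrinkle: in a language such as Java, the % operator can return a negative result when x is negative, so a sketch of this hash function might use Math.floorMod to keep the result in the range 0..(M-1).

// Sketch of h(x) = x mod M for integer keys (Java).
// Math.floorMod keeps the result in 0..(M-1) even when x is negative.
static int h(int x, int M) {
    return Math.floorMod(x, M);
}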

For more complicated keys (such as strings, for example), we can think of the hash function in two steps:

    1. Using all of the key (or as much as is practical), convert the key into a large integer
    2. Use the simple hash function for random integers to find the index in the table

It is important that the value as well as the position of each character in the string be taken into account (otherwise permutations of strings would hash to the same locations). This can be done by multiplying the ASCII value of each character by a different power of some constant (the number of total characters is a good choice). To avoid overflow due to high powers of relatively large integers (for example, with the simple ASCII set, 128 would be the base, and 128^5 is already past the range of a 32-bit integer) we can use Horner's Method, as described on p. 232-234 of the Sedgewick text.
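
A sketch of this two-step string hash, using Horner's Method, might look like the following in Java. The base 128 assumes the simple ASCII character set mentioned above; taking mod M after every step keeps the running value small, so large powers of 128 are never actually computed.

// Sketch of a string hash using Horner's Method (Java).
// Base 128 assumes the simple ASCII set; reducing mod M at every step
// avoids computing (and overflowing on) large powers of 128.
static int hash(String key, int M) {
    int h = 0;
    for (int i = 0; i < key.length(); i++) {
        h = (128 * h + key.charAt(i)) % M;   // Horner's Method, mod M
    }
    return h;
}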

Resolving Collisions

There are two different general approaches to maintaining hash tables:

    1. Place keys directly into table locations. If an Insert collision occurs at a location, clearly the new item must be placed into a different location (or a different address). This approach is called open addressing, since the address actually used may be different from the one selected by h(x).
    2. Make the hash table an array of some other collection ADT. Thus, the value selected by h(x) simply indicates which collection to use for the Insert (or Find, or Delete) function. Though the name is not commonly used, we could call this technique closed addressing, since the index selected by h(x) is always the one that is used.

Open Addressing Schemes

The simplest open addressing scheme is linear probing. Using linear probing, when a collision occurs at location h(x), simply increment the index (mod M) until the collision is resolved. The actual resolution differs depending on the operation. For Insert, probing continues until an empty location is found, and the new item is placed there. For Find, probing continues until either:

      1. the key is located, indicating found
      2. an empty location is discovered, indicating not_found

Note that for both operations, if the table is full (N == M) there is the possibility that the search may cycle back to the original location, h(x). The operations should account for this possibility, but, in reality, we should not let the table get even close to being full.
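
Below is a minimal sketch of linear probing in Java, assuming String keys, null as the empty marker, and a table that is never allowed to become full (no resizing or Delete shown).

// Sketch of linear probing (Java): String keys, null marks an empty slot.
// Assumes the table is never allowed to become completely full.
public class LinearProbingTable {
    private final int M;
    private final String[] T;

    public LinearProbingTable(int m) { M = m; T = new String[M]; }

    private int h(String key) { return Math.floorMod(key.hashCode(), M); }

    public void insert(String key) {
        int i = h(key);
        while (T[i] != null) {              // probe until an empty slot
            if (T[i].equals(key)) return;   // key already present
            i = (i + 1) % M;                // increment the index, mod M
        }
        T[i] = key;
    }

    public boolean find(String key) {
        for (int i = h(key); T[i] != null; i = (i + 1) % M) {
            if (T[i].equals(key)) return true;   // key located: found
        }
        return false;                       // empty location reached: not_found
    }
}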

Define the load factor, α, to be the ratio of the number of items in the table to the table size, or

α = N/M

As α → 1, linear probing degenerates very quickly toward O(N) for the hash table operations, due in part to the phenomenon of clustering. Ideally, for a random key, a good hash function and an empty table, the probability that it will be hashed into any location in the table should be the same, or

p(h(x)=i) = 1/M for all i, 0 <= i < M

However, once a location is filled, using the linear probing resolution technique, an Insert hashed to the filled location will be placed in the location immediately following. Extending this idea, if a cluster of successive locations in the table becomes filled, an Insert hashed to any of those locations will be placed at the first open location after the cluster. Thus,

p(i_C) = (C+1)/M

where p(i_C) is the probability that a new item will be placed into location i, the first open location after a cluster of length C. Thus, as clusters get longer, the probability that they get even longer increases.

Define a probe to be an access of a table location. As clusters increase in length, the average number of probes needed for Insert and Find operations also increases. This is especially evident for unsuccessful Finds, since they will only terminate at an empty location (at the end of the cluster). Thus, as α increases, cluster length increases and thus the run-times for Insert and Find increase, approaching O(N) for α close to 1.

An alternative to linear probing is double hashing. The idea behind double hashing is to make the probing increment dependent upon the key (rather than just being 1, as with linear probing). This way, when a location in the table becomes filled, an Insert collision on that location will NOT necessarily be placed in the next location, but rather some increment down the table. This reduces clusters, since an Insert collision in location i could result in the item being placed (ideally) in any empty location, rather than the first empty location after i (as with linear probing).

The double hashing increment can be calculated in various ways, but the idea is to make it independent of the first hash function, so that two keys that collide with h(x) will still generate different double hashing increments, and they will not collide on successive probes. The text discusses some ways to generate effective double hash increments. An easy way to see the similarities and differences between linear probing and double hashing is to note that linear probing is really a degenerate case of double hashing, with h2(x) = 1 for all x.

For double hashing to work and be effective, we must guarantee that

    1. The increment will never be 0 (otherwise we would never check more than one location). This usually is not difficult to do.
    2. Every location in the table will be probed once before any is probed twice. If this condition does not hold, we could utilize only part of the table and the performance would suffer. For example, if M = 16 and h2(x) = 4, only 4 distinct locations in the table would ever be probed (try it to see). To ensure that all locations in the table will (potentially) be tried, we must be sure that the h2(x) value is always relatively prime with the table size. Since a prime number is relatively prime with any other number, making the table size itself prime is a simple way to ensure this condition.
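
A sketch of double hashing in Java appears below; it differs from the linear probing sketch only in the probe increment. The particular second hash function h2 used here is just one simple choice that satisfies both conditions above (it is never 0, and with a prime table size it is always relatively prime with M), not the only possibility.

// Sketch of double hashing (Java).  M should be prime so that any
// increment in 1..(M-1) is relatively prime with the table size.
public class DoubleHashingTable {
    private final int M;
    private final String[] T;

    public DoubleHashingTable(int m) { M = m; T = new String[M]; }

    private int h(String key)  { return Math.floorMod(key.hashCode(), M); }

    // Second hash: always in 1..(M-1), so the increment is never 0.
    private int h2(String key) { return 1 + Math.floorMod(key.hashCode(), M - 1); }

    public void insert(String key) {
        int inc = h2(key);
        int i = h(key);
        while (T[i] != null) {
            if (T[i].equals(key)) return;
            i = (i + inc) % M;              // step by the key-dependent increment
        }
        T[i] = key;
    }

    public boolean find(String key) {
        int inc = h2(key);
        for (int i = h(key); T[i] != null; i = (i + inc) % M) {
            if (T[i].equals(key)) return true;
        }
        return false;
    }
}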

For light load factors, there is little performance difference between linear probing and double hashing. However, as α gets bigger (ex. 0.6, 0.7, 0.8) a marked performance difference between the two methods becomes evident. This difference has been investigated analytically and empirically, and formulas for the average number of probes for each method are shown in the text. Note, however, that as α → 1, both techniques break down and performance approaches O(N). Thus, when using open addressing in general, it is a good idea to keep your table from getting too full.

Deletion with Open Addressing Schemes

A shortcoming of both open addressing schemes is the Delete operation. Recall that the Find function terminates unsuccessfully when it encounters an empty location in the table. However, if that empty location was the result of a deletion, the data may actually be present, further down the table. For example, consider the tables below, and assume that empty locations are marked with the value -1. Assume further that Data X, Y and Z all hash to location i.

Index   Data            Index   Data
i       X               i       X
i+1     Y               i+1     -1
i+2     Z               i+2     Z

A Find for key Z in the left table would proceed through location i and i+1 and succeed at location i+2. However, if key Y is deleted from the table, as shown in the right table (and the location remarked with -1), a Find for key Z would terminate unsuccessfully at location i+1.

To "fix" this problem we can have three states for each table location: empty, full and deleted. This way, a deleted location will terminate an Insert (a new item may be placed there) but it will not terminate a Find (it would proceed to the next location). Besides complicating the implementation, this "fix" has the drawback that, after long use, a hash table will eventually have few if any empty locations left. Since unsuccessful Finds only stop at empty locations, the time for unsuccessful Finds will again approach the worst case of O(N). To "fix" the "fix" we can periodically "rehash" everything in the table, reinitializing the locations to empty and full. However, this also has a lot of overhead, and, practically speaking, we may not want to use open addressing schemes if Delete will be needed.

Fortunately, there are applications for which Delete is not needed. For example, a compiler may keep track of declared identifiers with a hash table. Variable declarations cause an Insert and variable uses cause a Find. If the Find is unsuccessful, the compiler gives you the friendly "Undefined symbol" error, meaning you tried to use a variable that you did not declare. This can be done without requiring delete (local tables can be used for sub-blocks and the entire table destroyed when the block terminates).

Closed Addressing - Separate Chaining

Recall that with closed addressing the hash function will merely select an individual collection ADT from an array of collection ADTs. With this approach some of the problems inherent to open addressing go away:

    1. Large α values do not seriously degrade the hash table performance. Since each index represents a collection of items, a higher load factor simply means that the individual collections will have more items in them on average. Thus, performance will gracefully degrade as α increases, even if it is larger than 1.
    2. Delete is not a problem at all. In this case we are simply doing all of our operations (Insert, Find and Delete) on a collection ADT which can handle them in an appropriate way. The hash table simply selects the collection to use.

However, having an array of collection ADTs adds considerable overhead in itself, especially with regard to memory use. Thus, it would be prudent to make the collections simple, easy to implement, and low in memory use (for example, an array implemented ADT would not be a good idea, since we would have an array of arrays, which could require quite a lot of memory). Separate chaining, in which each ADT is an unsorted linked list, is a good choice.

In separate chaining the hash function simply chooses the linked list, and then a linked list Find, Insert or Delete is performed. Since the list is dynamic, we will only allocate memory for the items that are inserted into the list, and will not waste a lot of memory (just an extra pointer for each item). From previous experience, we know that unsorted linked lists do not give good performance for Find, Insert or Delete (O(N) in the average and worst cases). However, the goal in this case is to keep the lists very short by hashing keys throughout the table. Thus, the added memory of using a sorted array (so binary search can be used for Find) or the added complexity (and overhead) of using a binary search tree at each index is unjustified. For example, how much time would be saved doing a binary search of an array of size 3 over doing a sequential search of a linked list with 3 nodes?
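
A sketch of separate chaining in Java is shown below: the table is simply an array of unsorted singly linked lists, and h(x) merely selects which list to operate on. For brevity, Insert prepends without checking for duplicates.

// Sketch of separate chaining (Java): an array of unsorted linked lists.
public class SeparateChainingTable {
    private static class Node {
        String key;
        Node next;
        Node(String k, Node n) { key = k; next = n; }
    }

    private final int M;
    private final Node[] T;

    public SeparateChainingTable(int m) { M = m; T = new Node[M]; }

    private int h(String key) { return Math.floorMod(key.hashCode(), M); }

    public void insert(String key) {
        int i = h(key);
        T[i] = new Node(key, T[i]);          // prepend to the chosen list
    }

    public boolean find(String key) {
        for (Node n = T[h(key)]; n != null; n = n.next)
            if (n.key.equals(key)) return true;
        return false;
    }

    public void delete(String key) {
        int i = h(key);
        Node prev = null;
        for (Node n = T[i]; n != null; prev = n, n = n.next) {
            if (n.key.equals(key)) {
                if (prev == null) T[i] = n.next;
                else prev.next = n.next;
                return;
            }
        }
    }
}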

When separate chaining is used, we still do not want the load factor to be too high, but as long as it is a small constant the hash operations should perform quite well, O(1) in the average case. Note that the worst case is still O(N), even with separate chaining. This is because there is still the chance that all (or most) keys could be hashed to the same location, degenerating our hash table into an unsorted linked list! However, with a good hash function the chances of the worst case actually occurring are very low.

Conclusions

If implemented properly, a hash table can give us O(1) Find performance in the average case, which is much faster than any of the other searching algorithms we have considered. However, despite this speed advantage for Find, it is not always the method of choice. For example, we often want to be able to access our data in sorted order (alphabetically, numerically, etc.). A sorted array or a binary search tree provides this ability with little cost -- we could list the data in order in O(N) time for either. However, remember that what makes a hash table effective is the fact that there is NO ORDER to the data in the table. Thus, we would have to do a full-fledged sort of the data, requiring O(N log N) time (after first copying all of the data from the table to a second array). Yet if our dominant operation is Find, it is hard to do better than hashing.