Oct 23: Hash tables


Code discussed in lecture

Word Reductions

A word reduction is a series of words where each word is derived by deleting a single letter of the word before it, until only a single letter is left. An example would be "yeasty", "yeast", "east", "eat", "at", "a". Any word that starts such a sequence is called reducible.

We might like to know all of the reducible words in English, along with the longest such word. The program WordReduction.java answers these questions. It also demonstrates the use of sets, string manipulation, and file I/O, so it is a good example for those reasons as well.

How might we go about solving this problem? Let's call a reducible word a "good" word. If we knew all of the good words shorter than our current word, we could try dropping each letter in the current word and seeing if it is one of the shorter good words. If so, we have found a new good word.

This program reads in all of the words in an online dictionary. The one that I use comes with the Mac; there are many others online. Because we want to process the words in increasing order of length, we use a list of sets, where the ith list item contains a set of words of length i+1.

We open the file using a FileReader and then use a Scanner to make it easy to process the words. The words in this dictionary appear one per line, so we read a line, map the word to lower case, get its length, and then add it to the appropriate set.
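Here is a sketch of that reading loop. The file path and the variable names are my stand-ins, not necessarily what WordReduction.java uses:

import java.io.FileReader;
import java.io.IOException;
import java.util.*;

public class ReadWords {
  public static void main(String[] args) throws IOException {
    List<Set<String>> wordSets = new ArrayList<Set<String>>();
    Scanner input = new Scanner(new FileReader("/usr/share/dict/words"));
    while (input.hasNextLine()) {
      String word = input.nextLine().toLowerCase();
      if (word.isEmpty())
        continue;                                  // skip blank lines
      while (wordSets.size() < word.length())      // grow the list so index length-1 exists
        wordSets.add(new HashSet<String>());
      wordSets.get(word.length() - 1).add(word);   // the set at index i holds words of length i+1
    }
    input.close();
  }
}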

The method dropLetter demonstrates the use of the substring method on strings. The string positions of string str are indexed from 0 to str.length()-1, just like arrays. We can use charAt to select a particular character (which we did in several of the testing programs). To take a substring we give the start index and the index one beyond the final index. Therefore str.substring(1,2) has length 1 and str.substring(1,1) is the empty string.
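A dropLetter along these lines (my reconstruction, not necessarily the lecture code verbatim) glues together the pieces before and after position i:

// Return str with the character at position i removed.
public static String dropLetter(String str, int i) {
  return str.substring(0, i) + str.substring(i + 1, str.length());
}

For example, dropLetter("eat", 0) is "at" and dropLetter("eat", 2) is "ea".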

We then process the strings in increasing order of length. We add "i" and "a" to the set goodWords of good words, and then work our way up from there (see the main method). We also go through and print the good words and the words that they derive from. Finally, we have a recursive method printReduction for printing out a reduction for any given word.
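The core of that main loop might look like this sketch, using the wordSets and dropLetter from above (the real program also records which shorter word each good word derives from, so that printReduction can trace a reduction back down to a single letter):

Set<String> goodWords = new HashSet<String>();
goodWords.add("a");                                  // the base cases
goodWords.add("i");
for (int len = 2; len <= wordSets.size(); len++) {
  for (String word : wordSets.get(len - 1)) {        // all words of length len
    for (int i = 0; i < word.length(); i++) {
      if (goodWords.contains(dropLetter(word, i))) { // dropping letter i leaves a good word
        goodWords.add(word);
        break;                                       // one derivation is enough
      }
    }
  }
}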

Hashing

HashSet and HashMap are stored using a data structure called a hash table. The idea can be illustrated by how the Sears catalog store in Lebanon used to keep track of catalog orders to be picked up. They had 100 pigeonholes behind the order desk, numbered 0 to 99. When customers arrived the clerk would ask them for the last two digits of their phone numbers. Given a number he would look in the corresponding pigeonhole for the order form. Even if there were several hundred order forms in the system each pigeonhole contained only a few forms, and he would search through these. This effectively split a set of hundreds of items into 100 sets of a few items each. The trick was to find a hash function that would assign each order form to a particular pigeonhole in a way that spread the order forms fairly evenly among them. The last two digits of the customer's phone number worked well for this purpose. (Why would the first two digits be a bad choice?)

We can take a similar approach inside the computer. For a map, we use the key of the (key, value) pair to figure out where to look; for a set we use the object itself. We need two things: a way to organize the table of items, and a hash function that maps each key to a position in the table.

We will discuss how to organize the table first. The easiest way to create a table is to have an array of list structures, called a bucket array. Usually these are linked lists, to make it easy to add or delete items. For this reason the approach is called separate chaining. This is very similar to what Sears did: if multiple items hash to the same place in the table they all go into that slot's list.
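Here is a minimal separate-chaining sketch, just to make the picture concrete. It is not how java.util actually implements things:

import java.util.*;

public class ChainedHashSet<E> {
  private List<LinkedList<E>> buckets;               // the bucket array

  public ChainedHashSet(int numBuckets) {
    buckets = new ArrayList<LinkedList<E>>();
    for (int i = 0; i < numBuckets; i++)
      buckets.add(new LinkedList<E>());
  }

  private int bucketFor(E item) {                    // compression: hashcode mod table size
    return Math.abs(item.hashCode() % buckets.size());
  }

  public boolean contains(E item) {                  // search only the one chain
    return buckets.get(bucketFor(item)).contains(item);
  }

  public void add(E item) {
    if (!contains(item))
      buckets.get(bucketFor(item)).add(item);
  }
}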

How many items do we expect to look at when searching for an item? For unsuccessful search (it wasn't in the map or set), we would look at everything in the list. But how many things is that? If the table has N slots, there are n items stored in it, and the hash function spreads things evenly, that is n/N items per slot on average. This ratio is called α, the load factor of the table.

For successful search we know we will find the item, and on average we will go through half the list before we do so. So successful search takes about 1 + α/2 comparisons and unsuccessful search takes α comparisons, on average. If N = Θ(n), then this is a constant. (Demonstrate on a small table, maybe 11 long, and some data items.)

That sounds great, but if you had the bad luck to have everything hash into the same slot then the time required would be Θ(n). That cannot be avoided. If an adversary puts n*N items into the table then one of the slots will have at least n items in it. He or she then makes those n items the data for the problem that you are dealing with, and you are stuck. (There is an idea called universal hashing, which basically computes a different hash code every time you run the program, so that data that is slow one time might be fast the next.)

Note that if n gets much larger than N then search time goes up. How can we avoid this? The same way that we do for arraylists. When the table gets too full we create a new one about double the size and re-hash everything into the new table. What is "too full"? Java implementations typically start the table with size 11 and double the table size when α gets bigger than 0.75.
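Continuing the ChainedHashSet sketch, the resizing might look like this (threshold 0.75 as above; this add replaces the earlier version):

private int size = 0;                                // n, the number of items stored

public void add(E item) {
  if (contains(item)) return;
  buckets.get(bucketFor(item)).add(item);
  size++;
  if ((double) size / buckets.size() > 0.75)         // load factor α got too big
    rehash();
}

private void rehash() {
  List<LinkedList<E>> old = buckets;
  buckets = new ArrayList<LinkedList<E>>();
  for (int i = 0; i < 2 * old.size(); i++)           // about double the table size
    buckets.add(new LinkedList<E>());
  for (LinkedList<E> chain : old)                    // re-hash every item into its new bucket
    for (E item : chain)
      buckets.get(bucketFor(item)).add(item);
}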

The only problem with this is that the table uses a lot of space. We need references to the lists, and the lists have links or possibly empty places in the arraylists. This can be a problem on an embedded system or handheld device. An alternate approach is to save the data directly into the array and to do away with the lists. This is called open addressing. As long as we don't have 2 items hash into the same slot we are fine. Surely we can do this.

Actually, we can't. Birthday "paradox". (Demonstrate in class.) The probability that two people in a group of 23 have the same birthday is about a half. "Randomly" distributed keys don't do what you might expect. If you "randomly" distribute n keys into a table of size n, about n/e of the slots are empty (where e = 2.71..., the base of the natural logarithm). So over a third of the slots are expected to be empty! The expected length of the longest list is O(log n / log log n). (To see how to compute these take CS 30 or come by office hours.)

So how do we handle collisions? Easy way - linear probing. Just start marching along the table until you find an empty spot. Then when searching for an item you start at the spot that it hashes into, and go until you find the item or an empty place in the table. (Demonstrate.)
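A linear-probing sketch, with no deletion yet and assuming the table never fills up (real code would resize):

public class LinearProbingSet<E> {
  private Object[] table = new Object[11];

  private int home(E item) {                         // the slot the item hashes into
    return Math.abs(item.hashCode() % table.length);
  }

  public void add(E item) {
    int i = home(item);
    while (table[i] != null && !table[i].equals(item))
      i = (i + 1) % table.length;                    // march along, wrapping around
    table[i] = item;
  }

  public boolean contains(E item) {
    for (int i = home(item); table[i] != null; i = (i + 1) % table.length)
      if (table[i].equals(item))
        return true;
    return false;                                    // hit an empty spot: it isn't here
  }
}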

Problem - the odds of hashing into the empty spot after a block of k full spots is (k+1)/N. So the long blocks are likely to get longer. What is worse - blocks coalesce. When a block grows into another block, the two become one huge block. The result is that unsuccessful search time (meaning that you must keep looking until you find an empty spot) grows very rapidly as α gets close to 1. For these open addressing approaches you want to make sure that you keep α less than 0.5, so at least half the table is empty.

There is another problem - deletions. Suppose you just delete the item and mark its spot as empty. Then a search for an item whose hash value is to the left of the deleted item, but which is stored to the right of it (because of collisions and linear probing), will never find it: the search sees the empty spot and quits. We need an AVAILABLE marker that allows insertion of new items but does not stop a search. And then when you compute the value of α, your n is not the number of items currently in the table; it is the number of items plus the number of AVAILABLE markers.
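In the sketch above, deletion with such a marker might look like this. The marker is just a sentinel object that no real item equals:

private static final Object AVAILABLE = new Object(); // tombstone

public void remove(E item) {
  for (int i = home(item); table[i] != null; i = (i + 1) % table.length)
    if (table[i].equals(item)) {
      table[i] = AVAILABLE;                           // leave the probe chain unbroken
      return;
    }
}

Note that contains already probes past AVAILABLE (it is non-null and equal to nothing), while add could be improved to reuse the first AVAILABLE slot it passes.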

To avoid the problem of clustering the best thing to do is double hashing. The first hash function tells you where to start looking; the second tells you how big a step to take (between 1 and N-1). For this to work the table size N has to be prime, so that every step size is relatively prime to N and the probe sequence visits every slot. Double hashing does a good job of eliminating clustering. The book discusses it.
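The probe loop changes so the step size comes from the second hash function. Here is a sketch, where the step computation is a stand-in for whatever second hash you choose (one common choice is q - (k mod q) for some prime q < N):

public void add(E item) {
  int i = home(item);
  int step = 1 + (item.hashCode() & 0x7fffffff) % (table.length - 1); // in 1..N-1
  while (table[i] != null && !table[i].equals(item))
    i = (i + step) % table.length;         // N prime: every step size visits all slots
  table[i] = item;
}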

Computing Hash Functions

The problem now is how to compute the mapping from keys to spots in a table. This mapping is what the hash function does. A good hash function has three desirable properties: equal keys always get the same hash value, the value is fast to compute, and the keys are spread evenly across the table.

Java computes hash functions in two steps. The first is to have every object have a hashCode() method. This returns an int. The default is often the address of the object in memory. Then we have to map this integer into an entry in the table. This is called the compression function.

A simple compression function is to take the hashcode mod the table size. This works well when the table size is a prime, but not so well otherwise. In that case the book suggests computing [(a*i + b) mod p] mod N, where i is the hashcode, p is a prime > N, and a and b are chosen at random between 1 and p-1 (b may also be 0). But this is taken care of in the implementation of Java's library.
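As a sketch, with made-up constants (a real implementation would pick a and b at random):

static int compress(int hashcode, int N) {
  final long p = 2147483647L;                  // a prime > N (here 2^31 - 1)
  final long a = 1234577, b = 89;              // illustrative; choose at random in 1..p-1
  long t = ((a * hashcode + b) % p + p) % p;   // the extra +p fixes Java's negative mod
  return (int) (t % N);
}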

Java gives us the responsibility to come up with a good hashcode. In particular, if you define equals for an object it is very important that you override hashCode so that two items considered equal have the same hashcode. This is because if you insert an item into the table and then look it up using something equal to it, you expect to find it. If the hashcodes are different you won't: you will be looking in the wrong bucket.
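For example, with a made-up Point class, equals and hashCode might be paired like this:

public class Point {
  private final int x, y;

  public Point(int x, int y) { this.x = x; this.y = y; }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof Point)) return false;
    Point p = (Point) other;
    return x == p.x && y == p.y;
  }

  @Override
  public int hashCode() {
    return 37 * x + y;                         // equal points get equal hashcodes
  }
}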

So how can you compute hash codes, particularly for composite objects (ones that have several instance variables, or are strings or arrays or ...)? A simple option is to sum all of the pieces, or sum the hashcodes of the pieces. However, this means that shuffling the order (say of a string) does not change the hashcode. It probably should! A better idea for computing the hash code of what we can view as a tuple (x_0, x_1, ..., x_{k-1}) is to pick a positive number a (book suggests that 33, 37, 39, and 41 are good choices) and compute the polynomial x_0*a^(k-1) + x_1*a^(k-2) + ... + x_{k-1}. Book shows how to do this using Horner's rule, which basically means starting a running total at x_0 and, for j = 1 to k-1, multiplying by a and adding x_j.

@Override
public int hashCode() {    // capital C: this overrides Object's hashCode
  final int a = 37;        // the multiplier

  int sum = x[0];          // x[0..k-1] holds the object's components
  for (int i = 1; i < k; i++)
    sum = a * sum + x[i];  // Horner: multiply the running total by a, add the next piece

  return sum;
}

The book also shows a way to use cyclic shifts for this purpose. For the curious, there is a lot of literature on hashing, including Knuth's Art of Computer Programming, Vol. 3.
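As a taste of the cyclic-shift idea, here is one version for strings (the 5-bit shift is a common choice):

static int cyclicShiftHash(String s) {
  int h = 0;
  for (int i = 0; i < s.length(); i++) {
    h = (h << 5) | (h >>> 27);               // 5-bit cyclic shift of the running code
    h += s.charAt(i);                        // then fold in the next character
  }
  return h;
}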