Oct 30: Binary Search Trees
Code discussed in lecture
Ordered Maps and Dictionaries
Hash tables are fast (on average) and quite useful, but they have no
concept of order. If you iterate through a HashSet you get items in the
order that they appeared in the hash table, which is fairly random.
However, a TreeSet will return the elements in the set in increasing
order by compareTo. There are cases where it is very useful to be able
to do other operations on a map that depend on the ordering of the keys.
Such a map is called an ordered map.
The book suggests the following additional operations. (In the book the
first two took a key as a parameter. I think that this is a typo.)
- firstEntry(): return the entry with the smallest key value.
- lastEntry(): return the entry with the largest key value.
- ceilingEntry(k): return the entry with the least key greater than or equal to k.
- floorEntry(k): return the entry with the greatest key less than or equal to k.
- lowerEntry(k): return the entry with the greatest key less than k.
- higherEntry(k): return the entry with the least key strictly greater than k.
In each case above, if no such entry exists the method returns null.
Note that this gives a way to iterate through the map in order. Start with firstEntry(). Take the key k out of the entry and call higherEntry(k). Repeat. To iterate in reverse order use lastEntry and lowerEntry.
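Java's TreeMap provides all six of these operations (it implements the NavigableMap interface), so we can try the iteration idea directly. A short sketch; the class and method names here are just for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderedMapDemo {
    // Collect the keys in increasing order using only firstEntry/higherEntry,
    // exactly as described above.
    public static List<Integer> inOrderKeys(TreeMap<Integer, String> map) {
        List<Integer> keys = new ArrayList<>();
        for (Map.Entry<Integer, String> e = map.firstEntry(); e != null;
             e = map.higherEntry(e.getKey())) {
            keys.add(e.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(30, "thirty");
        map.put(10, "ten");
        map.put(20, "twenty");

        System.out.println(map.ceilingEntry(20)); // key >= 20, so 20=twenty
        System.out.println(map.higherEntry(20));  // key strictly > 20, so 30=thirty
        System.out.println(inOrderKeys(map));     // [10, 20, 30]
    }
}
```

Note the difference between ceilingEntry and higherEntry when the key is present in the map.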
How do we represent an ordered map? Two obvious choices are a sorted array (or ArrayList)
and a sorted doubly linked list. Either implements firstEntry and lastEntry in Θ(1) time.
With the sorted array you have a fast get via binary search.
(If you have not seen binary search let me know!) Unfortunately, put and remove
are O(n) time, because you need to move elements after the desired position
to make room or close a hole. A doubly linked list has Θ(1) put and remove
(if you know where the element goes), but O(n) get. Actually, finding the place
to insert or delete will in general take O(n) time, also.
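To make the trade-off concrete, here is a sketch of the two sorted-array operations; the method names are just for illustration. The search halves the range each step (O(log n)), while the insert has to copy everything after the new element's position (O(n)):

```java
import java.util.Arrays;

public class SortedArraySketch {
    // Binary search: return the index of key in sorted array a, or -1 if absent.
    // Each iteration halves the search range, so this is O(log n).
    public static int binarySearch(int[] a, int key) {
        int low = 0, high = a.length - 1;
        while (low <= high) {
            int mid = (low + high) / 2;  // fine for classroom-sized arrays
            if (a[mid] == key) return mid;
            else if (a[mid] < key) low = mid + 1;
            else high = mid - 1;
        }
        return -1;
    }

    // Insert key into sorted array a, returning a new array.
    // Every element after the insertion point must move: O(n).
    public static int[] insertSorted(int[] a, int key) {
        int[] b = new int[a.length + 1];
        int i = 0;
        while (i < a.length && a[i] < key) { b[i] = a[i]; i++; }  // copy smaller elements
        b[i] = key;                                               // place the new key
        while (i < a.length) { b[i + 1] = a[i]; i++; }            // shift the rest right
        return b;
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40};
        System.out.println(binarySearch(a, 30));                 // 2
        System.out.println(Arrays.toString(insertSorted(a, 25))); // [10, 20, 25, 30, 40]
    }
}
```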
An obvious question is whether there is a way to combine the advantages of both data structures. The book gives two ways to do this. The first is skip lists, which we will come back to later in the term. The other is binary search trees, which we will cover next.
Dictionaries
The book makes a distinction between a map and a dictionary. Unfortunately many programming
languages (including Python) and textbooks call what we call a map a dictionary. But by our
book's definition a dictionary can have multiple instances of the same key. Then get
returns some value paired with a key, getAll returns an iterable collection of all the
values paired with a key, and remove removes one of the (key, value) pairs.
They note that Java does not provide an implementation of this ADT.
There is a good reason why Java does not supply an implementation. The usual thing to
do is what the book does when it implements the Dictionary ADT in Java. Instead of a Map pairing a
key with a single value, use a Map that pairs a key with a list or set of values. Then
dictionaryPut(key, value) checks to see if key is in the map. If not, it puts the key paired
with a singleton list containing value. If the key is in the map, it adds value to that key's
list. This should be familiar to you from the short assignment where the keys are state names
and the values are sets of cities within the corresponding state.
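A minimal sketch of this approach, using a set of values per key as in the short assignment (the class name is just for illustration; dictionaryPut and getAll follow the ADT names above):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MultiMapSketch {
    // Each key maps to the set of all values that have been put with that key.
    private final Map<String, Set<String>> map = new HashMap<>();

    // If key is absent, create a new singleton set for it; either way, add value.
    public void dictionaryPut(String key, String value) {
        map.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }

    // Return all values paired with key (an empty set if there are none).
    public Set<String> getAll(String key) {
        return map.getOrDefault(key, Set.of());
    }
}
```

computeIfAbsent does the "check, create if missing" step in one call, which avoids the explicit containsKey test.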
Binary search trees
A binary search tree (BST) is a data structure that combines the best features of a sorted array and a linked list. If we have our whole list, we can construct a tree that branches the same way that binary search does. Start with the middle element. Put it at the root of the tree. Pick the middle element of the left half and make it the left child of the root. Similarly the middle element of the right half is the right child of the root. (Write a list, demonstrate.)
We can now do binary search by following references. Compare the key that we are looking for with the root element. If they are equal we are done. If the key is less than the root recursively search the left subtree. (We don't have to do this recursively - we could just keep a pointer to which element we were at and change it to reference the left subtree.) If it is greater search in the right subtree. We will find it or reach an external node.
The book has a somewhat complex binary search tree program, so we will look at a simpler one:
BST.java. Look at the find method.
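BST.java is not reproduced here, but the search idea can be sketched in a few lines. This version uses null in place of explicit external (Leaf) nodes, which is an assumption of the sketch rather than how the lecture code is organized:

```java
public class BSTFindSketch {
    // A node holds a key, a value, and references to two children.
    // Here null plays the role of an external (Leaf) node.
    static class Node {
        int key;
        String value;
        Node left, right;
        Node(int key, String value) { this.key = key; this.value = value; }
    }

    // Follow references exactly as binary search would: equal means found,
    // smaller means go left, larger means go right. O(height of tree).
    static String find(Node root, int key) {
        Node cur = root;
        while (cur != null) {
            if (key == cur.key) return cur.value;
            cur = (key < cur.key) ? cur.left : cur.right;
        }
        return null;  // fell off the tree, so the key is absent
    }
}
```

Note that this is the iterative version mentioned above: instead of recursing, we keep one reference and move it down the tree.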
We can also easily modify a BST on the fly (compare to inserting
into a sorted array for binary search). Inserting is very similar to
searching. If we find the key, we replace the value with the new
one. If we reach a leaf, we add the new element there. (That is where it belongs - any search
for it in the future will follow the same path through the tree and will find it.)
We insert the element in the leaf and add two external nodes below it. Look at insert.
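A recursive sketch of insertion, again using null for external nodes rather than the explicit Leaf objects in the lecture code. Returning the (possibly new) subtree root lets the parent relink its child in one line:

```java
public class BSTInsertSketch {
    static class Node {
        int key;
        String value;
        Node left, right;
        Node(int key, String value) { this.key = key; this.value = value; }
    }

    // Insert by searching for the key. If found, replace the value;
    // otherwise hang a new node exactly where the search fell off the tree,
    // so any future search for this key will find it.
    static Node insert(Node root, int key, String value) {
        if (root == null) return new Node(key, value);  // the spot where key belongs
        if (key == root.key) root.value = value;        // key present: replace value
        else if (key < root.key) root.left = insert(root.left, key, value);
        else root.right = insert(root.right, key, value);
        return root;
    }
}
```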
Also, note the technique of inserting debugging code that can be turned on and off
by setting a constant to true or false.
Deletion is a bit trickier. Find the node to delete. It is easy if either child is a Leaf:
point the parent at the other child (which may also be a Leaf). This preserves the
order. If there are two children, we have a problem: we can't point the parent at both,
and re-inserting one whole subtree leads to unbalanced trees. Solution: replace the
node that has two children by its immediate predecessor or successor in the
tree. (Thus we get the same answer as in the original tree when searching for any
item.) The immediate successor is the smallest thing in the
right subtree. Find the smallest by going left until the left child is a
Leaf. (How does this relate to finding the next item in inorder on the
sample exam?) But that means that this node has a Leaf as a child, so it can be
easily deleted. The delete method does this.
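The successor-replacement idea can be sketched as follows (keys only, null for Leaf nodes; a sketch of the technique, not the lecture's delete method):

```java
public class BSTDeleteSketch {
    static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) root.left = insert(root.left, key);
        else if (key > root.key) root.right = insert(root.right, key);
        return root;
    }

    // Delete key from the subtree rooted at root; return the new subtree root,
    // so the parent can point at the result.
    static Node delete(Node root, int key) {
        if (root == null) return null;                  // key not found
        if (key < root.key) root.left = delete(root.left, key);
        else if (key > root.key) root.right = delete(root.right, key);
        else if (root.left == null) return root.right;  // at most one real child:
        else if (root.right == null) return root.left;  //   point parent at the other
        else {
            // Two children: copy in the immediate successor (smallest key in
            // the right subtree), then delete that node. Since the successor
            // has no left child, that second deletion hits an easy case.
            Node succ = root.right;
            while (succ.left != null) succ = succ.left;
            root.key = succ.key;
            root.right = delete(root.right, succ.key);
        }
        return root;
    }
}
```

Deleting the root of {20, 10, 30, 25, 40} replaces 20 by its successor 25, and the tree is still a valid BST.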
So what is the time required for all of these operations? They all are O(height of the tree), because they follow a single root-to-leaf path (or quit early in the case of a successful search). But what is the height of the tree?
In general, a bunch of inserts and deletes can lead to an unbalanced tree — more nodes along one side than the other. In the worst case, we end up with a linear structure (a list represented as a tree) and get a vine instead of a tree. In that case the tree has height Θ(n). But if we have random insertion order the height of the tree is expected to be Θ(log n). Proving this is something that gets done in CS 30 or 31.
Sometimes we don't want to risk a bad run time when the data is not random. (Especially because a bad case is when the data is inserted in increasing or decreasing order!) To ensure balance requires more bookkeeping and occasional fix-ups; later we will see a type of BST called a red-black tree that handles that.