Oct 30: Binary Search Trees
Code discussed in lecture
Ordered Maps and Dictionaries
Hash tables are fast (on average) and quite useful, but they have no
concept of order. If you iterate through a HashSet you get items in the
order that they appeared in the hash table, which is fairly random.
However, a TreeSet will return the elements in the set in increasing
order by compareTo. There are cases where it is very useful to be able
to do other operations on a map that depend on the ordering of the keys.
Such a map is called an ordered map.
The book suggests the following additional operations. (In the book the
first two took a key as a parameter. I think that this is a typo.)
- firstEntry(): return the entry with the smallest key value.
- lastEntry(): return the entry with the largest key value.
- ceilingEntry(k): return the entry with the least key greater than or equal to k.
- floorEntry(k): return the entry with the greatest key less than or equal to k.
- lowerEntry(k): return the entry with the greatest key less than k.
- higherEntry(k): return the entry with the least key strictly greater than k.
In each case above, if no such entry exists the method returns null.
Note that this gives a way to iterate through the map in order. Start with firstEntry(). Take the key k out of the entry and call higherEntry(k). Repeat. To iterate in reverse order use lastEntry and lowerEntry.
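Java's TreeMap provides all six of these operations (it implements the NavigableMap interface), so we can try the iteration idea directly. A short sketch; the class and method names here are just for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderedMapDemo {
    // Collect the keys in increasing order using only firstEntry/higherEntry,
    // exactly as described above.
    public static List<Integer> inOrderKeys(TreeMap<Integer, String> map) {
        List<Integer> keys = new ArrayList<>();
        for (Map.Entry<Integer, String> e = map.firstEntry(); e != null;
             e = map.higherEntry(e.getKey())) {
            keys.add(e.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(30, "thirty");
        map.put(10, "ten");
        map.put(20, "twenty");

        System.out.println(map.ceilingEntry(20)); // key >= 20, so 20=twenty
        System.out.println(map.higherEntry(20));  // key strictly > 20, so 30=thirty
        System.out.println(inOrderKeys(map));     // [10, 20, 30]
    }
}
```

Note the difference between ceilingEntry and higherEntry when the key is present in the map.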
How do we represent an ordered map? Two obvious choices are a sorted array (or ArrayList)
and a sorted doubly linked list. Either implements firstEntry and lastEntry in Θ(1) time.
With the sorted array you have a fast get via binary search.
(If you have not seen binary search let me know!) Unfortunately, put and remove
are O(n) time, because you need to move elements after the desired position
to make room or close a hole. A doubly linked list has Θ(1) put and remove
(if you know where the element goes), but O(n) get. Actually, finding the place
to insert or delete will in general take O(n) time, also.
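To make the trade-off concrete, here is a sketch of the two sorted-array operations; the method names are just for illustration. The search halves the range each step (O(log n)), while the insert has to copy everything after the new element's position (O(n)):

```java
import java.util.Arrays;

public class SortedArraySketch {
    // Binary search: return the index of key in sorted array a, or -1 if absent.
    // Each iteration halves the search range, so this is O(log n).
    public static int binarySearch(int[] a, int key) {
        int low = 0, high = a.length - 1;
        while (low <= high) {
            int mid = (low + high) / 2;  // fine for classroom-sized arrays
            if (a[mid] == key) return mid;
            else if (a[mid] < key) low = mid + 1;
            else high = mid - 1;
        }
        return -1;
    }

    // Insert key into sorted array a, returning a new array.
    // Every element after the insertion point must move: O(n).
    public static int[] insertSorted(int[] a, int key) {
        int[] b = new int[a.length + 1];
        int i = 0;
        while (i < a.length && a[i] < key) { b[i] = a[i]; i++; }  // copy smaller elements
        b[i] = key;                                               // place the new key
        while (i < a.length) { b[i + 1] = a[i]; i++; }            // shift the rest right
        return b;
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40};
        System.out.println(binarySearch(a, 30));                 // 2
        System.out.println(Arrays.toString(insertSorted(a, 25))); // [10, 20, 25, 30, 40]
    }
}
```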
An obvious question is whether there is a way to combine the advantages of both data structures. The book gives two ways to do this. The first is skip lists, which we will come back to later in the term. The other is binary search trees, which we will cover next.
Dictionaries
The book makes a distinction between a map and a dictionary. Unfortunately many programming
languages (including Python) and textbooks call what we call a map a dictionary. But by our
book's definition a dictionary can have multiple instances of the same key. Then get
returns some value paired with a key, getAll returns an iterable collection of all the
values paired with a key, and remove removes one of the (key, value) pairs.
They note that Java does not provide an implementation of this ADT.
There is a good reason why Java does not supply an implementation. The usual thing to
do is what the book does when it implements the Dictionary ADT in Java. Instead of a Map pairing a
key with a single value, use a Map that pairs a key with a list or set of values. Then
dictionaryPut(key, value) checks to see if key is in the map. If not, it puts the key paired
with a singleton list containing value. If the key is in the map, it adds value to that key's
list. This should be familiar to you from the short assignment where the keys are state names
and the values are sets of cities within the corresponding state.
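A minimal sketch of this approach, using a set of values per key as in the short assignment (the class name is just for illustration; dictionaryPut and getAll follow the ADT names above):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MultiMapSketch {
    // Each key maps to the set of all values that have been put with that key.
    private final Map<String, Set<String>> map = new HashMap<>();

    // If key is absent, create a new singleton set for it; either way, add value.
    public void dictionaryPut(String key, String value) {
        map.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }

    // Return all values paired with key (an empty set if there are none).
    public Set<String> getAll(String key) {
        return map.getOrDefault(key, Set.of());
    }
}
```

computeIfAbsent does the "check, create if missing" step in one call, which avoids the explicit containsKey test.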
Binary search trees
A binary search tree (BST) is a data structure that combines the best features of a sorted array and a linked list. If we have our whole list, we can construct a tree that branches the same way that binary search does. Start with the middle element. Put it at the root of the tree. Pick the middle element of the left half and make it the left child of the root. Similarly the middle element of the right half is the right child of the root. (Write a list, demonstrate.)
We can now do binary search by following references. Compare the key that we are looking for with the root element. If they are equal we are done. If the key is less than the root recursively search the left subtree. (We don't have to do this recursively - we could just keep a pointer to which element we were at and change it to reference the left subtree.) If it is greater search in the right subtree. We will find it or reach an external node.
The book has a somewhat complex binary search tree program, so we will look at a simpler one:
BST.java. Look at the find method.
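BST.java is not reproduced here, but the search idea can be sketched in a few lines. This version uses null in place of explicit external (Leaf) nodes, which is an assumption of the sketch rather than how the lecture code is organized:

```java
public class BSTFindSketch {
    // A node holds a key, a value, and references to two children.
    // Here null plays the role of an external (Leaf) node.
    static class Node {
        int key;
        String value;
        Node left, right;
        Node(int key, String value) { this.key = key; this.value = value; }
    }

    // Follow references exactly as binary search would: equal means found,
    // smaller means go left, larger means go right. O(height of tree).
    static String find(Node root, int key) {
        Node cur = root;
        while (cur != null) {
            if (key == cur.key) return cur.value;
            cur = (key < cur.key) ? cur.left : cur.right;
        }
        return null;  // fell off the tree, so the key is absent
    }
}
```

Note that this is the iterative version mentioned above: instead of recursing, we keep one reference and move it down the tree.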
We can also easily modify a BST on the fly (compare to inserting
into a sorted array for binary search). Inserting is very similar to
searching. If we find the key, we replace the value with the new
one. If we reach a leaf, we add the new element there. (That is where it belongs - any search
for it in the future will follow the same path through the tree and will find it.)
We insert the element in the leaf and add two external nodes below it. Look at insert.
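A recursive sketch of insertion, again using null for external nodes rather than the explicit Leaf objects in the lecture code. Returning the (possibly new) subtree root lets the parent relink its child in one line:

```java
public class BSTInsertSketch {
    static class Node {
        int key;
        String value;
        Node left, right;
        Node(int key, String value) { this.key = key; this.value = value; }
    }

    // Insert by searching for the key. If found, replace the value;
    // otherwise hang a new node exactly where the search fell off the tree,
    // so any future search for this key will find it.
    static Node insert(Node root, int key, String value) {
        if (root == null) return new Node(key, value);  // the spot where key belongs
        if (key == root.key) root.value = value;        // key present: replace value
        else if (key < root.key) root.left = insert(root.left, key, value);
        else root.right = insert(root.right, key, value);
        return root;
    }
}
```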
Also, note the technique of inserting debugging code that can be turned on and off
by setting a constant to true or false.
Deletion is a bit trickier. Find the node to delete. It is easy if either child is a Leaf:
point the parent at the other child (which may also be a Leaf). This preserves the
order. If there are two children, we have a problem: we can't point the parent at both,
and re-inserting one whole subtree leads to unbalanced trees. Solution: replace the
node that has two children by its immediate predecessor or successor in the
tree. (Thus we get the same answer as in the original tree when searching for any
item.) The immediate successor is the smallest thing in the
right subtree. Find the smallest by going left until the left child is a
Leaf. (How does this relate to finding the next item in inorder on the
sample exam?) But that means that this node has a Leaf as a child, so it can be
easily deleted. The delete method does this.
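The successor-replacement idea can be sketched as follows (keys only, null for Leaf nodes; a sketch of the technique, not the lecture's delete method):

```java
public class BSTDeleteSketch {
    static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) root.left = insert(root.left, key);
        else if (key > root.key) root.right = insert(root.right, key);
        return root;
    }

    // Delete key from the subtree rooted at root; return the new subtree root,
    // so the parent can point at the result.
    static Node delete(Node root, int key) {
        if (root == null) return null;                  // key not found
        if (key < root.key) root.left = delete(root.left, key);
        else if (key > root.key) root.right = delete(root.right, key);
        else if (root.left == null) return root.right;  // at most one real child:
        else if (root.right == null) return root.left;  //   point parent at the other
        else {
            // Two children: copy in the immediate successor (smallest key in
            // the right subtree), then delete that node. Since the successor
            // has no left child, that second deletion hits an easy case.
            Node succ = root.right;
            while (succ.left != null) succ = succ.left;
            root.key = succ.key;
            root.right = delete(root.right, succ.key);
        }
        return root;
    }
}
```

Deleting the root of {20, 10, 30, 25, 40} replaces 20 by its successor 25, and the tree is still a valid BST.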
So what is the time required for all of these operations? They all are O(height of the tree), because they follow a single root-to-leaf path (or quit early in the case of a successful search). But what is the height of the tree?
In general, a bunch of inserts and deletes can lead to an unbalanced tree — more nodes along one side than the other. In the worst case, we end up with a linear structure (a list represented as a tree) and get a vine instead of a tree. In that case the tree has height Θ(n). But if we have random insertion order the height of the tree is expected to be Θ(log n). Proving this is something that gets done in CS 30 or 31.
Sometimes we don't want to risk a bad run time when the data is not random. (Especially because a bad case is when the data is inserted in increasing or decreasing order!) To ensure balance requires more bookkeeping and occasional fix-ups; later we will see a type of BST called a red-black tree that handles that.