Nov 14 (x): Balanced BSTs


Balanced Binary Search Trees

We have seen binary trees, and in particular binary search trees. We have seen that insertion, search, deletion, and finding the max and min in a tree can all be done in time proportional to the height of the tree.

However, as we also saw, a binary tree can have height O(n) in the worst case. This occurs, for example, when all items are inserted in increasing order. (There are many other such configurations, too).
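A minimal demonstration of this (the class and method names here are hypothetical, just for illustration): feed sorted keys to a plain, unbalanced BST insert, and the "tree" degenerates into a chain of height n.

    // Naive BST insertion, fed keys in increasing order.
    class DegenerateDemo {
        static class Node {
            int key;
            Node left, right;
            Node(int key) { this.key = key; }
        }

        // Standard unbalanced BST insertion.
        static Node insert(Node root, int key) {
            if (root == null) return new Node(key);
            if (key < root.key) root.left = insert(root.left, key);
            else root.right = insert(root.right, key);
            return root;
        }

        static int height(Node root) {
            if (root == null) return 0;
            return 1 + Math.max(height(root.left), height(root.right));
        }

        public static void main(String[] args) {
            Node root = null;
            for (int k = 1; k <= 1000; k++) root = insert(root, k);   // sorted input
            System.out.println(height(root));   // prints 1000: every node has only a right child
        }
    }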

So, how do we ensure that this doesn't happen?

First idea: Each time you insert or delete, "fix up" the tree so that it is always as close to balanced as possible.

Problem: This isn't as easy as it looks. Consider this tree:

        50
       /  \
     30    70
    /  \   /
  20   40 60

This is balanced just fine. Now, insert 10. A perfectly balanced tree with these keys would look like this:

        40
       /  \
     20    60
    /  \  /  \
  10  30 50  70

Note that not a single parent-child relationship is preserved between the original tree and the new, balanced one. So it's possible that maintaining perfect balance could take O(n) time on each update! Well, maybe not every update, but every second one: consider deleting 70 and inserting 0, then deleting 60 and inserting -10, and so on. That's no good, since the whole point was to make the dictionary operations run fast.

Potential Solutions. Given a binary search tree, we could:

  1. Give up "binary" -- allow nodes to have varying numbers of children. Leads to 2-3-4 trees, B-trees.
  2. Give up "perfect" -- keep it "close" to perfectly balanced. Leads to AVL trees, Red-Black trees.

AVL trees keep the difference in height between the left and right subtrees of any node to be 0 or 1.

We will first look at 2-3-4 trees, and then relate them to a structure called red-black trees, which are easier to implement.

2-3-4 Trees

The book calls these (2,4) trees. Both names are used.

Intuition:

  1. We'll allow multiple keys to be stored at each node.
  2. A node will have one more child than it has keys:
       leftmost child -- all keys less than the first key
       next child     -- all keys between the first and second keys
       ... etc ...
       last child     -- all keys greater than the last key

          a         a:b        a:b:c
         / \       / | \      / / \ \
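To make the node structure concrete, here is one possible Java layout for a 2-3-4 node. This is only an illustrative sketch (the class and field names are made up, not the book's code):

    // One possible 2-3-4 node layout: room for up to 3 keys and 4 children.
    class Node234 {
        int numKeys;                          // 1, 2, or 3 (a 2-, 3-, or 4-node)
        int[] keys = new int[3];              // keys[0..numKeys-1], in sorted order
        Node234[] children = new Node234[4];  // subtree children[i] holds keys below keys[i]

        boolean isLeaf() { return children[0] == null; }
        boolean isFull() { return numKeys == 3; }     // i.e., a 4-node
    }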

Rules for 2-3-4 Trees

  1. Every node has either 2, 3, or 4 children (1, 2, or 3 keys).
  2. All the leaves of the tree (either external nodes or null pointers) are on the same level.

To insert, you first go as far down the tree as you can. So we build a tree by inserting 41, 38, 31:

      41   ->   38:41   ->   31:38:41
     /  \       / | \        / / \ \

In general, go down to a node with external children; if it is a 2- or 3-node, expand it to a 3- or 4-node, adding the new key in the right location.

If the node that you get down to is a 4-node, SPLIT it into two 2-nodes, promote the middle key to join the parent node (creating a new root if the current root is split), and then continue down into the correct 2-node. Note that promoting the middle key to the parent may add a fourth key to the parent, which then has to split, and so on.
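Here is a rough sketch of the split step, using the hypothetical Node234 layout above. It assumes the parent still has room for the promoted key -- which is guaranteed if full nodes are always split on the way down, exactly as in the later example where the full parent 12:19:38 is split before its child:

    class Tree234 {
        // Split the full (4-node) child at parent.children[i]: the middle
        // key moves up into the parent, and the two outer keys become
        // separate 2-nodes. Assumes parent is not itself full.
        static void split(Node234 parent, int i) {
            Node234 full = parent.children[i];

            // Right half of the old 4-node becomes a new 2-node.
            Node234 right = new Node234();
            right.numKeys = 1;
            right.keys[0] = full.keys[2];
            right.children[0] = full.children[2];
            right.children[1] = full.children[3];

            // Left half keeps only the smallest key, as a 2-node.
            full.numKeys = 1;
            full.children[2] = full.children[3] = null;

            // Shift the parent's keys and children to make room,
            // then promote the middle key.
            for (int j = parent.numKeys; j > i; j--) {
                parent.keys[j] = parent.keys[j - 1];
                parent.children[j + 1] = parent.children[j];
            }
            parent.keys[i] = full.keys[1];
            parent.children[i + 1] = right;
            parent.numKeys++;
        }
    }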

Now let's insert 12:

- No room in the 4-node, so split:

          38                               38
         /  \                             /  \
       31    41    now insert 12 -->   12:31   41
      /  \   / \                       / | \   / \

Now let's insert 19: no problem, it goes into the 3-node:

           38
          /  \
    12:19:31  41
    / / \ \   / \

But now if we insert 8:

- No room in the 4-node, so split, promoting 19:

         19:38                             19:38
        /  |  \                           /  |  \
      12   31  41   now insert 8 -->   8:12  31  41
     / \  / \  / \                     / | \ / \ / \

Now insert 15 (unproblematic), and 17 (one split):

        19:38                            19:38
       /  |  \                          /  |  \
    8:12   31  41    -- 15 -->    8:12:15   31  41    -- 17 -->
    / | \  / \ / \                / / \ \   / \ / \

        12:19:38
       /  |  |  \
      8 15:17 31  41
     /\  /|\  /\  /\

Now insert 33 and 35 (unproblematic):

        12:19:38                        12:19:38
       /  |  |  \                      /  /   \  \
      8 15:17 31  41                  8 15:17 31:33 41
     /\  /|\  /\  /\                 /\  /|\   /|\  /\

        12:19:38
       /  /    \     \
      8 15:17 31:33:35 41
     /\  /|\  / / \ \  /\

Now insert 20 (drives a double split):

- We want to split 31:33:35 so we can insert 20, but we cannot, because 12:19:38 is already a 4-node (full). So first, we must split that one:

               19
         /------+------\
       12               38              << after first split (root)
      /  \         /-----|-----\
     8  15:17  31:33:35         41
    /\  / | \  / / \ \          / \

               19
         /------+------\
       12              33:38            << after second split
      /  \            /  |  \
     8  15:17       31    35  41
    /\  / | \       /\    /\  /\

               19
         /------+------\
       12              33:38
      /  \            /  |  \
     8  15:17     20:31   35  41        << 20 finally inserted
    /\  / | \     / | \   /\  /\

Problems with 2-3-4 nodes

This is tricky to implement, mainly because of the variable-size nodes.

The big advantage binary nodes have is that they're simple and small; you always need at least TWO children anyway, so you can keep track of what's on either side of a particular key.

Around 1978, Guibas and Sedgewick came up with the following clever idea:

    2-nodes:     a     ->      a
                / \           / \

    3-nodes:    a:b    ->      a        OR        b
               / | \          / \\              // \
                                 b             a
                                / \           / \

    (Here a double // or \\ means that the child is RED.)

    4-nodes:   a:b:c   ->       b
               / / \ \        // \\
                             a     c
                            / \   / \

So, basically, we have encoded each 2-, 3-, or 4-node using only binary nodes, but we've painted each node with a "color", either red or black. We paint a node red if it's part of a "bigger" node. This is called a red-black tree.
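In code, this encoding costs only one extra bit per node. A minimal sketch of a red-black node in Java (illustrative names, not the book's code; the later sketches below reuse this class):

    class RBNode {
        static final boolean RED = true, BLACK = false;

        int key;
        boolean color;          // RED means: part of a "bigger" 2-3-4 node
        RBNode left, right;

        RBNode(int key, boolean color) {
            this.key = key;
            this.color = color;
        }

        // Empty trees (null links) are always considered BLACK.
        static boolean isRed(RBNode x) {
            return x != null && x.color == RED;
        }
    }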

Consider the tree above after inserting 33. It would look like:

        12:19:38
       /  /  \  \
      8 15:17 31:33 41          << 2-3-4 tree version
     /\  /|\   /|\  /\

             19
           //  \\
         12      38
        /  \    /  \
       8   17  31   41
      /\  // \ / \\ /\
         15       33
         /\       /\

So, now, one simple way to think about Red-Black trees is to just apply the 2-3-4 tree rules, bearing in mind this encoding.

Red-Black Properties

But, you can also think of a Red-Black tree as a structure separate from a 2-3-4 tree. The rules are:

  1. Every node is either RED or BLACK; by convention, the empty tree is a leaf, and is always considered BLACK.
  2. The root is black. If an action colors it red, change it back to black.
  3. For any node which is RED, both of its children are BLACK. (This means: no two consecutive red nodes along any path.)
  4. Every path from any node to a descendant leaf has the same number of black nodes.

Claim: The Red-Black properties (1-4) mean that the height of the tree is O(lg(n)) if there are n nodes in the tree.

Informal justification:

- Since every path from the root to a leaf has the same number of black nodes (by property 4), the shortest possible path is one which has NO red nodes in it.

- Suppose k is the number of black nodes along any path from the root to a leaf.

  Q: How many red nodes could there be?
  A: At most k. By property 3, anytime you get a red node, your next node must be black, and you can only do that k times before you run out of black nodes. So the LONGEST possible path is at most 2 times the length of the shortest, or

      h <= 2k

- It can be shown that if each path from the root to a leaf has k black nodes, there must be AT LEAST 2^k - 1 nodes in the tree. (We showed this when we looked at binary trees: 1 node at the root, 2 nodes at level 1, 4 nodes at level 2, etc. Add them up.)

- Since h <= 2k, i.e. k >= h/2, there must be at least 2^(h/2) - 1 nodes in the tree. If there are n nodes in the tree, that means

      n >= 2^(h/2) - 1

  Adding 1 to both sides:

      n + 1 >= 2^(h/2)

  ...and taking the log (base 2) of both sides:

      lg(n + 1) >= h/2,   or   h <= 2 lg(n + 1)

  which is O(lg(n)).

Inserting into a Red-Black Tree

As usual for a binary search tree, find the location where the new element goes, and insert it there. Initially, color it RED. This ensures that rules 1, 2 and 4 are preserved. But rule 3 might be violated (the new red node might have a red parent). There are several cases (assume we're inserting a key x); a code sketch of the fix-up operations appears after the cases below.

  1. A 2-node (black node with black children):

          |              |              |
          a     ==>      a      or      a       (becomes a 3-node)
         / \           // \            / \\
                      x                    x

     This is not problematic; no violation of rule 3.
  2. A 4-node (black node with red children):

          |
          a
        // \\
       b     c
      / \   / \      << x to be inserted at one of these locations

    Here, we have to "split" the node, promoting the middle element to a higher point in the tree. But that's easy: Just join (a) with its parent, and "unjoin" (b) and (c) from (a). This amounts to a "color flip", e.g.:

         ||
         a
        /  \
       b    c
      / \  / \\
               x     << wherever x is, it's okay

    We haven't changed the black height of anything; the new node is RED, and we just switched the order of red/black in the subtree. So we haven't broken the other properties, and now rule 3 is okay here. But we must make sure rule 3 hasn't been violated at a higher point in the tree, since (a) is now red itself! We'll come back to this.

  3. A 3-node (black root with one red child). This case is a bit trickier. It depends on where you are inserting:

          |                  |
          a                  a
        // \        or      / \\
       b    <3>          <3>    b
      / \                      / \
    <1> <2>                  <2> <1>

    If you are inserting at <3>, no problem; you're basically creating a 4-node, adding x at the front or the back. No color problem:

         |                   |
         a                   a
       // \\       or      // \\       << simple case, insert at <3>
      b     x              x     b
     / \   / \            / \   / \

    But if you insert at <1> or <2>, under the child, you get two red nodes in a row. So we have to fix that.

    Suppose you insert at position <1>. Then, you have two red nodes in a straight line:

         |                  |
         a                  a
       // \        or      / \\
      b    <3>          <3>    b
    // \                      / \\
   x   <2>                  <2>   x

    Note in this case that x < b and b < a. We can fix this by "rotating" the whole structure to move b up to the root of the structure, colored black, with x on one side (colored red) and a on the other (colored red):

         |                   |
         b                   b
       //  \\      or      //  \\
      x      a            a      x
     / \    / \          / \    / \
          <2> <3>      <3> <2>

    Note that we haven't changed the number of black nodes along any path by doing this, and we've fixed the double red. And the subtrees <2> and <3> are still ordered properly.

    This is called a single rotation.

    On the other hand, suppose you inserted at position <2>. Then you have two red nodes in a row, but in a "zig-zag" pattern, e.g.:

         |                  |
         a                  a
       // \        or      / \\
      b    <3>          <3>    b
     / \\              // \
   <1>   x            x   <1>

    Now, the "middle" element is x, so we basically want to move x up to the root, and have b and a on either side, both colored red:

         |                   |
         x                   x
       //  \\      or      //  \\
      b      a            a      b
     / \    / \          / \    / \
   <1>      <3>        <3>      <1>

    One way to think of this is to suppose that you do it as two single rotations -- once around b, and then around x. (Here it is shown for the left case only; the other one is symmetric.)

         |                |                 |
         a                a                 x
       // \             // \              //  \\
      b    <3>   ==>   x    <3>   ==>    b      a
     / \\            // \               / \    / \
   <1>   x          b                 <1>       <3>
                   / \
                 <1>

    This is called a double rotation. The resulting 4-node will be the same regardless of which 3-node version you started from.

    The book combines both of these cases into what they call a trinode restructuring. This captures the basic idea: you are turning the three nodes of an invalid 4-node into a nice, balanced, valid 4-node. This is the operation that prevents the tree from getting too vine-like. Note that everything under the bottom node moves up a level in the tree.

Go through all of the 2-3-4 trees and show how the operations act on red-black trees.
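To make the cases above concrete, here is a hedged sketch of the fix-up operations in Java: the color flip of case 2, and the single and double rotations of case 3. It reuses the RBNode class sketched earlier; the method names are illustrative, not the book's:

    class RBOps {
        // Case 2 (4-node): the color flip. (a) joins its parent's 2-3-4
        // node; (b) and (c) become separate 2-nodes. Only colors change.
        static void colorFlip(RBNode a) {
            a.color = RBNode.RED;
            a.left.color = RBNode.BLACK;
            a.right.color = RBNode.BLACK;
        }

        // Case 3, straight-line double red: single rotation. Moves b up
        // over a and returns the new subtree root, which the caller must
        // store back into the parent's child pointer.
        static RBNode rotateRight(RBNode a) {
            RBNode b = a.left;
            a.left = b.right;         // subtree <2> moves under a
            b.right = a;
            b.color = a.color;        // b takes over a's black color
            a.color = RBNode.RED;
            return b;
        }

        static RBNode rotateLeft(RBNode a) {    // mirror image
            RBNode b = a.right;
            a.right = b.left;
            b.left = a;
            b.color = a.color;
            a.color = RBNode.RED;
            return b;
        }

        // Case 3, zig-zag double red: double rotation -- rotate around b
        // to straighten the zig-zag, then around a to lift x to the top.
        static RBNode doubleRotateRight(RBNode a) {
            a.left = rotateLeft(a.left);
            return rotateRight(a);
        }
    }

Remember that after a color flip the newly red node may itself have a red parent, so the caller still has to keep checking on the way up, as noted in case 2.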

Tree Maintenance

Since a red-black tree is also a binary search tree, the time complexity of the `lookup' operation is O(h), which we just argued is O(lg(n)) in the worst case.
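Since lookup ignores the colors entirely, the code is just ordinary binary search tree search. A quick sketch, reusing the RBNode class from earlier:

    class RBLookup {
        // Standard BST search: O(h) comparisons, where h is the height.
        static RBNode lookup(RBNode root, int key) {
            RBNode x = root;
            while (x != null) {
                if (key == x.key) return x;
                x = (key < x.key) ? x.left : x.right;
            }
            return null;    // not found
        }
    }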

Similarly, insertion works like in a binary search tree, except that we also have to fix up the color properties. But, even in the worst case, you only have to fix colors along the path from the new node to the root, so that's only a constant factor of additional work.

Furthermore, it can be proved that you will never have to do more than one rotation or double-rotation to fix up the tree. All the other changes can be done with color-flips. So, insert is also O(lg(n)).

Deletions

Deleting from both these sorts of trees is a bit more tricky, but it doesn't increase the time complexity -- it can still be done in O(lg(n)) time, worst case. The additional complexity comes because there is one way of splitting nodes, but several ways of merging nodes together (which you sometimes have to do when deleting).

The first step in any deletion is to get the key to be deleted into a node with only external children (at the bottom of the tree). We do this basically the same way that we did in binary search tree deletion: swap it with its immediate predecessor or successor. You find the immediate predecessor in the same way as before: go into the left subtree and follow rightmost links until you get to a bottom node, and take out the rightmost key. Use it to replace the internal node to be deleted. The red-black tree equivalent is to do a normal binary tree deletion. That is, once we have a node with an external child, we point the parent of the node being deleted at its other child (the sibling of the external child).
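A small sketch of that first step in Java, reusing the RBNode class from earlier (the helper names are hypothetical):

    class DeleteStep {
        // One step left, then follow right links to the bottom: this finds
        // the node holding the immediate predecessor of x.key.
        static RBNode predecessorNode(RBNode x) {
            RBNode p = x.left;
            while (p.right != null) {
                p = p.right;
            }
            return p;
        }

        // First step of deletion when x has two internal children: move
        // the doomed key down to a bottom node by swapping keys.
        static void swapWithPredecessor(RBNode x) {
            RBNode p = predecessorNode(x);
            int tmp = x.key;
            x.key = p.key;
            p.key = tmp;      // the key to be deleted is now at the bottom
        }
    }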

If the node that you are removing a key from is a 3- or 4-node, you are done. In red-black trees this corresponds to removing a vertex that is red or whose child is red. In either case, color the child black and you are done. So to delete 10, replace it by 8:

        10                     8
       /  \                   / \
     7:8   12:13    ->       7   12:13
    / | \  / | \            / \  / | \

        10                     8
       /  \                   / \
      8    12       ->       7   12
    // \   / \\             / \  / \\
    7      13                    13
   / \    / \                   / \

The problem comes when deleting from a 2-node. Call this vertex v. This would leave a 1-node, which is not valid. In a red-black tree this creates a node that is double-black, to indicate that this path is short one black node: the edge into it counts as two black edges. The cases are:

1. If w is an adjacent sibling of v that is a 3- or 4-node, we adopt (steal?) a key from w and give it to v. To maintain the order we have to move a key from w up to their common parent and a key from the parent down to v. In a red-black tree this corresponds to a trinode restructuring or a slightly more complicated "adjustment". The book has the details. So to delete 7:

        8                      12
       / \                    /  \
      7   12:13      ->      8    13
     / \  / | \             / \   / \

        8                      12
       / \                    /  \
      7   12         ->      8    13
     / \  / \\              / \   / \
          13
         / \
2. If all of v's adjacent siblings are 2-nodes, we fuse v with a sibling. To maintain the order we need a key between them, which we pull down from the parent. If the parent is a 3- or 4-node this is easy. If not, we have passed the underflow up to the parent node, which then must be resolved at that level by adopting or fusing. The fusing operation can be passed up the tree to the root. If the root loses its only key, it disappears and the tree is one level shorter. In a red-black tree this results in a re-coloring that passes the double black up to the parent, where it must be resolved. Removing 3:

         12                  12
        /  \                /  \
       7    20      ->     _    20     ->      12:20
      / \   / \            |    / \           /  |  \
     3   9 15  30         7:9  15  30       7:9  15  30
    /\  /\ /\  /\        / | \ /\  /\      / | \ /\  /\

   (/// marks the double-black edge in the red-black version:)

         12                  12                  12
        /  \               ///  \               /  \\
       7    20      ->     7     20     ->     7     20
      / \   / \           / \\   / \          / \\   / \
     3   9 15  30            9  15  30           9  15  30
    /\  /\ /\  /\           /\  /\  /\          /\  /\  /\

The ideas are straightforward, but the implementation can be a bit messy. The book covers it in detail.

Synopsis

Binary search trees are a good data structure for implementing the Dictionary ADT, but their performance is best when they are kept in balance.

• Keeping "perfect" balance turns out to be too much work.
• But we can make it work if we either give up "binary" (have search trees with variable numbers of children), or give up "perfect" (allow trees to be "somewhat unbalanced").
• 2-3-4 trees: Use the first strategy.
  • All leaves are at the same level, all paths the same length.
  • But they are clumsy to implement due to variable-size nodes.
• Red-Black trees: Use the second strategy.
  • Encode 2-3-4 nodes as "mini-trees", with nodes colored red to indicate they are conjoined with their parent.
  • Use "rotations" and "color flips" to keep the tree in balance; this takes no more than O(lg(n)) time. Also, only one restructuring adjustment is done per insertion or deletion. This is important for some algorithms that you will see if you take CS 31.
• Doing this, we get guaranteed O(lg(n)) runtime for all the Dictionary operations -- insert, lookup, delete.

B-Trees

2-3-4 trees are a special case of multiway trees. A multiway tree of particular interest is the B-tree. A B-tree of order d has nodes with between d/2 (rounded up) and d children. So a 2-3-4 tree is a B-tree of order 4. The B-trees of even order are particularly nice, because insertions and deletions are handled in very similar ways to 2-3-4 trees, with splits and fusions.

The place where a B-tree is particularly useful is in storing a TreeMap of size n that is so large that it must be stored on disk. Searching through a regular binary tree stored on disk would require O(log n) comparisons, but it is quite probable that most of the time getting a child's key would require reading a new disk block. Disk reads are many orders of magnitude slower than memory accesses, and O(log n) disk reads could be time-consuming. What is especially galling is that a disk read brings in an entire disk block, usually between 512 bytes and 2048 bytes, and we use only one key of maybe 4 bytes out of that block.

An alternative is to store the TreeMap as a B-tree, whose order is the number of key/value pairs that will fit into a disk block. If the value is some sort of a reference to where the data is stored on disk, as is usually the case, instead of getting a single key per disk read we get hundreds of them. The depth of the B-tree of order d is still O(log n), but the base of the logarithm is d instead of 2. This is a case where constants matter, because disk reads are so slow compared to all of the other computation that takes place. Even maps with terabytes of keys can be stored in a few levels, so search takes only a few reads. (Note that if d is 100 and there are 10^12 keys, the number of levels of the tree will be 6, as opposed to over 40 in a regular binary tree. This could result in a speedup of a factor of 6 or so in the program, which is worth doing.)
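As a quick sanity check of that arithmetic (d = 100 and n = 10^12 are just the figures from the paragraph above):

    public class BTreeDepth {
        public static void main(String[] args) {
            double n = 1e12;                                    // 10^12 keys
            double binaryLevels = Math.log(n) / Math.log(2);    // ~39.9 levels
            double btreeLevels = Math.log(n) / Math.log(100);   // 6.0 levels
            System.out.printf("binary: %.1f levels, B-tree of order 100: %.1f levels%n",
                              binaryLevels, btreeLevels);
        }
    }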