PS-4, due Nov 4

In this assignment, you will use Huffman encoding to compress and decompress files. You will get practice with binary trees, priority queues, maps, and file input and output.

Important organizational notes

Note because of the midterm next Thursday you have an extra week for this assignment.

For this assignment, you are permitted to work with one other student currently in the class. You do not have to work with someone else, but you have the option of doing so. If you choose to work with a partner, you will both get the same grade on the assignment.

There will be no penalty, in terms of points, for working together on this assignment. Please make sure that both of you submit the code electronically. When you submit on Blackboard, be sure to state the name of your partner in the comments section.

You should weigh whether you will get more out of this assignment working alone or with someone else. The choice is up to you.

If you choose to work with someone else, pick your partner carefully. Make sure that there are times that you are both available to work together. If you frequent the lab and you notice someone else who often is there when you are, that person might be a good choice as your partner.

Background

Early in the development of computational technology computer memory was expensive. Therefore storing large files was discouraged, and being able to compress files to half or 60% of their original size was important. Memory is much cheaper now, but we are developing smaller hand-held computers for use in cellphones and other devices. On these small devices we are once again encountering storage bottlenecks. We also communicate over low-bandwidth channels like dialup, DSL, wireless, etc. Here a smaller file means a shorter download time. The proliferation of .zip files shows that file compression is still important today.

One of the earliest schemes for file compression was invented by Huffman. Instead of using 7 bits to encode each character, as ASCII does, it uses a variable-length encoding of characters. Frequently occurring characters get shorter code words than infrequently occurring ones. (A code word is the sequence of 0's and 1's used to encode the character.)

Huffman encoding gives the smallest possible fixed encoding of a file. A fixed encoding means that a given letter is represented by the same code wherever it appears in the file. Some more recent schemes that do better than Huffman's are adaptive, which means that how a letter is encoded changes as the file is processed.

Pages 579-581 of the textbook contain a description of how Huffman encoding works and even gives you psuedocode for how to build the Huffman code tree. Read those pages.

Huffman Encoding

A problem with variable-length encodings is figuring out where a code word ends. For instance, if we used '1' for 'E', '10' for 'F', '00' for 'X', and '01' for 'Y', we would be in trouble. When decoding 1001, we would be unsure whether it represents EXE or FY. Huffman Encoding removes this ambiguity by producing prefix-free codes. This means that for any given code word, adding bits to the end cannot produce another code word. Hence no code word is a prefix of another. When the computer observes a series of bits that corresponds to a code word, it knows that it cannot be part of a larger code word, and can safely interpret the series of bits as its corresponding character.

At this point the purpose and value or Huffman Encoding should be fairly clear. So how do we do it? The task is to generate a set of prefix-free codes whose lengths are inversely correlated with the frequency of the encoded character. There are two clever parts of the algorithm, the use of a binary tree to generate the codes, and the construction of the binary tree using a priority queue. Specifically we will construct a tree such that each character is a leaf and the path from the root to that character gives that character's code word, where a left child is interpreted as a 0 and a right child as a 1.

Generate a Frequency Table

The first step in creating the code for a given file is to learn what the character frequencies are. This is fairly simple. First we create a map with characters as keys and a frequencies as values. Then we read the file one character at a time. We add the character to the map if it is not there and increment the corresponding map value if it is there. At the end we will have a map that maps each character to the number of times that it appears in the file. (Note - in Java you will have to use appropriate wrapper classes to store the character and the frequency, because you cannot use primitives in a Map.)

Create Singleton Trees

We now create a size-1 (singleton) tree for each character. The root of each tree stores two values, a character and its frequency. We then add all these singleton trees to a priority queue. The priority queue is set up to return the tree with the lowest frequency when asked to remove.

Use the binary tree code we saw in class (BinaryTree.java) to implement the trees. Because you need to store two values and a tree node has only one data value you will have to create a class that stores a character and a frequency. This new class will be the data type for the tree. You should provide accessor methods as needed.

You can use whatever priority queue implementation you like. I found it easiest to use Java's PriorityQueue class, which implements a heap-based priority queue. Note that the things that you are comparing are the frequency counts in the root nodes of two trees. It is simplest to write a separate TreeComparator class that implements the Comparator interface. The only method that you need to implement is compare, which takes two tree nodes and returns -1, 0, or 1 depending on whether the first has a smaller frequency count, the counts are equal, or the second has the smaller frequency count. You pass a TreeComparator object as the second parameter of the PriorityQueue constructor. The first parameter is the initial size for the queue. Look up PriorityQueue in the Java documentation.

Tree Creation

Notice that all leaf paths in a tree are unique. We represent the paths as a sequence of bit values, '0' for following the left path to the left child and '1' for following the right path. This way we create a variable length bit code for each leaf-node character. Moreover, the bit sequence representing the path to one leaf can never be a prefix of a bit sequence of a path to another leaf. Thus, we get prefix-free codes, as desired!

In creating the tree we would like to have the lowest frequency characters be deepest in the tree, and hence have the longest bit codes, and the highest frequency characters be near the top of the tree. We use our priority queue to achieve this, as follows:

  1. We remove the two lowest frequency trees T1 and T2 from the priority queue.

  2. We create a new tree T by attaching T1 and T2 to a new root (i.e., we create a new root r and attach T1 and T2 as r's left subtree and right subtree, respectively),

  3. We assign to the new tree T a frequency that equals the sum of the frequencies of T1 and T2.

  4. We add the new tree T to the priority queue.

Notice that Steps 1-4 bring down the number of trees in the priority queue by one. We keep repeating the above four steps until only one tree remains in the priority queue. This tree is our "code tree".

We will not prove why this tree gives the most efficient codes for the characters, but this proof can be found online or in most algorithms textbooks. However, it should be intuitive that since we built up from the lowest frequency nodes, the lower a character's frequency, the deeper its leaf node will be in the tree.

Code Retrieval

The tree encodes all of the information about the code for each character, but given a character it is a bother to search through the tree to find it in a leaf and trace its path from the root. To make encoding fast we want a Map that pairs characters with their code words. That is, we want to pair each character with a string of '0's and '1's that describes the path from the root to that character.

You can construct the entire map during a single traversal of the tree. You just have to keep track of a "path so far" parameter as you do the traversal. I will let you work out the details. There are less efficient ways to solve the problem, but for full credit you need to produce the code map during a single traversal of the code tree.

Compression

To compress you will repeatedly read the next character in your text file, look up its code word in the code map that you computed above, and then write the sequence of 0's and 1's in that code word as bits to an output file. Reading and writing files is not hard, but is a bit detailed. This will be described below.

Decompression

Decompression is similar to compression, in that you read from one file and write to another. To decode you will run down the code tree until you get to a leaf. Start at the root and read the first bit from the compressed file. If it is a '0' go left and if it is a '1' go right. Repeatedly read a bit and go left or right until you reach a leaf. You have now decoded your first character. Get the character out of the data value of that leaf and write it to the output file. Then go back to the root and repeat the process. When there are no more bits to read, the decompression is completed. (Hopefully you just wrote out a character and returned to the root at this point. If not something is wrong.) You can compare your input file and decompressed file to verify that they are identical.

You may have noticed that we have cheated a bit - we kept the code tree that we used for compression around and later used it for decompression. In a practical compression application (e.g. zip) you will have one compress method and an independent decompress method. The code tree has to be somehow included in the file during compression so that it can be read and used for decompression. Doing this is extra credit (see below).

Reading and Writing Files

For compression you need to first read characters from a file and then write bits to a different file. For decompression you read bits from a file and write characters to a different file. Fortunately the Java library has classes that make it easy to read characters from a file and write characters to a file. Unfortunately the Java library does not have classes to read and write bits. To remedy this situation I have written classes to read and write bits and provided them to you. You don't have to read the code, although you might find it interesting to do so. The tricky part is that we can only read and write bytes to files, so if we have a number of bits that is not a multiple of 8 the last byte will be only partially full. How can we tell how may of those bits are useful and how many are garbage?

Reading and Writing Characters

For reading from a file I recommend that you use a BufferedReader. "Buffered" says that it reads a bunch of characters to a buffer and then doles them out one at a time as you ask for them, rather than going to the disk each time that you ask for a character. Disks always transfer a whole block of bytes every time they read or write, so the buffer speeds things up considerably.

To read a file start with the declaration and assignment:

BufferedReader input =  new BufferedReader(new FileReader(pathName));

where "pathName" is the full name of the file on your computer. (More on path names later.) This will open a file and save a buffer and information about the current position in the file within a BufferedReader object referenced by input. To read a character you call input.read(). The read method returns an int that holds the Unicode encoding of the character. You just have to cast it to be a char. When the file is empty read returns the value -1.

To write to a file, start with the declaration and assignment:

BufferedWriter output =  new BufferedWriter(new FileWriter(decompressedPathName));

where decompressedPathName is the name of the output file that you want to create. To write a character c you call output.write(c).

When you have finished reading or writing a file you should close it by calling close() (e.g. input.close() or output.close()). This frees up resources and cleans things up. If you don't close the output file the last buffer may not get written and the file can be left in an inconsistent state!

Reading and Writing Bits

I have supplied the classes BufferedBitReader.java and BufferedBitWriter.java. They are used in a manner very similar to the classes above. The way to open a reader and a writer is:

BufferedBitReader bitInput = new BufferedBitReader(compressedPathName);
BufferedBitWriter bitOutput = new BufferedBitWriter(compressedPathName);

To read call bitInput.readBit() and get back an int which equals 0 or 1 (or -1 when there are no more bits to read). To write call bitOutput.writeBit(bit), where bit is an int which had better be equal to 0 or 1.

There is also a close method for each of these classes. Call it when you are done reading or writing the file. If you do not close the output file I can guarantee that you will not be able to correctly read it later.

Handling Exceptions

All of the methods above may potentially throw an IOException. Because this is a checked exception you need to include try and catch blocks. Think carefully about where you want to catch errors. It may be easier to handle them in the main method than in a method that it calls.

You should also include finally blocks to close files.

Getting a File's Path Name

To get the complete path name of a file it is easiest to use a file chooser. The following method will return the path name of the file you choose from a standard file chooser window. The search starts in your home directory (so "/Users/gevorg" for me). That might be a good place to put your directory of test files!

 /**
   * Puts up a fileChooser and gets path name for file to be opened.
   * Returns an empty string if the user clicks "cancel".
   * @return path name of the file chosen	
   */
 public static String getFilePath() {
   //Create a file chooser
   JFileChooser fc = new JFileChooser();
    
   int returnVal = fc.showOpenDialog(null);
   if(returnVal == JFileChooser.APPROVE_OPTION)  {
     File file = fc.getSelectedFile();
     String pathName = file.getAbsolutePath();
     return pathName;
   }
   else
     return "";
  }

You will be dealing with three files: the original text file, the compressed text file, and the decompressed text file. I suggest that you get the original file path name from the chooser above and then generate the names of the other two by putting "_compressed" and "_decompressed" after the file name. So if your original file were "WarAndPeace.txt" the compressed file would be "WarAndPeace_compressed.txt" and the decompressed file would be "WarAndPeace_decompressed.txt". This makes it easier to keep track of the relationships between your various test files. (Look up the String class's substring method and use it to construct the new names.)

Testing

Create a small test file to help you debug your program. When you have your program working test it on two text files that I provide, USConstitution.txt and WarAndPeace.txt. Also create test files that will test boundary cases for your program.

Boundary cases are cases that somehow are "at the edge" of valid input that your program might have to handle differently than a general file. You should figure out what sorts of files might qualify. One case is an empty file! You should be able to compress and decompress it successfully.

Neither I nor the course staff is going to make a list of these cases for you or confirm that you have found them all. Figuring out what cases need to be tested is one of the things that we hope that you will learn during this course. If you are writing programs outside of course work you won't in general have anyone telling you what special cases you need to test.

For your convenience I have included both of these text files, BufferedBitReader.java, BufferedBitWriter.java, and BinaryTree.java it this zip file.

Extra Credit

You are now able to compress and decompress files using Huffman encoding. However, you used the same tree to both encode and decode. Normally the tree that you used to encode the file will not be around for you to use it to decode the file. For a practical system (similar to zip) the code tree has to be saved and reconstructed. Note that when you are decoding the frequencies do not matter, only the tree shape and the order of the characters at the leaves.

For extra credit implement a way to write out the code tree and then read it back in and regenerate it. You should write out the tree when you compress a file and read it in when you decompress a file. For the basic extra credit write a separate file to store the information needed to reconstruct the tree. For substantial additional extra credit write this information at the front of the file that contains the encoded characters, so that the file contains all of the information needed to decompress it.

When I say "write out the tree" I mean write out enough information to be able to regenerate the tree. One option is to use some sort of parenthesized notation that you then parse to reconstruct the tree. The trick of writing the tree in preorder and inorder could work, but you would need to generate unique names at the internal nodes that do not conflict with each other or with characters at the leaves. You could reconstruct the tree from frequency data, but you may have to be careful about the way that you deal with equal frequencies. I am sure that there are many other possibilities. You would like your representation to be compact, because the goal is to compress the file.

If you come up with other interesting extra credit ideas feel free to ask me if I will give extra credit for them.

Suggestions

This is not the kind of program where you can write out all the code, click the run button, select WarAndPeace.txt, and expect to be able to tell from the files printed where the bugs are in your program. Start with a short file (a couple of dozen characters), preferably one with a range of character frequencies. Use a System.out.println to print out the frequency map. (The Java Map implementations have a toString that gives reasonable output.) Then print the code tree. (You did override toString in your class that holds the character and its frequency, didn't you? The BinaryTree class supplies the rest of the toString to print an indented tree.) Print out the code map.

My sample solution currently has all of these print statements in it, commented out. I left them there because it is easier to uncomment them than to type them back in, and if something is not working or I am curious about something, I want to be able to get debugging output quickly and easily. Also, it is interesting to see the frequency numbers for WarAndPeace.txt and to see how much the code word length varies.

You can also put print statements to print out every character or bit read from files and written to files. However, it is a good idea to make sure these are commented out before you compress WarAndPeace.txt!

Acknowledgement

Some of the text for this assignment writeup is modified from a writeup by Delaney Granizo-Mackenzie

Blackboard Submission

To submit your work to Blackboard:
  1. Create the zip file of a folder that contains: (1) your java files, including any that I supplied, (2) the test files that you created and the output of your program run on them, and the results from USConstitution.txt. There should be two output files for each run, a "_compressed" file and a "_decompressed" file).

    Submit the zip file via Blackboard. In the "Comments" field, give the compressed size of WarAndPeace.txt, but do not include the output from WarAndPeace.txt.

    If you do extra credit make a separate zip file for it, the test files, the output of your program run on them.

    If you have a partner, each of you should submit via Blackboard. When you submit, make sure that in the comments section, you state the name of your partner.

  2. If you realize that you need to change some of your code after you've submitted, you can resubmit by following the above procedure. Once you start to resubmit, finish the process. If you abandon resubmitting in the middle of the process, Blackboard tends to show us that you did not submit. Furthermore, if you resubmit, include all of your .zip files, not just those that differ from the previous submission. We will look at only your last submission.

Grading rubric

Total of 125 points.

Correctness, Efficiency, and Elegance (80 points)

10Generates frequency map
10Creates a priority queue of singleton trees
15Builds the code tree
15Traverses the code tree to generate a Map from characters to code words
10Reads the input file, compresses it, and writes the compressed file.
15Reads the compressed file, decompresses it, and writes the decompressed file.
5Handles boundary cases correctly.

Structure (20 points)

10Good decomposition into objects and methods.
4Proper used of instance and local variable
2Instance variables are private
4Proper use of parameters

Style (15 points)

2Comments for classes
4Comments for methods (purpose, parameters, what is returned) in JavaDoc form.
3Good names for methods, variables, parameters
3Layout (blank lines, indentation, no line wraps, etc.)
3Other unspecified style issues

Testing (10 points)

5Output from USConstitution.txt and your small test case(s),
5Output from from your boundary case test(s).

Extra Credit

20Write code tree information to a separate file and reconstruct the tree
25Include tree information in the _compressed file (in addition to the part above).
?Other interesting additions