PS-1, due Oct 2

Before you do anything, please read the entire assignment, and think carefully about your approach. To borrow an epigram from the carpenter's trade, think twice, code once. You will have a much easier and more fun time with all the assignments in this class if you work that way.

For this assignment, you are to work alone. You may not share code with anyone else, nor may you look at anyone else's code. You may of course ask the course staff for help, and are encouraged to do so if you are stuck.

Overview

You again will add methods to Picture.java to transform images. You will first write a method that takes a list of colors and returns a version of the picture that uses only those colors. A pixel's color is replaced by the "nearest" color to it in the list.

Why might we want to do this? One reason is to get a "poster" effect. It is cheaper to print a poster with only a few colors of ink, so by reducing the number of colors we make the picture look like a poster. For example, the beach.jpg file looks like this:

A "posterized" version using 8 colors looks like this:

Of course, the quality of the image depends on the set of colors chosen. The wrong choice of colors will lead to an image that is very different from the original. How can we select a "good" list of colors?

You will implement a method called k-means that selects a good set of colors. The process will be described later.

There is a more practical reason for reducing the number of colors in an image: image compression. A standard color requires 24 bits = 3 bytes to represent it: 8 each for red, green, and blue color components, and we need one of these for each pixel in the image. However, if we have only 8 colors, we save them in a list of length 8. We can then represent a pixel's color using only 3 bits to store the position in the list that contains the desired color. The original image takes almost 8 times as much space as the posterized version. (A small amount of space is needed to store the color list.)

Size of images was very important when memories were small and expensive. The original Mac had 128 KB main memory and 400 KB floppy disks. The photo above is 640x480 pixels. If stored with 24 bits = 3 bytes of memory per pixel, it would require 921,600 bytes, or 900 KB. It would be too big to fit on two floppy disks! The posterized version would require less than 113 KB. This is part of the reason that the screen on the original Mac was black and white, requiring a single bit per pixel. Even after there were color Macs I remember getting asked to choose to whether images should be represented with "hundreds of colors" or "millions of colors."

Memory is cheaper today, but compression is still important when loading images in web pages over relatively slow connections. The original version would take 8 times as long to load as the posterized one.

Of course, there are other ways of compressing images. There are two types of compression, lossless and lossy. Lossless means that every pixel has the exact 24-bit value that had in the original image. The TIFF format is a lossless compression format. The beach picture requires about 592 KB using TIFF, so is about two thirds the size of a file saving all of the 24 bit colors directly.

Lossy compression lets the color values be "close". Posterizing is a lossy compression. With only 8 colors it is very lossy. But using 256 colors ("hundreds of colors" in old Mac parlance) reduces the size to a third of the original and produces a reasonably accurate image. This picture uses 251 colors.

beach rendered in 251 colors

A better method is JPEG, which is based on Fourier transforms and other compression techniques. The beach.jpg file only requires 72 KB, and is a much better representation of the original photo.

Implementing JPEG is not a reasonable first assignment for a second CS course. Implementing color reduction is.

Problem

Modify the image to use only colors from a given list

Add a method public Picture mapToColorList(ArrayList<Color> colors) to Picture.java. This returns a new picture where each pixel in the original picture is replace by the color in colors that is nearest to it. How should you compute "nearest"? Treat the two colors as points in three dimensions (red, green, and blue axes) and find the distance between them. The static method Pixel.colorDistance(Color color1, Color color2) performs this computation.

Run your method on an arrayList containing Color.red, Color.green, Color.blue, Color.cyan, Color.orange, Color.yellow, Color.black, and Color.white. It should not be a good representation of the image.

Try to find a set of colors that works better. Can you find a set of eight colors that looks as good as or better than the eight-color posterization given above? To do this, you display the picture p with p.explore() in Dr. Java, and then make a visual judgement of what eight colors to pick. As you know, when you mouse over the picture, the coordinate of the pixel under the mouse and the values of red, green, and blue for that pixel are displayed on the picture window. So, you can note down the RGB values for the eight colors that you think would be best to render the picture with, and then use these RGB values in your program to define the color list. (Only spend a few minutes doing this - the purpose is to let you realize that picking good colors is not as simple as it might appear. You will not lose points if you image is not very good, but it should be better than the one using the eight colors listed above.)

Generate a random color list

An alternate way of choosing a color list is randomly. Write a method to generate a random color list. Call mapToColorList with random lists of length 8 and 256.

To generate a random color you need to generate a random red value, a random green value, and a random blue value. Each should be an integer in the range 0-255. To do this you need a random number generator. One choice is to use Math.random(), which generates a number that is between 0 and 1 (including 0, not including 1). An alternative is to use the java.util.Random class, described on p. 111-112 of our textbook.

Use k-means to compute a good color list

Your mapToColorList method takes each pixel in the image, extracts its color, and finds the closest color in the color list. (I hope that you wrote a separate helper method to find the closest color!) We use this approach to group the pixels. For each color in the color list we can find all pixels in the image whose color is closer to that color than to any other color in the list. Actually, for our purposes here we don't care about the pixels themselves, only about the colors that they contain. If we view the colors as points in red-green-blue color space, then the colors of pixels assigned to a given color from the color list will form a cluster around or near that color.

Given a cluster of colors associated with a color, can we do better than assign all of them to the corresponding color from the list? In particular, is there a color that will better represent that cluster of colors? In the 8-color output that you were asked to generate above the sky ended up white. That was because white was the closest color to light blue. But light blue would have been a better choice for those sky pixels.

How could we discover this? We could take the centroid of all of the colors in the cluster. The centroid is obtained by finding the average red value, the average green value, and the average blue value and creating a color with these average values. That color is in some sense near the center of the cluster, so may be a better representative of the cluster than the original color. Therefore we can create a new color list out of the centroids of the clusters.

If we now call mapToColorList using this new list of colors, will we color each pixel with the centroid of the cluster that it was originally assigned to? Not necessarily. The centroid color may be further from some pixel colors than the original color was. They may now be closer to a different centroid, and would be colored that color.

Note that we are back to the problem that we started with. We have a new list of colors. We can re-cluster the pixel colors according to that list. But the centroid of each new cluster may be a better representative color than the centroid obtained from the old cluster. So we can repeat the process of re-clustering and computing the centroids.

When do we quit? When the clusters do not change. That means that the old list of centroids is identical to the new list of centroids. We can then use this list as the list to pass to mapToColorList.

This process is called k-means. Mean is another word for average, and k is the number of clusters formed (which should be equal to the number of colors in the original color list). This is how I got the eight colors used in picture shown below. I also used it to generate the 251 colors used in the other picture. (I started with 256 colors, but some clusters became empty during the process.)

You should write a method

public ArrayList<Color> computeColors(int number)

(where number is the desired number of clusters) that returns a list of that many colors computed using k-means. For the basic assignment you may return a list with fewer than number colors if there are empty clusters formed during the k-means process.

You should also write a method

public Picture reduceColors(int number)

that calls computeColors and passes the result to mapToColorList. (This can be a one-line function.)

Some things to Consider

Representing the clusters

An easy way to save the colors in a cluster is in an ArrayList. But you have a whole list of clusters. Therefore what you will end up with is an ArrayList of ArrayLists. Its type delaration will be the rather imposing

ArrayList<ArrayList<Color>> clusters;

Remember that you have to first initialize clusters by calling the constructor new ArrayList<ArrayList<Color>>(). You then have to add the appropriate number of empty ArrayList<Color> lists to clusters.

If you do things this way, you will have what is called "parallel structures." You will have an ArrayList of colors, and the clusters structure given above. They are associated, because the color with index j in the list of colors corresponds to the ArrayList with index j in clusters.

(An alternate approach is to create a class to contain a color and its associated cluster. You are welcome to do this, and it is a purer OO design. However, in this case it leads to lots of moving colors from an ArrayList to objects in the new class and then back to an ArrayList and is probably more trouble than it is worth.)

Initial sequence

The process above starts with an initial list of colors. It does not specify where this list of colors comes from. One possibility is to generate random colors. Another is to take the first k unique colors from the list of pixels in the image. (You should verify that there are no duplicates in the list. Duplicate colors will lead to empty clusters.) Try both of these approaches and comment on which works better and why.

Your testing may be simpler if you write an overloaded version of computeColors that takes an initial color list:

public ArrayList<Color> computeColors(ArrayList<Color> initColors)

You can then pass it whatever initial color list you like, rather than modifying the code to call different methods for generating color lists.

Testing ArrayLists for equality

If current and next are two ArrayLists, then

current == next

will only be true if the two references are to the same ArrayList (same memory locations). In class I said that you should not use == (or !=) on objects. You instead want to use equals:

current.equals(next)

If current and next were arrays instead of ArrayLists it would do a shallow compare (asking if the items in the arrays were ==) but for ArrayLists it tests the corresponding pairs of items using equals. (p. 120 of the book discusses this problem in arrays and the way around it.)

Empty clusters

It is possible that when you re-cluster a color may end up with an empty list of associated pixel colors. You can't take the centroid of an empty list. You may just drop that color, meaning that the new list of colors that you generate will be shorter than the list you started with.

Writing helper methods

Writing small helper methods makes your code easier to read and debug.

Recursive solutions

The algorithm above is easy to write using an iterative loop - keep updating clusters until the new clustering is the same as the current one. However, you can also do this recursively. If you are thinking of doing so, read on. Otherwise skip to the next section.

If you try to solve the problem recursively you are very likely to run out of memory. The recursion you do is a special type of recursion called tail recursion, where the very last thing you do is to return the value of the recursive call. If you were in a language like Scheme or Haskell that recognizes tail recursion and converts it to iteration you would be fine. Unfortunately Java compilers do not recognize tail recursion.

Every time that you call computeColors you create a new copy of the variable clusters. Each clustering contains something like 250,000 colors, each of which takes 4 bytes. So each recursive call creates a new megabyte of data. After a few thousand recursive calls you are at 3 GB. It could easily take a few thousand calls before you converge on 256 clusters.

The solution to the problem is to allow Java to garbage collect clusterings that are no longer needed. Setting clusters to null just before the recursive call to computeColors will allow Java to realize that you never will use that clustering again and it will garbage collect it.

Speeding up the computation

The k-means computation can be relatively slow, particularly with large pictures and long color lists. For your early debugging runs I suggest using a smaller picture (e.g beach-smaller.jpg) and a small number of colors (e.g. eight).

The distance function Pixel.colorDistance computes a square root. This is a relatively slow operation, and it is computed a huge number of times. (OK, to be more exact, the number of pixels times the length of the color list each reclustering, and potentially a lot of reclusterings.) But you don't care about the true distance, only relative distances. Therefore you can instead write a method that computes the square of the distance (similar to Pixel.computeDistance, but without calling Math.sqrt) and use that squared distance to decide which color is closest to a pixel. You can also make the variables and return type int instead of double, because integer multiplication and addition is faster than floating point multiplication and addition.

Also, if you are trying to compute a large number of colors on a large picture, an alternative is to compute the colors using a reduced picture (e.g. beach-smaller.jpeg) and then to use those colors in mapToColorList on the larger picture (e.g. beach.jpeg). It may not be quite as good as using k means on the larger picture, but it is likely to be close.

Extra Credit

Better initial color selection

Neither method of generating initial lists of colors is very good. Choosing random colors can result in colors a long way from any pixel in the picture, and those colors are likely to have empty clusters. The first k unique pixel colors are likely to come from the top row of pixels, and are likely to be similar and to not represent the diversity of colors in the picture. This leads to slow convergence for k-means. Invent a better way of choosing the initial color list, test it, and explain why it is better.

Dealing with empty clusters

Empty clusters lead to fewer than the desired number of colors. Using fewer colors leads to poorer images than you could have gotten. Figure out a way to replace a color whose cluster is empty with a useful color, so that the number of clusters does not decrease.

Improve the process in some other way

Think of another way to get better images with a small number of colors. "Better" can mean closer to the original, or just an interesting effect.

Turn in

Submit via Blackboard your Picture.java code, your comment on whether k random colors or the first k colors works better (and why you think this is), and screen shots of your output from your various methods on the image beach.jpg. Include the following screen shots:

mapToColorList called on beach.jpg using the 8 colors given above.
mapToColorList called on a list of 8 colors that you generate by hand. (We should be able to tell by looking at your code or in some other way what your choice of 8 colors is.)
mapToColorList called on lists of random colors of length 8 and 256 (2 images).
k-means starting with random initial color lists of length 8 and 256 (2 images).
k-means starting with the first 8 and the first 256 unique colors in the image (2 images).

Name all of the image files in a way that makes it clear how the image was generated.

If you do extra credit you should electronically hand in appropriate screen shots to demonstrate what you did.

You should comment your code. Modify the header comment to describe your modifications and add your name as an author. Add Java Doc comments to describe what methods do and what the parameters and return values are.

Late assignments will be strongly penalized, so please start early.

Grading rubric

Total of 120 points

Correctness (80 points)

15	Given a list of colors, create an image using only those colors.
10	Generate lists of random colors
15	Correctly cluster the colors given a color list
15	Compute the centroids of a list of clusters.
10	Repeat until the list of color does not change
10	Create the initial list by taking the first k unique colors in the picture.
5	Comments about relative advantages of random initial lists and first k unique colors initial lists.

Structure (15 points)

15	Good decomposition and use of helper methods.

Style (15 points)

2	Comments at top
5	Comments for methods (purpose, parameters, what is returned)
5	Good names for methods, variables, parameters
3	Layout (blank lines, indentation, no line wraps, etc.)

Testing (10 points)

10	The eight screen shots requested above, appropriately labeled.

Extra Credit

The numbers given below are the maximum number of points you can get for each part. How many you get depends on how good your solution is.

15	Better method of generating initial lists
15	Method of dealing with empty clusters
30	Better methods of generating colors