PS-5, due Nov 13

Social Networks, BFS, and the Kevin Bacon Game


Important organizational notes

For this assignment, you are permitted to work with one other student currently in the class. You do not have to work with someone else, but you have the option of doing so. If you choose to work with a partner, you will both get the same grade on the assignment.

There will be no penalty, in terms of points, for working together on this assignment. Please make sure that both of you submit the code electronically. When you submit on Blackboard, be sure to state the name of your partner in the comments section.

You should weigh whether you will get more out of this assignment working alone or with someone else. The choice is up to you.

If you choose to work with someone else, pick your partner carefully. Make sure that there are times that you are both available to work together. If you frequent the lab and you notice someone else who often is there when you are, that person might be a good choice as your partner.

General Instructions

Before you do anything, please read the entire assignment carefully. It contains many suggestions which can save you a lot of struggle later on.

Background Information on Social Networks and the Kevin Bacon Game

A social network is a set of people with some rules for determining pairs who are "related" to one another. Social networks can be represented as graphs, with the people as vertices and the "related to" relationship defining the edges. Examples are Facebook and MySpace, where the vertices are people who have created web pages in the system and the edges are friend relationships. These relations are symmetrical, so the resulting graph is undirected. In Twitter the relationship is "follows," which is not symmetrical, so the resulting graph is directed.

In this problem you will write a program to play the Kevin Bacon game. The vertices in this network are actors and the edge relationship is "appeared together in a movie". The goal is to find the shortest path between two actors. Traditionally the goal is to find the shortest path to Kevin Bacon. The following output from the sample solution shows how the game is played:

To quit the program, type return in answer to a question.
Enter the name of an actor: Diane Keaton
Diane Keaton's number is 2
Diane Keaton appeared in Hanging Up (2000) with Meg Ryan
Meg Ryan appeared in In the Cut (2003) with Kevin Bacon

Enter the name of an actor: Buster Keaton
Buster Keaton's number is 5
Buster Keaton appeared in Limelight (1952) with Claire Bloom
Claire Bloom appeared in Haunting, The (1963) with Julie Harris
Julie Harris appeared in Requiem for a Heavyweight (1962) with Mickey Rooney
Mickey Rooney appeared in Erik the Viking (1989) with Tim Robbins
Tim Robbins appeared in Mystic River (2003) with Kevin Bacon

Enter the name of an actor: 

So based on the data set we supply for this problem, Diane Keaton's Bacon Number is two, and Buster Keaton's Bacon Number is five.

A geekier example is Erdos Numbers. Paul Erdos was a Hungarian mathematican who published over 1500 papers, most of which were co-authored. He spent his life traveling from university to university, living on speaking honoraria and the hospitality of those he visited. He would give talks and collaborate on research. The vertices of this graph are mathematicians and scientists, and the edge relationship is, "were co-authors on a scholarly paper". One's "Erdos number" is the number of papers (edges) one must pass through to get to Paul Erdos. For more on Erdos numbers, see http://www.oakland.edu/enp/.

Programming Exercises

Problem 1: Implementing Breadth-First Search

The easiest way to play the Kevin Bacon game is to do what is called breadth-first search (BFS) in the movie data graph. This builds a tree of shortest paths from every actor who can reach Kevin Bacon back to Kevin Bacon. Or more generally, given a graph G and a root vertex r in G, BFS builds a "shortest-path tree", which consists of all vertices of G that are reachable from r. This tree has the property that, for every vertex v in the tree, v's parent in the tree is the first vertex that appears on a shortest path in G from v to the root r.

To implement BFS we use a queue. We also need an undirected graph, which can be represented using the class AdjacencyListGraphMap. The result of our BFS is the shortest-path tree described above. You will be developing a class DirectedAdjListMap for SA-10, and should use this to represent the tree.

The pseudocode describing BFS is:

  Insert root into an empty queue Q and into a new directed graph T

  Until Q is empty
    dequeue Q to get next vertex v to process
      for each edge e that is incident to v in G
        Let v' be the other end of the edge
        if v' is not in T
          add v' to T and add an edge with the same label as e from v' to v in T
          enqueue v' in Q
  return T

When you are done, T holds a shortest-path tree, also known as BFS tree. To find the Bacon number of an actor, look the actor up in T. If there is no vertex for that actor in T, then the actor is not connected to the root. If the actor is there, follow edges of T back to the root, printing movies (edge labels) and actors (vertices) along the way.

Test your function on a graph with the following vertices and edges:

vertices: 
  "Kevin Bacon", "actor1", "actor2", "actor3", "actor4", "actor5", "actor6"
  
edges: 
  ("Kevin Bacon", "actor1", "movie1")
  ("Kevin Bacon", "actor2", "movie1")
  ("actor1", "actor2", "movie1") 
  ("actor1", "actor3", "movie2") 
  ("actor3", "actor2", "movie3") 
  ("actor3", "actor4", "movie4")
  ("actor5", "actor6", "movie5") 

What to Turn In

In the folder that you will zip and submit via Blackboard, include (1) your code that implements BFS, (2) testing code that creates the graph specified above and runs BFS on it, (3) screenshots showing (i) the graph created and (ii) the digraph (directed graph, which is the tree) returned by BFS.

Problem 2: Reading Information and Creating Actor/Movie Graph

You are now ready to create a module that plays the Kevin Bacon game. The first thing to do is to read movie data files and create the graph. This data was download from: http://www.cs.luther.edu/~bmiller/CS151/Spring05/index.html. You will read three files: actors.txt, movies.txt, and movie-actors.txt. They are large files - actors.txt contains 9,235 actors, movies.txt contains 7,067 movies, and movies-actor.txt contains 21,370 movie-actor pairs, resulting in 32,337 edges in the final graph. So while you are developing your program, use the smaller files actorsTest.txt, moviesTest.txt, and movie-actorsTest.txt, whose data represent the graph specified earlier. These six files can be downloaded in a single zip file: ps5data.zip.

The files are all formatted the same way. Each line has two quantities separated by a "|". In the actors file the quantities are actorID and actorName. In the movies file they are movieID and movieName. In the movies-actors file they are movieID and actorID, indicating that the actor associated with actorID appeared in the movie associated with movieID.

Use the file contents to build a graph whose vertices are labeled with actor names (not IDs). Create an edge between two actors if they appeared in the same movie, and label that edge with the name of that movie. You should assume that no movie appears twice in the movies file and that no actor appears twice in the actors file. It is OK for there to be multiple edges between a pair of actors if they appeared together in multiple movies.

You may find it useful to create maps for mapping IDs to actor names and IDs to movie names. You can also use a map to figure out which actors appeared in each movie, and can use that information to add the appropriate edges to the graph. This may take a little thought, but try it by hand on the small data set given above (the files with "Test" in their names).

What to Turn In

In the folder that you will zip and submit via Blackboard, include (1) your code that reads the files and produces the graph, and (2) screenshot showing the graph created from the files moviesTest.txt, actorsTest.txt, and movies-actorsTest.txt.

Problem 3: Implement the Bacon Game

You now have what you need to implement the Bacon game. Perform BFS on the graph with "Kevin Bacon" as root and save the BFS tree returned. Then ask for a series of actors. For each one, print out the path between that actor and the source, or say that none exists. This will require following a path from the chosen actor in the BFS tree back to the root. (See above for an example of how this might be formatted.) If the user gives a name that is not in the original graph (I mean the original graph, and not the tree), say so and prompt again.

Test your program on the movieTest.txt, actorTest.txt, and movie-actorTest.txt files. Make sure to demonstrate that your program works for boundary conditions.

When you are sure that your program works on the test data, change it to use the movie.txt, actor.txt, and movie-actor.txt files. Demonstrate that your program works for these as well.

What to Turn In

In the folder that you will zip and submit via Blackboard, include (1) your code that plays the game, and (2) screenshots of the test runs for both the Test files and for the large files.

Extra Credit

  1. Extra credit: It is annoying that if you don't get the exact form of the name you will not find the actor. For example, if you type "Timothy Allen" instead of "Tim Allen", "Kate Blanchett" instead of "Cate Blanchett" or "Katherine Hepburn" instead of "Katharine Hepburn" you will not find the actor. Devise some sort of scheme for finding "near matches" and reporting near matches to the name entered, so the user can re-enter the correct form of the name.

  2. Extra credit: Statistical analysis: Compute some interesting statistics about this graph. Some possible examples: Find an actor with the largest finite Bacon number for this data set. Find the average Bacon number for those actors with finite Bacon numbers. Find the destination actor that minimizes either longest path length or average path length from all other actors.

What to Turn In

If you choose to do the extra credit problems, in the folder that you will zip and submit via Blackboard, include:

  1. Your code for finding "near matches" and a demonstration that it works.
  2. Your code for computing the various statistical measures, along with a demonstration that it works on the Test graph and your results for the real graph.

General directions for turning in your homework

Submit via Blackboard the zip file of a folder that contains the items for Problems 1, 2, and 3, as already explained.

Grading rubric

Total of 120 points

Correctness, Elegance, and Efficiency (90 points)

25BFS works correctly
10Reads data files
20Creates the graph
15Interactive interface with the user (read names, print results)
15Traces the path from an actor back to Kevin Bacon
5Handles boundary cases correctly.

Structure (12 points)

7Good decomposition into objects and methods.
2Proper use of instance and local variable
1Instance variables are private
2Proper use of parameters

Style (8 points)

2Comments for classes
2Comments for methods (purpose, parameters, what is returned) in JavaDoc form.
2Good names for methods, variables, parameters
1Layout (blank lines, indentation, no line wraps, etc.)
1Other unspecified style issues

Testing (10 points)

5Include all of the test runs specified above.
5Demonstrate that the code works for boundary cases

Extra Credit

20Write a near-match finder
5Compute the actor with the largest Bacon number
5Compute the average Kevin Bacon number (for those with finite Bacon numbers)
10Find the actor who minimizes the maximum distance to any other actor
10Find the actor who minimizes the average distance to all other actors
?Other interesting additions