D.A. Forsyth --- 3310 Siebel Center

A cranky female duck (a mallard, I think) tries to grab a piece of cheese sandwich from a child in a punt.

TAs:

Homework 4: Due 5 Mar 2018, 23h59 (Monday, midnight)

You may do this homework in groups of up to 3 contributors. Groups of 1 or 2 are just fine, too. A group can consist of any mixture of any type of student (MSCS-DS/Online/Face). We do not offer coordination services for complex group interactions, and you may want to take this into account when forming your group.

Submission: Homework 4 submission details TBA.

You may use any programming language that amuses you for this homework.

Problem 1

You can find a dataset dealing with European employment in 1979 at http://lib.stat.cmu.edu/DASL/Stories/EuropeanJobs.html. This dataset gives, for each of a set of European countries, the percentage of people employed in each of a set of industries in 1979. Notice this dataset contains only 26 data points. That's fine; it's intended to give you some practice in the visualization of clustering.

  1. Use an agglomerative clusterer to cluster this data. Produce a dendrogram for each of single link, complete link, and group average clustering. You should label the countries on the axis. What structure in the data does each method expose? It's fine to look for code rather than writing your own. Hint: I made plots I liked a lot by using R's hclust clustering function, then turning the result into a phylogenetic tree and using a fan plot, a trick I found on the web; try plot(as.phylo(hclustresult), type='fan'). A sketch of this approach appears after the list. You should see dendrograms that "make sense" (at least if you remember some European history) and have interesting differences.
  2. Using k-means, cluster this dataset. What is a good choice of k for this data, and why? (One standard way to investigate this is sketched after the list.)
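
A minimal sketch of the hinted approach in R, assuming the dataset has been saved locally as europeanjobs.txt, tab-separated with a header row and the country names in a first column called Country (the file name and layout are assumptions; adjust to match your download):

    library(ape)  # provides as.phylo and the fan-style plot

    jobs <- read.table("europeanjobs.txt", header = TRUE, sep = "\t")
    features <- jobs[, -1]              # drop the country-name column
    rownames(features) <- jobs$Country  # label dendrogram leaves with countries

    d <- dist(features)  # Euclidean distances between countries

    # One dendrogram per linkage: single link, complete link, group average
    for (method in c("single", "complete", "average")) {
      hc <- hclust(d, method = method)
      plot(as.phylo(hc), type = "fan")
      title(main = method)
    }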
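
For choosing k in the k-means part, one standard diagnostic is to run kmeans over a range of k and plot the total within-cluster sum of squares (an "elbow" plot). A minimal sketch, assuming the features data frame from the sketch above:

    set.seed(1)  # kmeans uses random restarts
    wss <- sapply(1:10, function(k) {
      kmeans(features, centers = k, nstart = 20)$tot.withinss
    })
    plot(1:10, wss, type = "b",
         xlab = "number of clusters k",
         ylab = "total within-cluster sum of squares")
    # Look for the k where the curve stops dropping sharply; with only
    # 26 data points, a small k is sensible.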

Problem 2

Do exercise 6.2 in the Jan 15 version of the course text.

Questions about the homework
  1. Can we use learning vector quantization (LVQ) functions like lvqinit, lvqtest, and lvq1, available in the 'class' library in R, for this exercise?

    Answer: Sure, though I don't know how well they work for this, as I haven't used the package. For this one, it may be simpler to build your own than to understand the package.

  2. How should we handle test/train splits?

    Answer: You should not test on examples that you used to build the dictionary, but you can train on them. In a perfect world, I would split the volunteers into a dictionary portion (about half), then do a test/train split for the classifier on the remaining half. You can't do that here, because for some signals there are very few volunteers. Instead, for each category, choose 20% of the signals (or as close as you can get) to be test data. Then use the others both to build the dictionary and to build the classifier. A sketch of this split appears after these questions.

  3. When we carve up the signals into blocks for making the dictionary, what do we do about leftover bits at the end of the signal?

    Answer: Ignore them; they shouldn't matter (think through the logic of the method again if you're uncertain about this). A sketch of carving a signal into blocks appears below.
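
For question 2, here is a hypothetical sketch of the per-category split in R. It assumes a data frame signals with one row per signal and a category column; these names are illustrative, not part of the exercise:

    set.seed(1)
    # For each category, hold out roughly 20% of its signals as test data
    test_idx <- unlist(lapply(
      split(seq_len(nrow(signals)), signals$category),
      function(idx) {
        n_test <- max(1, round(0.2 * length(idx)))
        idx[sample.int(length(idx), n_test)]  # sample.int avoids sample()'s scalar quirk
      }
    ))
    test_set  <- signals[test_idx, ]
    train_set <- signals[-test_idx, ]  # used both to build the dictionary and to train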
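
For question 3, a small sketch of carving one signal (a numeric vector) into non-overlapping blocks, dropping the leftover samples at the end as the answer suggests; block_size is whatever block length you chose, and the function name is made up:

    carve_blocks <- function(x, block_size) {
      n_blocks <- floor(length(x) / block_size)
      if (n_blocks == 0) return(NULL)         # signal shorter than one block
      x <- x[seq_len(n_blocks * block_size)]  # drop the leftover tail
      matrix(x, nrow = n_blocks, ncol = block_size, byrow = TRUE)  # one block per row
    }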