CS-199 Big Data Homework 3

Predicting Crime Rates:

In this homework, you will apply regression methods to predict crime rates using US census data. The UC Irvine machine learning dataset repository has a dataset for predicting per capita crime rates at this URL. I have tried to structure the homework so that you can do it with minimal programming skills, but more programming will produce a more rewarding exercise. You may do the homework in groups of up to four, though I expect more ambitious work from larger groups. It is due 8 April 2014. Submit a PDF by email that names all group members, explains your results and how you got them at each step, and presents all relevant details. Please ensure the email has CS199-BD in the subject.

Potential problems:

On the web page, you will see that some variables are marked "not predictive". Drop these explanatory variables. You will also notice that some variables have missing values for some instances. You will need a strategy to deal with missing values. In the first instance, try dropping explanatory variables that have missing values. You should evaluate your regression by (a) looking at the mean-squared error on the training data and (b) splitting off some test data and looking at the mean-squared error on that test data.
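
If you want a starting point, the steps above (drop columns with missing values, split off test data, compute mean-squared error) can be sketched in plain Python. This is only a sketch under assumptions: I assume missing entries appear as "?" in the raw file, that the target is the last column, and the function names and toy rows are made up for illustration.

```python
import random

MISSING = "?"  # assumed missing-value marker in the raw file

def drop_columns_with_missing(rows):
    """Keep only the columns that have a value in every row."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if all(r[j] != MISSING for r in rows)]
    return [[float(r[j]) for j in keep] for r in rows], keep

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a held-out test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    return shuffled[n_test:], shuffled[:n_test]

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy rows (not real data): three explanatory columns, then the target.
rows = [
    ["0.1", "?", "0.5", "0.20"],
    ["0.4", "0.3", "0.1", "0.67"],
    ["0.9", "0.2", "0.7", "0.43"],
    ["0.3", "?", "0.2", "0.12"],
    ["0.6", "0.8", "0.9", "0.03"],
]
clean, kept = drop_columns_with_missing(rows)   # column 1 is dropped
train, test = train_test_split(clean, test_fraction=0.2)
```

Report the error on both `train` and `test`; a large gap between the two is itself worth discussing in your write-up.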

Basics: trying a linear regression:

You should build a linear regression of violent crimes per population against explanatory variables that don't have missing values. Check your regression --- are there problem data points? Which ones, and why? Does a Box-Cox transformation make things better?
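
One way to set this up, if you are working in Python with NumPy, is sketched below. It fits a least-squares line with an intercept, computes residuals divided by their standard deviation as a crude way to spot problem points, and applies the Box-Cox transform (y^lambda - 1)/lambda, with log y at lambda = 0. The data here is synthetic, the threshold of 3 is an arbitrary choice of mine, and all function names are illustrative. Box-Cox needs strictly positive targets, so if any target value is zero you would have to add a small offset first; that offset is my assumption, not part of the assignment.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y against X with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_linear(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def box_cox(y, lam):
    """Box-Cox transform; y must be strictly positive."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Map transformed values back to the original scale."""
    z = np.asarray(z, dtype=float)
    if lam == 0:
        return np.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)

def standardized_residuals(y, y_hat):
    """Residuals divided by their standard deviation (a crude check)."""
    r = y - y_hat
    return r / r.std()

# Synthetic stand-in data (not the crime data): positive, log-linear target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.exp(X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=200))

beta_raw = fit_linear(X, y)
y_hat_raw = predict_linear(beta_raw, X)
mse_raw = np.mean((y - y_hat_raw) ** 2)

lam = 0.0  # try several values of lambda and compare
beta_bc = fit_linear(X, box_cox(y, lam))
y_hat_bc = inverse_box_cox(predict_linear(beta_bc, X), lam)
mse_bc = np.mean((y - y_hat_bc) ** 2)  # compared on the original scale

# Indices of points whose residual is unusually large.
suspects = np.where(np.abs(standardized_residuals(y, y_hat_raw)) > 3)[0]
```

Note that the two errors are only comparable if you map the transformed predictions back to the original scale before computing the error, as done for `mse_bc`.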

Basics: trying nearest neighbors regression:

Does a nearest neighbors regression work better? Again, just drop data points where there are missing values. Does it help to rescale the variables?
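
To see why rescaling can matter, here is a small NumPy sketch of nearest neighbors regression with and without standardizing the columns. The toy numbers are made up so that the second column has a much larger range than the first; the choice of Euclidean distance, the averaging of neighbor targets, and the function names are my assumptions.

```python
import numpy as np

def standardize(X_train, X_other):
    """Rescale columns to zero mean and unit variance, using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns unchanged
    return (X_train - mu) / sigma, (X_other - mu) / sigma

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query as the mean target of its k nearest training points."""
    preds = []
    for q in X_query:
        d = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Toy data: the target follows column 0, but column 1 has a far larger range.
X_train = np.array([[0.0, 100.0], [0.1, 900.0], [1.0, 120.0], [0.9, 880.0]])
y_train = np.array([0.0, 0.0, 1.0, 1.0])
X_query = np.array([[0.05, 880.0]])

pred_raw = knn_predict(X_train, y_train, X_query, k=1)

X_train_s, X_query_s = standardize(X_train, X_query)
pred_scaled = knn_predict(X_train_s, y_train, X_query_s, k=1)
```

Without rescaling, the large-range column decides which neighbor is closest; after rescaling, both columns contribute on an equal footing, and the two predictions differ.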

Dealing with missing values by imputing them:

Now we look at the explanatory variables that have missing values. Impute these values using nearest neighbors (just like filling in textures). Now apply a linear regression --- did things get better? You have more data, but some of the values are imputed.
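
One possible reading of nearest neighbors imputation is sketched below: for each row with gaps, find the k closest rows that have no gaps, measuring distance only on the columns the incomplete row does have, and fill each gap with the average of those neighbors. This is my interpretation, not the only valid one. It assumes missing entries are stored as NaN and that at least one row is complete; the function name and toy data are illustrative.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaN entries using the k nearest complete rows (assumes some exist)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]  # donor rows with no missing values
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        # Distance to each donor, using only the columns this row has.
        d = np.sqrt(((complete[:, observed] - row[observed]) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        filled[i, missing] = complete[nearest][:, missing].mean(axis=0)
    return filled

# Toy data (not real): the third column has gaps in two rows.
X = np.array([
    [0.0, 0.0, 1.0],
    [0.1, 0.0, np.nan],
    [5.0, 5.0, 9.0],
    [5.1, 5.0, 11.0],
    [5.0, 5.1, np.nan],
])
X_filled_1 = knn_impute(X, k=1)
X_filled_2 = knn_impute(X, k=2)
```

When you compare against the earlier regression, keep track of which values were imputed, since errors in the imputed values feed directly into the fit.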

Modified nearest neighbors:

Modify the nearest neighbor method so that it takes into account only variables that have values. Does this improve the predictions?
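
A minimal sketch of such a modified distance follows, again assuming missing entries are stored as NaN. The distance between two points is computed only over the variables that both points have, and I divide by the number of shared variables so that pairs sharing many variables are not penalized relative to pairs sharing few. That normalization is a design choice of mine, and you may want to try alternatives. The function names and toy data are illustrative.

```python
import numpy as np

def masked_distance(a, b):
    """Root-mean-square difference over the coordinates present in both a and b."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf  # nothing to compare, so treat the pair as infinitely far apart
    return np.sqrt(((a[shared] - b[shared]) ** 2).mean())

def masked_knn_predict(X_train, y_train, X_query, k=3):
    """Nearest neighbors regression that ignores missing coordinates."""
    preds = []
    for q in X_query:
        d = np.array([masked_distance(q, x) for x in X_train])
        nearest = np.argsort(d)[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Toy data (not real): each training row is missing a different variable.
X_train = np.array([[0.0, np.nan], [1.0, 1.0], [np.nan, 5.0]])
y_train = np.array([0.0, 1.0, 5.0])
X_query = np.array([[0.1, np.nan], [np.nan, 4.5]])

preds = masked_knn_predict(X_train, y_train, X_query, k=1)
```

With this approach no rows or columns need to be dropped and nothing is imputed, so it is a useful third point of comparison against the two earlier strategies.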