CS-498 Applied Machine Learning - Homework 5
CS-498 Applied Machine Learning
D.A. Forsyth --- 3310 Siebel Center
daf@uiuc.edu, daf@illinois.edu
13:00 - 14:15 OR 1:00 pm - 2:15 pm (in old fashioned time)
WF
1404 Siebel Center
TAs:
Tanmay Gangwani gangwan2@illinois.edu
Jiajun Lu jlu23@illinois.edu
Jason Rock jjrock2@illinois.edu
Anirud Yadav ayadav4@illinois.edu
Office Hours
- DAF: Mon 10-11 and Fri 2:30 - 3:30
- Jason: Mon 11-12 and Tues 1:30 - 2:30
- Jiajun: Tues 2:30 - 3:30 and Wed 5 - 6
- Anirud: Wed 1-2 and Thur 4 - 5
- Tanmay: Mon 4 - 5 and Fri 4 - 5
Or swing by my office (3310 Siebel) and see if I'm busy.
Evaluation is by homeworks and a take-home final.
I will shortly post a policy on collaboration and plagiarism.
Homework 5: Due 20 Mar 2017 23h59 (Mon; midnight)
You should do this homework in groups of up to three; details of how to submit have been posted on Piazza.
- Linear regression with various regularizers. The UCI Machine Learning dataset repository hosts a dataset giving features of music, and the latitude and longitude from which that music originates, here. Investigate methods to predict latitude and longitude from these features, as below. There are actually two versions of this dataset; either one is OK by me, but I think you'll find the one with more independent variables more interesting. You should ignore outliers (by this I mean you should ignore the whole question of outliers; do not try to deal with them). You should regard latitude and longitude as entirely independent.
- First, build a straightforward linear regression of latitude (resp. longitude) against features. What is the R-squared? Plot a graph evaluating each regression. (A sketch of the mechanics appears after this list.)
- Does a Box-Cox transformation improve the regressions? Notice that the dependent variable has some negative values, which Box-Cox doesn't like. You can deal with this by remembering that these are angles, so you get to choose the origin; explain why this is legitimate. For the rest of the exercise, use the transformation if it does improve things; otherwise, use the raw data.
- Use glmnet to produce the following (see the second sketch after this list):
- A regression regularized by L2 (equivalently, a ridge regression). You should estimate the regularization coefficient that produces the minimum error. Is the regularized regression better than the unregularized regression?
- A regression regularized by L1 (equivalently, a lasso regression). You should estimate the regularization coefficient that produces the minimum error. How many variables are used by this regression? Is the regularized regression better than the unregularized regression?
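To make the mechanics of the first two parts concrete, here is a minimal R sketch of the straightforward regression and the Box-Cox check. The file name, and the assumption that the last two columns are latitude and longitude, match one version of the UCI tracks file but should be checked against whatever you download; the rest is standard lm and MASS usage, and longitude is handled the same way, entirely independently.

    library(MASS)    # for boxcox()

    # File name is an assumption; the UCI page offers two versions.
    music <- read.csv("default_plus_chromatic_features_1059_tracks.txt",
                      header = FALSE)
    p   <- ncol(music) - 2          # assume the last two columns are lat, lon
    X   <- as.matrix(music[, 1:p])
    lat <- music[, p + 1]

    # Straightforward linear regression of latitude against the features.
    fit <- lm(lat ~ X)
    summary(fit)$r.squared          # R-squared
    plot(fitted(fit), resid(fit),   # evaluate: residuals against fitted values
         xlab = "fitted latitude", ylab = "residual")

    # Box-Cox wants a positive response; latitude is an angle, so we may
    # shift the origin to make every value positive.
    lat.pos <- lat - min(lat) + 1
    bc  <- boxcox(lat.pos ~ X)      # plots profile likelihood over lambda
    lam <- bc$x[which.max(bc$y)]    # best lambda
    fit.bc <- lm(((lat.pos^lam - 1) / lam) ~ X)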
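And a sketch of the glmnet part, continuing with the X and lat defined above. cv.glmnet estimates the regularization coefficient (lambda) by cross-validation; alpha = 0 gives ridge, alpha = 1 gives the lasso.

    library(glmnet)

    # Ridge regression (L2): pick lambda by cross-validated error.
    ridge <- cv.glmnet(X, lat, alpha = 0)
    ridge$lambda.min                # lambda with minimum CV error
    min(ridge$cvm)                  # the minimum cross-validated MSE
    plot(ridge)                     # CV error as a function of lambda

    # Lasso (L1).
    lasso <- cv.glmnet(X, lat, alpha = 1)
    lasso$lambda.min
    min(lasso$cvm)
    # Variables used at the best lambda: nonzero coefficients, less the intercept.
    sum(coef(lasso, s = "lambda.min") != 0) - 1

Comparing min(ridge$cvm) and min(lasso$cvm) against a cross-validated error for the unregularized fit is one way to answer the "is it better" questions.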
- Logistic regression. The UCI Machine Learning dataset repository hosts a dataset giving whether a Taiwanese credit card user defaults against a variety of features, here. Use logistic regression to predict whether the user defaults. You should ignore outliers, but you should try the various regularization schemes we have discussed; a sketch follows.
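A hedged sketch of one way to do this with glmnet. The UCI file is an Excel sheet, so export it to CSV first; the file name and column layout assumed below (an ID column first, the 0/1 default label last) should be checked against your export.

    library(glmnet)

    credit <- read.csv("default_of_credit_card_clients.csv")  # assumed export
    y <- credit[, ncol(credit)]                  # 0/1 default label (last column)
    X <- as.matrix(credit[, -c(1, ncol(credit))])   # drop the ID column too

    # Try the regularization schemes: alpha = 0 is ridge, alpha = 1 is the
    # lasso, and intermediate alpha is the elastic net.
    for (a in c(0, 0.5, 1)) {
      cvfit <- cv.glmnet(X, y, family = "binomial", alpha = a,
                         type.measure = "class")   # misclassification rate
      cat("alpha =", a,
          " lambda.min =", cvfit$lambda.min,
          " CV error =", min(cvfit$cvm), "\n")
    }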
- A wide dataset, from cancer genetics: In "Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays" by U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, Proc. Natl. Acad. Sci. USA, Vol. 96, Issue 12, 6745-6750, June 8, 1999, the authors collected data giving gene expressions for tumorous and normal colon tissues. You will find this dataset here. There is a matrix of gene expression levels for 2000 genes (these are the independent variables) for 62 tissue samples. As you can see, there are a lot more independent variables than there are data items. At that website, you will also find a file giving which sample is tumorous and which is normal.
- Use a binomial regression model (i.e. logistic regression) with the lasso to predict tumorous/normal. Use cross-validation to assess how accurate your model is. Report both AUC (see the note below) and deviance. How many genes does the best model use? A sketch follows.
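A minimal sketch, assuming the expression levels have been read into a 62 x 2000 matrix X and the tumor/normal labels into a 0/1 vector y (those names, and the five-fold choice, are assumptions; the files at the website need a little parsing first).

    library(glmnet)

    # Lasso-regularized logistic regression, cross-validated twice: once
    # scoring by binomial deviance, once by AUC. Five folds keep each fold
    # large enough for glmnet's AUC computation on only 62 samples.
    cv.dev <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                        type.measure = "deviance", nfolds = 5)
    cv.auc <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                        type.measure = "auc", nfolds = 5)

    min(cv.dev$cvm)     # cross-validated deviance at the best lambda
    max(cv.auc$cvm)     # cross-validated AUC at the best lambda

    # Genes the best model uses: nonzero coefficients, excluding the intercept.
    sum(coef(cv.dev, s = "lambda.min") != 0) - 1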
AUC is one standard measure of classification performance, reported by glmnet; look this up here, but the key phrase is: "When using normalized units, the area under the curve (often referred to as simply the AUC, or AUROC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative')."