CS-498 Applied Machine Learning - Optional Homework

CS-498 Applied Machine Learning

D.A. Forsyth --- 3310 Siebel Center

daf@uiuc.edu, daf@illinois.edu

15:30 - 16:45 OR 3.30 pm-4.45 pm, in old money
1320 Digital Computer Laboratory


Mariya Vasileva mvasile2@illinois.edu

Sili Hui silihui2@illinois.edu

Daeyun Shin dshin11@illinois.edu

Ayush Jain ajain42@illinois.edu

Office Hours:

Ayush Fri - 14h00-16h00 or 2-4 pm, location: in front of 3304

Daeyun Thu - 11h00-13h00 or 11 am-1 pm, location: 0207 Siebel

Mariya Wed - 15h00-17h00 or 3-5 pm, location: 0207 Siebel

Sili Thu - 12h00-14h00 or 12-2 pm, location: 0207 Siebel

DAF Mon - 14h00-15h00, Fri - 14h00-15h00

or swing by my office (3310 Siebel) and see if I'm busy

Evaluation is by homeworks and a take-home final.

I will shortly post a policy on collaboration and plagiarism.





Optional Homework, due 2 May 23h59 (Mon; midnight)


This homework is for remission of sins, etc., or in case you enjoy these things. It's not required, but if you did poorly in an earlier homework and well in this one, I'll use that in putting your grade together. You should do this homework in groups of up to three; details of how to submit have been posted on Piazza. You can use any programming language you care to, but I think you'll prefer R because it has tools for this (lm and glmnet).


Details and description subject to minor changes


A wide dataset, from cancer genetics: In "Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays" by U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, Proc. Natl. Acad. Sci. USA, Vol. 96, Issue 12, 6745-6750, June 8, 1999, the authors collected data giving gene expressions for tumorous and normal colon tissues. You will find this dataset here. There is a matrix of gene expression levels for 2000 genes (these are the independent variables) for 62 tissue samples. As you can see, there are far more independent variables (2000) than there are data items (62). At that website, you will also find a file giving which sample is tumorous and which is normal.

AUC: this is one standard measure of classification performance, and it is reported by glmnet; look this up here, but the key phrase is "When using normalized units, the area under the curve (often referred to as simply the AUC, or AUROC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative')."
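
To make the quoted definition concrete, here is a minimal sketch in Python that computes the AUC directly as the fraction of (positive, negative) pairs in which the positive instance gets the higher score. The function name pairwise_auc, the toy scores and labels, and the choice to count a tied pair as one half are all assumptions made for illustration; this is not how glmnet computes it internally, and the brute-force loop over pairs is only sensible for small datasets like this one.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """Estimate AUC as the fraction of (positive, negative) pairs in which
    the positive instance receives the higher score.

    Assumption for this sketch: labels are 1 (positive) or 0 (negative),
    and a tied pair counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")

    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0   # positive ranked above negative
        elif p == n:
            wins += 0.5   # tie: half credit (a convention, assumed here)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: five classifier scores and their true labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = pairwise_auc(scores, labels)
print(f"AUC = {auc:.4f}")
```

In the toy data there are 3 positives and 2 negatives, so 6 pairs; the positive scored 0.35 loses to the negative scored 0.6, and the other five pairs are ranked correctly, giving 5/6. A classifier that scores at random would land near 0.5, and a perfect ranking gives 1.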