CS-498 Applied Machine Learning - Optional Homework

CS-498 Applied Machine Learning

Evaluation is by homeworks and a take-home final.

Optional Homework, due 2 May 23h59 (Mon; midnight)


This homework is for remission of sins, etc., or in case you enjoy these things. It's not required, but if you did poorly in an early homework and well in this one, I'll take that into account in putting your grade together. You should do this homework in groups of up to three; details of how to submit have been posted on Piazza. You can use any programming language you care to, but I think you'll prefer R because it has tools for this (lm and glmnet).


Details and description subject to minor changes


A wide dataset, from cancer genetics: In "Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays" by U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, Proc. Natl. Acad. Sci. USA, Vol. 96, Issue 12, 6745-6750, June 8, 1999, the authors collected data giving gene expressions for tumorous and normal colon tissues. You will find this dataset here. There is a matrix of gene expression levels for 2000 genes (these are the independent variables) for 62 tissue samples. As you can see, there are far more independent variables (2000) than data items (62). At that website, you will also find a file saying which samples are tumorous and which are normal.

AUC: This is one standard measure of classification performance, and glmnet reports it. Look this up here, but the key phrase is "When using normalized units, the area under the curve (often referred to as simply the AUC, or AUROC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative')."
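
To make the quoted definition concrete, here is a minimal sketch in Python (any language is allowed for this homework) that computes AUC directly as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function name pairwise_auc, the six scores, and the labels are all made up for illustration; counting a tie as half a pair is a common convention, not something the quote specifies, and this is not how glmnet computes its reported value internally.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: classifier outputs, higher meaning "more likely positive".
    labels: 1 for positive (e.g. tumorous), 0 for negative (e.g. normal).
    Ties are counted as half a correct pair (an assumed convention).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six samples; not taken from the colon dataset.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.1]
labels = [1, 1, 1, 0, 0, 0]

# 7 of the 9 (positive, negative) pairs are ordered correctly, so AUC = 7/9.
auc = pairwise_auc(scores, labels)
print(auc)
```

A classifier that ranks every positive above every negative gets 1.0 under this definition, and one that scores everything identically gets 0.5. The double loop costs one comparison per pair, which is fine for 62 samples; rank-based formulas are the usual choice for larger data.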