CS-498 Applied Machine Learning - Additional Homework

Additional Homework, due 2 May 23h59 (Mon; midnight)


This homework is for those taking the course for 4 credit hours. You should do this homework in groups of up to three; details of how to submit have been posted on Piazza. You can use any programming language you care to, but I think you'll prefer R because it has tools for this (lm and glmnet).


Details and description subject to minor changes


A bunch of wide datasets, from cancer genetics: In "Clustering Cancer Gene Expression Data: a Comparative Study", by Marcilio C. P. de Souto, Ivan G. Costa, Daniel S. A. de Araujo, Teresa B. Ludermir, and Alexander Schliep, the authors clustered gene expression data for various cancers. In particular, they collected 35 cancer gene expression datasets. You can find descriptive text here. There is a link to the datasets on that page; it links to here. Each dataset is "wide" (more predictors than examples). Each dataset contains predictors for various classes of medical issue; the number of classes ranges from 2 to 14. This exercise will involve more dataset jockeying than before; that's life.
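
Since part of the dataset jockeying is checking what you actually loaded, here is a minimal Python sketch of a sanity check for "wide" (more predictors than examples) and for the number of classes. The function name describe_dataset, the toy numbers, and the labels are made up for illustration; it also assumes one example per row, so if a file stores one predictor per row you would transpose first.

```python
from typing import Dict, Sequence


def describe_dataset(rows: Sequence[Sequence[float]],
                     labels: Sequence[str]) -> Dict[str, object]:
    """Summarize a dataset stored with one example per row (an assumption)."""
    if len(rows) != len(labels):
        raise ValueError("need exactly one label per example")
    n_examples = len(rows)
    n_predictors = len(rows[0]) if rows else 0
    if any(len(r) != n_predictors for r in rows):
        raise ValueError("all examples must have the same number of predictors")
    return {
        "n_examples": n_examples,
        "n_predictors": n_predictors,
        "n_classes": len(set(labels)),
        # "Wide" in the sense used above: more predictors than examples.
        "is_wide": n_predictors > n_examples,
    }


# Made-up toy data (not from the real datasets): 3 examples, 5 predictors.
toy_rows = [
    [0.1, 1.2, 0.0, 3.4, 2.2],
    [0.3, 0.9, 0.1, 3.1, 2.0],
    [2.5, 0.2, 1.7, 0.4, 0.1],
]
toy_labels = ["tumor_a", "tumor_a", "tumor_b"]

summary = describe_dataset(toy_rows, toy_labels)
print(summary)
```

Running a check like this on each dataset before fitting anything should catch a transposed matrix or a misread label column early.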