CS-498 Applied Machine Learning - Homework 3

CS-498 Applied Machine Learning

D.A. Forsyth --- 3310 Siebel Center

daf@uiuc.edu, daf@illinois.edu

15:30 - 16:45 OR 3.30 pm-4.45 pm, in old money
TuTh
1320 Digital Computer Laboratory

TAs:

Mariya Vasileva mvasile2@illinois.edu

Sili Hui silihui2@illinois.edu

Daeyun Shin dshin11@illinois.edu

Ayush Jain ajain42@illinois.edu

Office Hours:

Ayush Tue, Thu - 17h00-18h00 or 5-6 pm in old money

Daeyun Thu - 11h00-13h00 or 11 am-1 pm in old money

Mariya Wed - 15h00-17h00 or 3 - 5 pm in old money

Sili Thu - 12h00-14h00 or 12 - 2 pm in old money

DAF Mon - 14h00-15h00, Fri - 14h00-15h00

or swing by my office (3310 Siebel) and see if I'm busy

Evaluation is by homeworks and a take-home final.

I will shortly post a policy on collaboration and plagiarism.

Homework 3: Due 22 Feb 2016 23h59 (Mon; midnight)

 

You should do this homework in groups of up to three; details of how to submit have been posted on piazza.

 

Details and description subject to minor changes.

Submission: Homework 3 submission details TBA.

 

You will find a training dataset to do with matching faces here. Each vector consists of measurements of attributes from two faces. If the faces belong to the same class, the first element of the vector is zero; otherwise it is one. The rest of the vector consists of (attribute values for face 1) followed by (attribute values for face 2). You will use this dataset to train and evaluate various classifiers. Be aware that this is a large dataset (about 100 MB), so don't download it repeatedly or all at the same time. Fairly soon, we will release various test datasets on Kaggle. The homework has several components.
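The vector layout described above can be unpacked as a sketch like the following. The function name, the synthetic example array, and the suggested loading call are all assumptions for illustration; adjust them to match the released file.

```python
import numpy as np

# Sketch of splitting the dataset format described above: each row is
# [label, attributes of face 1, attributes of face 2], where the label is
# 0 for a same-class pair and 1 otherwise.
def split_pairs(data):
    """Split an (N, 1 + 2k) array into labels and the two attribute blocks."""
    labels = data[:, 0].astype(int)
    k = (data.shape[1] - 1) // 2          # attributes per face
    face1 = data[:, 1:1 + k]
    face2 = data[:, 1 + k:]
    return labels, face1, face2

# Tiny synthetic stand-in for the real file; a real run might instead do
# something like data = np.loadtxt("faces_train.txt") (filename assumed).
data = np.array([[0., 1., 2., 3., 4.],
                 [1., 5., 6., 7., 8.]])
labels, face1, face2 = split_pairs(data)
```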

  1. Train and evaluate the following classifiers, using this data for training and the evaluation data we will release. You can start without the evaluation data by doing test-train splits on this training data.
    1. Linear SVM (you may use a package or your own code)
    2. Naive Bayes (you may use a package or your own code)
    3. Random Forests (I strongly advise you use a package, but...)
    This part will comprise approximately 1/3 of the grade. You should report: training error for each of these classifiers; test error for each on the evaluation sets we release on Kaggle.
  2. The original data can be found here; a clear and helpful account of how the data was made can be found here. You will find each of the attribute vectors for each face here. Notice that, in principle, you could build a classifier for each pair by matching the first attribute vector to this dataset, matching the second attribute vector to the dataset, then seeing if the two names are the same. Doing this in practice presents challenges, because there is so much data. Use an approximate nearest neighbors package (I favor FLANN or RANN) to build such a matcher. Report its accuracy on the training data and the evaluation sets we release on Kaggle. This part will comprise approximately 1/3 of the grade.
  3. Use any method you like, apart from matching as in part 2, to build a classifier that is as accurate as you can make it. Report its accuracy on the training data and on the evaluation sets we release on Kaggle. This part will comprise approximately 1/3 of the grade, weighted so that any reasonable contribution gets half the available marks; the other half is allocated based on the accuracy of your method.
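For part 1, a test-train split with the three classifiers might be sketched as below, using scikit-learn. The synthetic data here stands in for the released training file, and the feature count and sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the homework data; a real run would load the
# released training file and split it into labels and features instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Hold out a quarter of the training data for evaluation until the
# Kaggle evaluation sets are released.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

for clf in (LinearSVC(dual=False),
            GaussianNB(),
            RandomForestClassifier(n_estimators=100, random_state=0)):
    clf.fit(X_tr, y_tr)
    # Report training and held-out error, as the homework asks.
    print(type(clf).__name__,
          "train error: %.3f" % (1 - clf.score(X_tr, y_tr)),
          "test error: %.3f" % (1 - clf.score(X_te, y_te)))
```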
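The matching idea in part 2 can be sketched as follows. This uses scikit-learn's exact NearestNeighbors as a stand-in for an approximate package such as FLANN or RANN, and the gallery of per-face attribute vectors and identity labels are synthetic assumptions; a real run would index the released attribute file instead.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic gallery: one attribute vector per face, plus an identity
# label for each gallery face (both are assumptions for illustration).
rng = np.random.default_rng(1)
gallery = rng.normal(size=(1000, 16))
identities = rng.integers(0, 50, size=1000)

# Exact nearest-neighbour index; swap in FLANN/RANN for the real,
# much larger dataset.
index = NearestNeighbors(n_neighbors=1).fit(gallery)

def same_person(attrs1, attrs2):
    """Match each face against the gallery; predict 0 (same class) iff the
    two nearest gallery neighbours carry the same identity."""
    _, i1 = index.kneighbors(attrs1.reshape(1, -1))
    _, i2 = index.kneighbors(attrs2.reshape(1, -1))
    return 0 if identities[i1[0, 0]] == identities[i2[0, 0]] else 1
```

Accuracy on the training pairs can then be measured by running `same_person` over each pair's two attribute blocks and comparing against the pair labels.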