CS-199 Big Data Homework 2

Building several spam filters:

In this homework, you will build several spam filters of varying degrees of complexity. I have tried to structure the homework so that you can do it with minimal programming skills, but more programming will produce a more rewarding exercise. You may do the homework in groups of up to four, though I expect more ambitious work from larger groups. It is due 4 Mar 2013. Submit a PDF by email, naming all group members, explaining your results and how you got them on each step, and presenting all relevant details. Please ensure the email has CS199-BD in the subject.

Step 1: Comparing two classifiers on email spam:

Find the SPAMBASE dataset at the UC Irvine machine learning repository. This dataset has been preprocessed so that all attributes are numeric. For this dataset, compute cross-validated estimates of accuracy for (a) a randomized decision forest and (b) a nearest-neighbor classifier. You should be able to do this very easily by adapting the example R code in the notes. Use at least five folds (i.e. five different random splits of the data into training and test sets). In each split, use 90% of the data for training and 10% for testing. How accurate is each classifier? Is one better than the other?
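As a rough illustration of the evaluation loop described above, here is a minimal sketch in Python (not the R code from the notes). It covers only repeated random 90/10 splits with a 1-nearest-neighbor classifier; a decision forest would plug into the same loop. The data is a synthetic placeholder, not SPAMBASE, and all function names are made up for this example.

```python
import random

def split_indices(n, train_frac, rng):
    """Shuffle 0..n-1 and cut the result into (train, test) index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]

def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x
    (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]

def cross_validate(X, y, folds=5, train_frac=0.9, seed=0):
    """Average test accuracy over `folds` independent random splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(folds):
        train, test = split_indices(len(X), train_frac, rng)
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds, accuracies

def make_toy_data(n_per_class=50, seed=1):
    """Synthetic placeholder data: two well-separated 2-D clusters."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, 0.0), (1, 10.0)):
        for _ in range(n_per_class):
            X.append([center + rng.uniform(-1, 1), center + rng.uniform(-1, 1)])
            y.append(label)
    return X, y

X, y = make_toy_data()
mean_accuracy, per_fold = cross_validate(X, y)
print("per-fold accuracy:", per_fold)
print("mean accuracy:", mean_accuracy)
```

Reporting the per-fold accuracies alongside the mean gives some sense of how much the estimate varies from split to split, which helps when deciding whether one classifier is really better than the other.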

Step 2: Filtering SMS spam:

Now get the SMS Spam Collection from the UC Irvine machine learning repository. This is a collection of SMS messages, some of which are spam. The nuisance with this collection is that it consists of actual text messages, rather than numerical features, so we need to get it into a form that can be used to train and test a classifier. We will do so by forming a collection of tokens that are commonly used by spam messages and another collection of tokens that are commonly used by non-spam messages. Together these give a vocabulary (the set of distinct tokens). For each message, build a feature vector whose length is the size of the vocabulary: put in a 1 if the token is present in the message, and a 0 if it is absent. Now use this feature vector to classify. Do not write your own tokenizer, unless you have a lot of spare time. Instead, start with the R text-mining package tm. Which classifier works best here? Can you beat the performance of simply classifying everything as non-spam?
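The feature construction and the "everything is non-spam" baseline can be sketched as follows, in Python rather than R. This is only an illustration under some simplifying assumptions: the regular-expression tokenizer is a crude stand-in for tm, the vocabulary is taken from all messages rather than from separate spam and non-spam token collections, and the five messages and their labels are invented for the example.

```python
import re

def tokenize(message):
    """Lowercase and keep runs of letters and digits (a crude stand-in tokenizer)."""
    return re.findall(r"[a-z0-9]+", message.lower())

def build_vocabulary(messages):
    """Sorted list of the distinct tokens seen in any message."""
    return sorted({tok for msg in messages for tok in tokenize(msg)})

def to_feature_vector(message, vocabulary):
    """One entry per vocabulary token: 1 if it occurs in the message, else 0."""
    present = set(tokenize(message))
    return [1 if tok in present else 0 for tok in vocabulary]

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common = max(set(labels), key=labels.count)
    return labels.count(most_common) / len(labels)

# Invented toy messages, not taken from the SMS Spam Collection.
messages = [
    "WIN a FREE prize now",
    "are you free for lunch",
    "lunch at noon?",
    "call me later",
    "free prize, call now",
]
labels = ["spam", "ham", "ham", "ham", "spam"]

vocabulary = build_vocabulary(messages)
features = [to_feature_vector(msg, vocabulary) for msg in messages]

print("vocabulary size:", len(vocabulary))
print("first feature vector:", features[0])
print("baseline accuracy:", majority_baseline_accuracy(labels))
```

Any classifier trained on these vectors needs to exceed the baseline accuracy to be worth using. On a collection where most messages are non-spam, that baseline can be surprisingly high.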

Step 3: Playing with tokenization:

Tokenization has a significant effect on performance. Can you improve the performance of the SMS spam filter by playing with the tokenizer, for example by increasing the vocabulary size? Can you beat the performance of the methods described in the paper below?
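To see how much tokenizer choices can change the vocabulary, here is a small Python sketch comparing two hypothetical tokenizers on two invented messages. Neither corresponds to a particular tm setting; they only illustrate the kind of decision involved (case folding, and whether digits and punctuation count as tokens).

```python
import re

def tokenize_plain(message):
    """Lowercased alphabetic words only; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", message.lower())

def tokenize_rich(message):
    """Keep case, and treat digit runs and single punctuation marks as tokens."""
    return re.findall(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]", message)

def vocabulary(messages, tokenizer):
    """Sorted list of distinct tokens produced by `tokenizer` over all messages."""
    return sorted({tok for msg in messages for tok in tokenizer(msg)})

# Invented example messages.
messages = ["WIN $1000 cash!!!", "win or lose, see you at 8"]

plain_vocab = vocabulary(messages, tokenize_plain)
rich_vocab = vocabulary(messages, tokenize_rich)

print("plain:", len(plain_vocab), plain_vocab)
print("rich: ", len(rich_vocab), rich_vocab)
```

The richer tokenizer keeps signals such as capitalized words, currency symbols, and exclamation marks that may be informative for spam, at the cost of a larger and sparser feature vector. Whether that trade helps is an empirical question for the cross-validated accuracy to answer.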

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. "Contributions to the Study of SMS Spam Filtering: New Collection and Results." Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

Resources:

Automated spam filtering using classifiers has been well studied. There are a variety of informative resources.

J.M. Gómez Hidalgo's web page
Rather a good spam filter, using Naive Bayes, which I didn't teach (bogofilter)
Enron Email Dataset - a huge dataset of real email. There's some information on how to use it in the talk "Introducing the Enron Corpus" by Bryan Klimt and Yiming Yang at the First Conference on Email and Anti-Spam (CEAS).
Another, rather cleaned up version of Enron