Farzad Kamalabadi, ECE

Romit Roy Choudhury, ECE

Richard Sowers, ISE

Matthew West, ME

ChengXiang Zhai, CS

daf@uiuc.edu, daf@illinois.edu

Tue 16h00-17h15 1131 Siebel

Office Hours: TBA

or swing by my office (3310 Siebel) and see if I'm busy

Evaluation is by: Participation, Homework

I will shortly post a policy on collaboration and plagiarism

Doing Data Science, Cathy O'Neil and Rachel Schutt (O'Reilly Media), 2013

- Homework 1 (due 10 Feb; you can do this in groups of up to three): Find an interesting dataset on the web. Each data item should have at least five entries, and there should be at least 1000 data items. You might find the UC Irvine machine learning data repository helpful. Produce at least three figures that investigate and explain a substantive point about this data. Email a PDF of these figures with a short explanation of what is going on and why you believe this, or a URL, to daf@uiuc.edu with 199-HW-1 in the subject line, by 23h59 on 10 Feb. You may use any code base, programming language, or tools that you care to.
- Homework 2 is big enough to have its own page
- Homework 3 is also big enough to have its own page
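
The size requirements for Homework 1 (at least 1000 data items, each with at least five entries) are easy to check before you start plotting. Here is a minimal Python sketch, assuming the dataset is a plain CSV file with one data item per row. The function names `check_dataset` and `check_csv` are illustrative and not part of any course code; any language or tool is fine for the homework itself.

```python
import csv

def check_dataset(rows, min_items=1000, min_entries=5):
    """Return (ok, n_items, narrowest_row) for an iterable of parsed rows."""
    n_items = 0
    narrowest = None
    for row in rows:
        n_items += 1
        width = len(row)
        narrowest = width if narrowest is None else min(narrowest, width)
    narrowest = narrowest or 0
    ok = n_items >= min_items and narrowest >= min_entries
    return ok, n_items, narrowest

def check_csv(path, has_header=True, **kwargs):
    """Apply check_dataset to a CSV file (assumed: comma-separated, one item per row)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # skip the header row, if there is one
        return check_dataset(reader, **kwargs)
```

Calling `check_csv("mydata.csv")` on a hypothetical file returns a tuple whose first element says whether both thresholds are met. The other two elements report the number of items and the width of the narrowest row.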

## Datasets:

- UC Irvine Machine Learning Data Repository

## Domain examples:

Engineering Problem: Construction Activity Analysis for Productivity Improvement

Case: Earthmoving operations, including excavators and dump trucks

Technical Solution: Supervised Learning (SVM)

Data (65 GB of data, password protected; password: BigDataIllinois2014)

Engineering Problem: Predicting the Number of Weather Impact Days for Construction using Hydrological Data

Case: Ikenberry Dining and Residential Hall Construction Projects, 81 impact days during 2008-2010

Technical Solution: K-Means Clustering Techniques

Data (20 txt files that need to be parsed according to the following description; password protected; password: BigDataIllinois2014)

Engineering Problem: Understanding activities using smartphone data

Case: Classify accelerometer data into activities, regress calories against activity labels

Technical Solution: Vector quantization, classification, regression

Data is on this website

Engineering Problem: Identifying a phone from its accelerometer

Case: Look at accelerometer signatures and tell which phone this is

Technical Solution: Vector quantization, classification

Data is on this website

Engineering Problem: Understand and predict user behavior

Case: Yelp reviews of restaurants for a large urban area - what do reviewers do?

Technical Solution: Visualization, summaries, vector quantization, classification

Data is on this website

Engineering Problem: Understand where aerosol particles are coming from

Case: Air pollution data from Paris

Technical Solution: Clustering, vector quantization, classification

Data is on this website

Engineering Problem: Who does well in MOOCs

Case: Completion and grade data for a UIUC MOOC

Technical Solution: Regression

Data is on this website
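
Several of the examples above list vector quantization as part of the technical solution. Here is a rough Python sketch of one common form of it for a 1-D signal such as a single accelerometer axis: cut the signal into fixed-length windows, cluster the windows with k-means to build a codebook, then describe each signal by a histogram of which codewords its windows fall nearest to. The function names, window size, step, and number of codewords are all assumptions chosen for illustration, not the settings used in the course notes.

```python
import random

def windows(signal, size, step):
    """Cut a 1-D signal into fixed-length windows, advancing by `step` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, w):
    """Index of the codeword closest to window w."""
    return min(range(len(codebook)), key=lambda j: sq_dist(codebook[j], w))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers (the codebook)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def vq_histogram(signal, codebook, size, step):
    """Normalized histogram of codeword usage for one signal."""
    counts = [0] * len(codebook)
    for w in windows(signal, size, step):
        counts[nearest(codebook, w)] += 1
    total = sum(counts) or 1  # avoid dividing by zero for very short signals
    return [c / total for c in counts]

# Toy usage on a synthetic signal (made-up data, for illustration only).
signal = [0.0] * 20 + [1.0] * 20
codebook = kmeans(windows(signal, 4, 4), k=2)
features = vq_histogram(signal, codebook, 4, 4)
```

The resulting fixed-length histogram can then be handed to a classifier or a regression, whatever the length of the original signal.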

## Topic model and text tools:

- RTextTools is a collection of tools for text data using R. Notice the variety of examples and tutorials there.
- OpenNLP is an extensive set of text processing tools in Java.
- Ingo Feinerer has produced a tokenizer in R, called tm; you can find it here.
- David Blei’s research group has produced a great deal of topic modelling software. A guide can be found here; there is software in C, C++, Python and R.
- Mark Steyvers and Tom Griffiths have produced a topic modelling toolbox for Matlab. The code, and a series of example scripts for a variety of topic models, can be found here.
- Bettina Grün and Kurt Hornik have produced an interface to Blei’s C/C++ code for R, called topicmodels. There is a nice tutorial on using this package together with RTextTools here.

## Slides:

- Introductory Slides
- Data LifeCycle Slides
- Visualization Slides
- Linear Regression (based mainly on notes)
- Texture and nearest neighbor regression slides
- Sean Massung's notes on the Yelp data

## Notes:

- Visualization and Descriptive Statistics
- Big block of notes
  - Visualization, Descriptive statistics, Classification with logistic regression and SVMs, Trees and Forests (3 Feb version, PDF)
  - R Code that I used for the Iris figures
  - R Code that I used for the fire figures
  - R Code that I used for the heart figures
- Revised big block of notes
  - as above, but with some corrections, nearest neighbors, some R code in sections (10 Feb version, PDF)
- Revised big block of notes
  - as above, but with some corrections, nearest neighbors, some R code in sections, regression chapter (13 Feb version, PDF)
- Revised again
  - as above, corrections, non-parametric regression, more R code, clustering (11 Mar version, PDF)
- Revised again
  - as above, corrections, non-parametric regression, more R code, much more clustering (31 Mar version, PDF)
- Revised yet again
  - now with two more fairly detailed vector quantization examples (14 April version, PDF)
- Major revisions
  - new chapter on topic models; incorporated a bunch of probability stuff (22 April version, PDF)

## R Resources:

- Obtain R from here
- A tutorial
  - but there are lots of these; google 'r tutorial'
- Good place to look up ggplot2 weirdness