daf@uiuc.edu, daf@illinois.edu
Tue 16h00-17h15 1131 Siebel
Office Hours: TBA
or swing by my office (3310 Siebel) and see if I'm busy
Evaluation is by: Participation, Homework
I will shortly post a policy on collaboration and plagiarism
Doing Data Science, Cathy O'Neil and Rachel Schutt (O'Reilly Media), 2013
Datasets:
- UC Irvine Machine Learning Data Repository
Domain examples:
Engineering Problem: Construction Activity Analysis for Productivity Improvement
Case: Earthmoving Operations including excavators and dump trucks
Technical solution: Supervised Learning (SVM)
Data (65 GB of data, password protected; password: BigDataIllinois2014)
Engineering Problem: Predicting the Number of Weather Impact Days for Construction using Hydrological Data
Case: Ikenberry Dining and Residential Hall Construction Projects – 81 impact days during 2008-2010
Technical Solution: K-Means Clustering Techniques
Data (20 txt files that need to be parsed according to the following description, password protected; password: BigDataIllinois2014)
Engineering Problem: Understanding activities using smartphone data
Case: Classify accelerometer data into activities, regress calories against activity labels
Technical Solution: Vector quantization, classification, regression
Data is on this website
Engineering Problem: Identifying a phone from its accelerometer
Case: Look at accelerometer signatures and tell which phone this is
Technical Solution: Vector quantization, classification
Data is on this website
Engineering Problem: Understand and predict user behavior
Case: Yelp reviews of restaurants for a large urban area - what do reviewers do?
Technical Solution: Visualization, summaries, vector quantization, classification
Data is on this website
Engineering Problem: Understand where aerosol particles are coming from
Case: Air pollution data from Paris
Technical Solution: Clustering, vector quantization, classification
Data is on this website
Engineering Problem: Who does well in MOOCs?
Case: Completion and grade data for a UIUC MOOC
Technical Solution: Regression
Data is on this website
Topic model and text tools:
- RTextTools is a collection of tools for text data using R. Notice the variety of examples and tutorials there.
- OpenNLP is an extensive set of text processing tools in Java.
- Ingo Feinerer has produced a text mining package for R, called tm, which includes tokenization tools; you can find it here.
- David Blei’s research group has produced a great deal of topic modelling software. A guide can be found here; there is software in C, C++, Python and R.
- Mark Steyvers and Tom Griffiths have produced a topic modelling toolbox for Matlab. The code, and a series of example scripts for a variety of topic models, can be found here.
- Bettina Grün and Kurt Hornik have produced an interface to Blei’s C/C++ code for R, called topicmodels. There is a nice tutorial on using this package together with RTextTools here.
Slides:
- Introductory Slides
- Data LifeCycle Slides
- Visualization Slides
- Linear Regression (based mainly on notes)
- Texture and nearest neighbor regression slides
- Sean Massung's notes on the Yelp data
Notes:
- Visualization and Descriptive Statistics
- Big block of notes
- Visualization, Descriptive statistics, Classification with logistic regression and SVMs, Trees and Forests (3 Feb version, PDF)
- R Code that I used for the Iris figures
- R Code that I used for the fire figures
- R Code that I used for the heart figures
- Revised big block of notes
- as above, but with some corrections, nearest neighbors, some R code in sections (10 Feb version, PDF)
- Revised big block of notes
- as above, but with some corrections, nearest neighbors, some R code in sections, regression chapter (13 Feb version, PDF)
- Revised again
- as above, corrections, non-parametric regression, more R code, clustering (11 Mar version, PDF)
- Revised again
- as above, corrections, non-parametric regression, more R code, much more clustering (31 Mar version, PDF)
- Revised yet again
- now with two more fairly detailed vector quantization examples (14 April version, PDF)
- Major revisions
- new chapter on topic models; incorporated a bunch of probability stuff (22 April version, PDF)
R Resources:
- Obtain R from here
- A tutorial
- but there are lots of these; search for 'R tutorial'
- Good place to look up ggplot2 weirdness