Office Hours Time: TBA, Location: TBA
You must do this homework individually
Submission: Course submission policy is as described in the homework
You may use any programming language that amuses you for this homework. You may use a PCA package if you so choose, but remember that you need to understand what comes out of the package to get the homework right!
The goal of this homework is to use PCA to smooth the noise in the provided data. At https://www.kaggle.com/t/e9337b95218e48a1be69a69e3826688a, you will find five noisy versions of the Iris dataset, and a noiseless version.
For each of the 5 noisy datasets, you should compute the principal components in two ways. In the first, you will use the mean and covariance matrix of the noiseless dataset. In the second, you will use the mean and covariance matrix of the respective noisy dataset. Based on these components, you should compute the mean squared error between the noiseless version of the dataset and each of the PCA representations using 0 (i.e. every data item is represented by the mean), 1, 2, 3, and 4 principal components.
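The computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the function names `pca_reconstruct` and `mse` are hypothetical, the data is a synthetic stand-in rather than the provided Iris files, and the error is assumed to be the squared error summed over each row and averaged over rows (check which convention the course expects). The labels `N` and `c` are assumed to mean "noiseless statistics" and "the noisy dataset's own statistics" respectively.

```python
import numpy as np

def pca_reconstruct(data, mean, cov, n_components):
    """Represent `data` with the top `n_components` principal components of `cov`."""
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric input; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]             # reorder by decreasing variance
    basis = eigvecs[:, order[:n_components]]      # shape (n_features, n_components)
    # With n_components == 0 the projection is zero, so every row becomes the mean.
    return mean + (data - mean) @ basis @ basis.T

def mse(reference, estimate):
    """Assumed convention: squared error summed per row, then averaged over rows."""
    return float(np.mean(np.sum((reference - estimate) ** 2, axis=1)))

# Synthetic stand-in data (NOT the Iris files): 150 items, 4 correlated features.
rng = np.random.default_rng(0)
noiseless = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
noisy = noiseless + 0.3 * rng.normal(size=noiseless.shape)

# Assumed meaning of the labels: "N" = noiseless mean/covariance,
# "c" = mean/covariance of the noisy dataset itself.
stats = {
    "N": (noiseless.mean(axis=0), np.cov(noiseless, rowvar=False)),
    "c": (noisy.mean(axis=0), np.cov(noisy, rowvar=False)),
}

# One table row: errors for 0..4 components under each choice of statistics.
row = {
    f"{k}{tag}": mse(noiseless, pca_reconstruct(noisy, mean, cov, k))
    for tag, (mean, cov) in stats.items()
    for k in range(5)
}
```

With all 4 components the basis spans the whole space, so the reconstruction is just the noisy data again and no smoothing happens. That makes the 4-component entries a useful sanity check on an implementation.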
You should produce:
Number of PCs -> | 0N | 1N | 2N | 3N | 4N | 0c | 1c | 2c | 3c | 4c |
---|---|---|---|---|---|---|---|---|---|---|
Dataset I | | | | | | | | | | |
Dataset II | | | | | | | | | | |
Dataset III | | | | | | | | | | |
Dataset IV | | | | | | | | | | |
Dataset V | | | | | | | | | | |
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"
Each following line is your reconstruction of a data item, in order (so first data item first, etc).
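One way to produce a file in this shape is sketched below, using Python's standard `csv` module. The helper name `reconstruction_to_csv` is hypothetical, and the two data rows are made-up illustrative values, not results. The sketch assumes the header fields should be quoted exactly as shown above and the numeric fields left unquoted.

```python
import csv
import io

COLUMNS = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]

def reconstruction_to_csv(recon_rows):
    """Return CSV text: quoted header line, then one reconstructed item per line."""
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC quotes the string header fields and leaves floats bare.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in recon_rows:
        writer.writerow([float(v) for v in row])  # keep the original item order
    return buffer.getvalue()

# Made-up values for illustration only (not a real reconstruction).
example = reconstruction_to_csv([[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]])
```

Writing to a string first makes it easy to inspect the output before saving it to a file.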
"0N, 1N, 2N, 3N, 4N, 0c, 1c, 2c, 3c, 4c"
The following lines should be the rows of the table, in order, and contain only numbers. You should provide your numbers to at least three digits.
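A minimal sketch of writing the numbers file follows. It rests on several assumptions that should be checked against the course's submission instructions: the header is written without the surrounding quotation marks, fields are separated by a comma and a space as in the header, and "at least three digits" is satisfied by printing six decimal places. The helper name `format_table` is hypothetical and the rows are placeholder zeros, not results.

```python
HEADER = "0N, 1N, 2N, 3N, 4N, 0c, 1c, 2c, 3c, 4c"

def format_table(rows):
    """Return the header line followed by one numeric line per dataset."""
    lines = [HEADER]
    for row in rows:
        if len(row) != 10:
            raise ValueError("each row needs 10 values, one per table column")
        # Six decimal places: an assumed reading of "at least three digits".
        lines.append(", ".join(f"{value:.6f}" for value in row))
    return "\n".join(lines) + "\n"

# Placeholder values only: five datasets, ten columns each.
placeholder_rows = [[0.0] * 10 for _ in range(5)]
text = format_table(placeholder_rows)
```

The length check guards against silently writing a row with a missing column, which would shift every later value under the wrong heading.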