Enter your name here
For this assignment, we will use the wine.csv dataset to attempt to recover Cultivar clusters. Therefore, you should remove the Cultivar variable, which is categorical, from your data (just as “Species” was removed from the iris analysis in lecture) before running PCA or clustering. To ensure that Cultivar is plotted as a categorical variable, refer to the column as factor(Cultivar) in your ggplot() code. Note that you may provide written answers in bullet-point form, but the bullets must be formatted correctly (knit the document to check!).
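As a reminder of the two mechanics mentioned above, here is a minimal sketch on the built-in mtcars data rather than on wine.csv. It assumes the dplyr and ggplot2 packages are installed; mtcars and its cyl column are stand-ins chosen only because cyl is a category stored as a number.

```r
library(dplyr)
library(ggplot2)

# cyl is stored as a number but is really a category,
# so it plays the role of a numerically coded grouping column.
numeric_only <- mtcars %>% select(-cyl)  # drop the grouping column before PCA or clustering

# factor(cyl) gives a discrete color scale instead of a continuous gradient
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()
```

Without factor(), ggplot would treat the column as continuous and shade the points along a gradient, which makes groups much harder to read.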
Question 1 (10 points). First, visualize the wine dataset by creating three scatterplots where points are colored by Cultivar. You may choose which variables are used for each plot, and variables can be reused between plots (just don’t swap X and Y and call it a new plot). Then, state which plots, if any, show Cultivar groupings and which do not. (Note: The point of this question is to see the limitations of considering only two numeric variables at a time for high-dimensional data.)
### Code goes here
Answer goes here.
Question 2 (20 points). Perform a Principal Components Analysis to ascertain whether the wine.csv dataset can recover cultivar groups. Visualize your PCA with the following plots:
### Code goes here
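For reference, the general shape of a PCA workflow can be sketched on the iris data from lecture. This is an illustration under assumptions, not the expected solution: it uses prcomp() with scaling turned on, and the object names (iris_pca, scores, loadings) are made up for the example.

```r
library(dplyr)
library(ggplot2)

# Run PCA on the numeric columns only; scale. = TRUE standardizes each variable
iris_pca <- iris %>%
  select(-Species) %>%
  prcomp(scale. = TRUE)

# Scores: one row per observation, re-attached to the held-out labels
scores <- as.data.frame(iris_pca$x) %>%
  mutate(Species = iris$Species)

ggplot(scores, aes(x = PC1, y = PC2, color = Species)) +
  geom_point()

# Loadings: how strongly each original variable contributes to each PC
loadings <- as.data.frame(iris_pca$rotation)
loadings$variable <- rownames(loadings)

ggplot(loadings, aes(x = 0, y = 0, xend = PC1, yend = PC2)) +
  geom_segment(arrow = grid::arrow(length = grid::unit(0.2, "cm"))) +
  geom_text(aes(x = PC1, y = PC2, label = variable), vjust = -0.5)
```

In the loadings plot, arrows pointing in opposite directions indicate opposite loadings, and arrows at roughly right angles indicate variables that are close to orthogonal in PC1-2 space.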
Question 3 (15 points). Using the visualizations created in Question 2, interpret your PCA based on the following prompts.
Question 3a: Interpret your scatterplots: Do any of PC1, PC2, or PC3 effectively separate cultivars? State which PC(s), if any, discriminate cultivar, and which PC(s) do not. Do you find that PCA was more effective at separating clusters than your original two-variable scatterplots were?
Answer goes here.
Question 3b: Interpret your PC loadings: Which numeric variables are primarily discriminated by PC1? Which are primarily discriminated by PC2? Do any variables show exact opposite loadings, and if so which? Do you notice any orthogonal variables (or sets of variables) in PC1-2 space?
Answer goes here.
Question 4 (20 points). Perform k-means clustering on the wine data to ascertain whether the clusters recover cultivar groups. Do this in two steps: i) Use the elbow method to visually determine how many clusters to make, and ii) Perform the k-means clustering with the k determined from the elbow method three different times. You should therefore end up creating 3 different variables of k-means data (all using the same k). Be sure to include a comment in your code indicating which k has been chosen based on the elbow.
### Code goes here
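The two steps can be sketched on the iris data as follows. This is only an illustration under assumptions: it relies on kmeans() from base R and augment() from the broom package, the range of k values tried (1 to 8) is arbitrary, and the names elbow, chosen_k, and run1 to run3 are made up for the example.

```r
library(dplyr)
library(ggplot2)
library(broom)

iris_numeric <- iris %>% select(-Species)

set.seed(1)  # k-means starts from random centers, so results vary between runs

# Elbow method: total within-cluster sum of squares for k = 1..8
elbow <- data.frame(k = 1:8)
elbow$tot_withinss <- sapply(elbow$k, function(k) {
  kmeans(iris_numeric, centers = k, nstart = 10)$tot.withinss
})

ggplot(elbow, aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point()

# Reading the bend is a judgment call; 3 is a common reading for iris
chosen_k <- 3

# Three independent runs with the same k; augment() adds a .cluster column
run1 <- augment(kmeans(iris_numeric, centers = chosen_k), iris)
run2 <- augment(kmeans(iris_numeric, centers = chosen_k), iris)
run3 <- augment(kmeans(iris_numeric, centers = chosen_k), iris)
```

Because each run starts from different random centers, the cluster numbering (and sometimes the cluster membership) can differ between run1, run2, and run3.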
Question 5 (20 points). Visualize your k-means results: Make a faceted, dodged bar plot, where each facet is a k-means result, to visualize the distributions of clusters across cultivars. Use .cluster for the x axis, and fill by Cultivar (note that you will need to tell ggplot that these are factors). Based on this figure, address these questions: i) Are the k-means results consistent with each other, or do they differ, in terms of the relationship between cluster and cultivar? (Hint: the cluster index 1 to k is totally arbitrary!), and ii) Do any cultivars seem to belong to, or not belong to, specific cluster(s)?
### Code goes here
Answer goes here.
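One possible way to build this kind of figure, again sketched on iris rather than wine, is to stack the k-means results into one data frame with a column naming the run and then facet on that column. The helper one_run() and the run column are hypothetical names introduced for this example; k = 3 is assumed.

```r
library(dplyr)
library(ggplot2)
library(broom)

iris_numeric <- iris %>% select(-Species)
set.seed(1)

# Hypothetical helper: one k-means run, labelled so it can become a facet
one_run <- function(label) {
  augment(kmeans(iris_numeric, centers = 3), iris) %>%
    mutate(run = label)
}

all_runs <- bind_rows(one_run("run 1"), one_run("run 2"), one_run("run 3"))

# Dodged bars: cluster on the x axis, one bar per species within each cluster
ggplot(all_runs, aes(x = factor(.cluster), fill = factor(Species))) +
  geom_bar(position = "dodge") +
  facet_wrap(~ run)
```

When comparing facets, match clusters by which group dominates them rather than by their index, since the same group of points may be called cluster 1 in one run and cluster 3 in another.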
Question 6 (15 points). Conduct a hypothesis test for one of your k-means results to determine whether there is an association between cluster and cultivar. Give your null/alternative hypotheses, results, and conclusions.
### Code goes here
Answer goes here.
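The question does not name a specific test. A chi-squared test of independence on the cluster-by-group contingency table is one common choice, and the sketch below assumes that choice, using iris in place of wine. The null hypothesis is that cluster and species are independent; the alternative is that they are associated.

```r
set.seed(1)

# One k-means result on the numeric columns of iris (k = 3 assumed)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Contingency table of cluster assignment against the held-out labels
counts <- table(cluster = km$cluster, species = iris$Species)
counts

# Chi-squared test of independence between cluster and species
test <- chisq.test(counts)
test
```

If chisq.test() warns that the approximation may be incorrect because some expected counts are small, fisher.test() on the same table is a possible alternative.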