The goal of this assignment is to explore datasets and interpret the grammatical components of figures created for you from two different datasets, described below. You will NOT be submitting any code for this assignment! Instead, you will submit some brief interpretations for each of the plots below using the Homework #3 Template, which should be submitted to Canvas as a PDF document.

Below, you will see several plots, each of which was made with one of two datasets: olives or wine. These datasets can be accessed from the ds4b.materials library, as shown:

library(ds4b.materials) # need to load this ONCE per R session

# I also recommend this command which will make it easier to see the datasets interactively
# We'll learn more about this command next week!
library(tidyverse)

# Look at olives by typing olives and other functions like head(), names(), etc.
olives

# Or, see a full view with View():
View(olives)

# And you can do the same with wine:
wine
View(wine)

The olives dataset contains information about 572 olives collected from different regions across Italy. The dataset contains information about: what region the olive is from, what smaller area within each region the olive is from, and what percentage of the olive’s oil profile each fatty acid makes up.

The wine dataset contains information from a chemical analysis of three different cultivars (A, B, and C) of wine, including alcohol percentage and amounts of different chemical components. Variables include:

In order to successfully complete this assignment, you will need to spend some time using R interactively to explore these datasets!

For each plot, answer all questions below in the template. Note that bullet points are perfectly fine for your answers. In addition, there is an example plot with answers to guide you below. The types of plots application will also be extremely helpful!

Tip: If you are interested in formatting your answers to emphasize variables in a different font, I recommend you use the font “Monaco” or “Menlo” (or any other fixed-width font).

Tip 2: Make sure you write dataset variables (column names) precisely as they are in the dataset! Do not change capitalization, remove underscores, etc.

The plots to explain:


Example Plot

Please use the answers to this example to guide you on how to interpret the other plots. Notice that when referring to variables in the answer, I literally and precisely refer to the relevant column name.

Moreover, the example dataset iris is literally built into R, so you can simply type iris in any R Console to see the dataset and explore it with functions we have learned. It is highly recommended to first explore the iris dataset and ensure that you can independently interpret this plot and arrive at the same answers as are shown below. Doing this first will help you understand how to do the rest of the homework.

This plot was created from the iris dataset. In this dataset, each row is an iris flower. The dataset contains five variables:

  • Sepal.Width, the iris’ sepal width in cm
  • Sepal.Length, the iris’ sepal length in cm
  • Petal.Width, the iris’ petal width in cm
  • Petal.Length, the iris’ petal length in cm
  • Species, what iris species the flower is
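To check the variable list above for yourself, a minimal sketch of interactive exploration might look like the following. It uses only base R functions and the built-in iris dataset; the same calls should work on olives and wine once ds4b.materials is loaded, though that is an assumption about how those datasets are stored.

```r
# iris ships with base R (in the "datasets" package), so no library() call is needed.
names(iris)            # column names, exactly as they should be written in your answers
dim(iris)              # number of rows (flowers) and number of columns (variables)
head(iris)             # the first six rows
str(iris)              # the type of each column
levels(iris$Species)   # the distinct species recorded in the Species column
```

Copying column names directly from the names() output is an easy way to keep capitalization and punctuation exact.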

Plot question: Does this plot suggest a relationship between iris species and sepal length? If so, what is the relationship?
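If you want to sanity-check your reading of a plot like this one, a quick per-group summary can help. The sketch below is only one possible check, not part of what you submit; the name sepal_medians is made up for illustration.

```r
# Median sepal length within each species (one row per species).
sepal_medians <- aggregate(Sepal.Length ~ Species, data = iris, FUN = median)
sepal_medians

# Smallest and largest sepal length within each species.
tapply(iris$Sepal.Length, iris$Species, range)
```

Comparing these numbers against where each group sits in the plot is a useful way to confirm that you are reading the axes correctly.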



Plot 1

Plot question: Does this plot suggest a relationship between wine cultivar and ash percentage? If so, what is the relationship?



Plot 2

Plot question: Considering each cultivar separately, what is the relationship (indicate positive/negative/no relationship) between color and flavonoid percentage? Provide a separate answer for each of A, B, and C cultivars.



Plot 3

Plot question: What is the shape of the malic acid percentage distribution for each cultivar? Specifically, for each of A, B, C cultivars, answer if the distribution is unimodal, bimodal, or multimodal and whether it is either roughly symmetric or has some skew (no further details needed). (Some of these answers about unimodal/bimodal/multimodal are a little ambiguous, and that’s ok!)



Plot 4

Plot question: This plot is extremely similar to the previous plot, except with different binning and other visual features. Viewed with this binning specification, now what is the shape of the malic acid percentage distribution for each cultivar? Specifically, for each of A, B, C cultivars, answer if the distribution is unimodal, bimodal, or multimodal. If it’s unimodal, also indicate whether it is either roughly symmetric or has some skew (no further details are needed).

Plot 5

Plot question: What is the shape of the oleic acid percentage distribution for each region? Specifically, for each of the regions shown, answer if the distribution is unimodal, bimodal, or multimodal. If it’s unimodal, also indicate whether it is either roughly symmetric or has some skew (no further details are needed).



Plot 6

Plot question: Do the stearic acid percentage distributions across the three regions appear to have the same or different means and standard deviations? Compare these two summary statistics qualitatively (i.e. no need to calculate anything formally) across all three regions.



Plot 7

Plot question (two parts): From which area are there the most individual samples in this dataset? This question is a good example of a question that can be answered from the plot, even though this question isn’t the specific motivating question for the plot itself. Therefore, also answer here: What do you think the primary motivating question about the data is for this visualization?



Plot 8

Plot question: What is the y-axis variable in this plot? To answer this question you need to explore the data. (For example, look at the range of values on the y-axis and use that to your advantage!)
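One way to use the axis range is to compare it against the range of each numeric column in the dataset. The sketch below shows the idea on the built-in iris dataset as a stand-in; swapping in olives or wine is left to you and assumes ds4b.materials is loaded. The names numeric_columns and column_ranges are made up for illustration.

```r
# Keep only the numeric columns (iris is a stand-in for the homework dataset).
numeric_columns <- Filter(is.numeric, iris)

# Compute the minimum and maximum of each numeric column.
column_ranges <- sapply(numeric_columns, range)
rownames(column_ranges) <- c("min", "max")
column_ranges
```

A column whose minimum and maximum roughly match the limits of the plot's y-axis is a good candidate for the plotted variable.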