Instructions: Homework #3

The goal of this assignment is to explore datasets and interpret the grammatical components of figures created for you from two different datasets, described below. You will NOT be submitting any code for this assignment! Instead, you will submit some brief interpretations for each of the plots below using the Homework #3 Template, which should be submitted to Canvas as a PDF document.

Below, you will see several plots, each of which was made with one of two datasets: olives or wine. These datasets can be accessed from the ds4b.materials library, as shown:

library(ds4b.materials) # need to load this ONCE per R session

# I also recommend this command which will make it easier to see the datasets interactively
# We'll learn more about this command next week!
library(tidyverse)

# Look at olives by typing olives and other functions like head(), names(), etc.
olives

# Or, see a full view with View():
View(olives)

# And you can do the same with wine:
wine
View(wine)

The olives dataset contains information about 572 olives collected from different regions across Italy. The dataset contains information about: what region the olive is from, what smaller area within each region the is olive from, and what percentages different fatty acids comprise in the olive’s oil profile.

The wine dataset contains information from a chemical analysis of three different cultivars (A, B, and C) of wine, including alcohol percentage and amounts of different chemical components. Variables include:

cultivar: The wine cultivar (A, B, or C)
alcohol: The alcohol percentage of the wine
malic_acid: The percentage of the wine that is malic acid
ash: The percentage of the wine that is ash (it’s a wine thing…)
magnesium: The percentage of the wine that is magnesium
total_phenol: The percentage of the wine that is phenols
flavonoids: The percentage of the wine that is flavonoids
nonflav_phenols: The percentage of the wine that is non-flavonoid phenols
color: The color intensity of the wine, measured numerically

In order to successfully complete this assignment, you will need to spend some time using R interactively to explore these datasets!

For each plot, answer all questions below in the template. Note that bullet points are perfectly fine for your answers. In addition, there is an example plot with answers to guide you below. The types of plots application will also be extremely helpful!

What type of plot is shown? Don’t get too fancy here! All plots are one of: histogram, density plot, boxplot, violin plot, strip plot, barplot, or scatterplot.
Which dataset is the plot made from?
Which variables (ie, the actual column names) in the dataset are on the X and Y axis, if any? If an axis does NOT show a variable but instead contains a statistical transformation of the data, indicate that in your answer instead.
- For axes with actual variables (not transformations!), what type of variable is on each axis? Answer either numeric or categorical - you do not need to go more in-depth than that.
What visual features (including colors/fills, shapes of points, and/or sizes of points) are used to style the plot? Are these aesthetics that map to the data, or are they “just” visual features? For any aesthetic mappings, clearly indicate what variables (actual column name) are mapped to what visual feature.
Does the plot contain facets across a variable (column), or is it shown in a single panel? If faceting is present, what variable (actual column name) is the faceting along?
Briefly answer the given question prompting you to interpret each plot in 1-2 sentences.

Tip: If you are interested in formatting your answers to emphasize variables in a different font, I recommend you use the font “Monaco” or “Menlo” (or any other fixed-width font).

Tip 2: Make sure you write dataset variables (column names) precisely as they are in the dataset! Do not change capitalization, remove underscores, etc.

The plots to explain:

Example Plot

Please use the answers to this example guide you on how to interpret the other plots. Notice that when referring to variables in the answer, I literally and precisely refer to the relevant column name.

Moreover, the example dataset iris is literally built into R, so you can simply type iris in any R Console to see the dataset and explore it with functions we have learned. It is highly recommended to first explore the iris dataset and ensure that you can independently interpret this plot and arrive at the same answers as are shown below. Doing this first will help you understand how to do the rest of the homework.

This plot was created from the iris dataset. In this dataset, each row is an iris flower. The dataset contains five variables:

Sepal.Width, the iris’ sepal width in cm
Sepal.Length, the iris’ sepal length in cm
Petal.Width, the iris’ petal width in cm
Petal.Width, the iris’ length length in cm
Species, what iris species the flower is

Plot question: Does this plot suggest a relationship between iris species and sepal length? If so, what is the relationship?

Plot 1

Plot question: Does this plot suggest a relationship between wine cultivar and ash percentage? If so, what is the relationship?

Plot 2

Plot question: Considering each cultivar separately, what is the relationship (indicate positive/negative/no relationship) between color and flavonoid percentage? Provide a separate answer for each of A, B, and C cultivars.

Plot 3

Plot question: What is the shape of the malic acid percentage distribution for each cultivar? Specifically, for each of A, B, C cultivars, answer if the distribution is unimodal, bimodal, or multimodal and whether it is either roughly symmetric or has some skew (no further details needed). (Some of these answers about unimodal/bimodal/multimodal are a little ambiguous, and that’s ok!)

Plot 4

Plot question: This plot is extremely similar to the previous plot, except with different binning and other visual features. Viewed with this binning specification, now what is the shape of the malic acid percentage distribution for each cultivar? Specifically, for each of A, B, C cultivars, answer if the distribution is unimodal, bimodal, or multimodal. If it’s unimodal, also indicate whether it is either roughly symmetric or has some skew (no further details are needed).

Plot 5

Plot question: What is the shape of the oleic acid percentage distribution for each region? Specifically, for each of the regions shown, answer if the distribution is unimodal, bimodal, or multimodal. If it’s unimodal, also indicate whether it is either roughly symmetric or has some skew (no further details are needed).

Plot 6

Plot question: Do the stearic acid percentage distributions across the three regions appear to have the same or different means and standard deviations? Compare these two summary statistics qualitatively (i.e. no need to calculate anything formally) all three regions.

Plot 7

Plot question (two parts): From which area are there the most individual samples in this dataset? This question is a good example of a question that can be answered from the plot, even though this question isn’t the specific motivating question for the plot itself. Therefore, also answer here: What do you think the primary motivating question about the data is for this visualization?

Plot 8

Plot Question: What is the y-axis variable in this plot? To answer this question you need to explore the data (for example, see the range of values on the y-axis? use that to your advantage!)