The goal of this assignment is to explore datasets and interpret the grammatical components of figures created for you from two different datasets, described below. You will NOT be submitting any code for this assignment! Instead, you will submit some brief interpretations for each of the plots below using the Homework #3 Template, which should be submitted to Canvas as a PDF document.
Below, you will see several plots, each of which was made with one of two datasets: olives or wine. These datasets can be accessed from the ds4b.materials library, as shown:
library(ds4b.materials) # need to load this ONCE per R session
# I also recommend this command which will make it easier to see the datasets interactively
# We'll learn more about this command next week!
library(tidyverse)
# Look at olives by typing olives and other functions like head(), names(), etc.
olives
# Or, see a full view with View():
View(olives)
# And you can do the same with wine:
wine
View(wine)
The olives dataset contains information about 572 olives collected from different regions across Italy. The dataset contains information about: what region the olive is from, what smaller area within each region the olive is from, and what percentage of the olive’s oil profile each fatty acid makes up.
The wine dataset contains information from a chemical analysis of three different cultivars (A, B, and C) of wine, including alcohol percentage and amounts of different chemical components. Variables include:
cultivar: The wine cultivar (A, B, or C)
alcohol: The alcohol percentage of the wine
malic_acid: The percentage of the wine that is malic acid
ash: The percentage of the wine that is ash (it’s a wine thing…)
magnesium: The percentage of the wine that is magnesium
total_phenol: The percentage of the wine that is phenols
flavonoids: The percentage of the wine that is flavonoids
nonflav_phenols: The percentage of the wine that is non-flavonoid phenols
color: The color intensity of the wine, measured numerically
In order to successfully complete this assignment, you will need to spend some time using R interactively to explore these datasets!
For each plot, answer all questions below in the template. Note that bullet points are perfectly fine for your answers. In addition, there is an example plot with answers to guide you below. The types of plots application will also be extremely helpful!
Tip: If you are interested in formatting your answers to emphasize variables in a different font, I recommend you use the font “Monaco” or “Menlo” (or any other fixed-width font).
Tip 2: Make sure you write dataset variables (column names) precisely as they are in the dataset! Do not change capitalization, remove underscores, etc.
Please use the answers to this example to guide you on how to interpret the other plots. Notice that when referring to variables in the answer, I literally and precisely refer to the relevant column name.
Moreover, the example dataset iris is literally built into R, so you can simply type iris in any R Console to see the dataset and explore it with functions we have learned. It is highly recommended to first explore the iris dataset and ensure that you can independently interpret this plot and arrive at the same answers as are shown below. Doing this first will help you understand how to do the rest of the homework.
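As a starting point for that exploration, here is a minimal sketch using only base R functions. It assumes nothing beyond the built-in iris dataset; the same calls should also work on olives and wine once ds4b.materials is loaded.

```r
# iris ships with base R, so no library() call is needed here
head(iris)            # first six rows
names(iris)           # column names, exactly as they should appear in your answers
str(iris)             # column types: four numeric columns and one factor
summary(iris)         # per-column summaries (min, max, quartiles, counts per species)
levels(iris$Species)  # the distinct species in the dataset

# Stored so the result can be checked or reused later
iris_names <- names(iris)
```

Copying column names directly from the output of names() is an easy way to make sure they are written precisely as they appear in the dataset.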
This plot was created from the iris dataset. In this dataset, each row is an iris flower. The dataset contains five variables:
Sepal.Width, the iris’ sepal width in cm
Sepal.Length, the iris’ sepal length in cm
Petal.Width, the iris’ petal width in cm
Petal.Length, the iris’ petal length in cm
Species, what iris species the flower is
Plot question: Does this plot suggest a relationship between iris species and sepal length? If so, what is the relationship?
Plot question: Does this plot suggest a relationship between wine cultivar and ash percentage? If so, what is the relationship?
Plot question: Considering each cultivar separately, what is the relationship (indicate positive/negative/no relationship) between color and flavonoid percentage? Provide a separate answer for each of A, B, and C cultivars.
Plot question: What is the shape of the malic acid percentage distribution for each cultivar? Specifically, for each of A, B, C cultivars, answer if the distribution is unimodal, bimodal, or multimodal and whether it is either roughly symmetric or has some skew (no further details needed). (Some of these answers about unimodal/bimodal/multimodal are a little ambiguous, and that’s ok!)
Plot question: This plot is extremely similar to the previous plot, except with different binning and other visual features. Viewed with this binning specification, now what is the shape of the malic acid percentage distribution for each cultivar? Specifically, for each of A, B, C cultivars, answer if the distribution is unimodal, bimodal, or multimodal. If it’s unimodal, also indicate whether it is either roughly symmetric or has some skew (no further details are needed).
Plot question: What is the shape of the oleic acid percentage distribution for each region? Specifically, for each of the regions shown, answer if the distribution is unimodal, bimodal, or multimodal. If it’s unimodal, also indicate whether it is either roughly symmetric or has some skew (no further details are needed).
Plot question: Do the stearic acid percentage distributions across the three regions appear to have the same or different means and standard deviations? Compare these two summary statistics qualitatively (i.e. no need to calculate anything formally) across all three regions.
Plot question (two parts): From which area are there the most individual samples in this dataset? This question is a good example of a question that can be answered from the plot, even though this question isn’t the specific motivating question for the plot itself. Therefore, also answer here: What do you think the primary motivating question about the data is for this visualization?
Plot question: What is the y-axis variable in this plot? To answer this question you need to explore the data (for example, look at the range of values on the y-axis and use that to your advantage!)
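One way to act on this hint is to print the minimum and maximum of every numeric column and compare them with the range shown on the plot's axis. The sketch below uses the built-in iris dataset purely as a stand-in, since it does not require any extra library; swapping in olives or wine is an assumption that should hold as long as those datasets are ordinary data frames.

```r
# Find which columns are numeric (TRUE/FALSE for each column)
numeric_cols <- sapply(iris, is.numeric)

# range() returns c(min, max); applying it to each numeric column
# gives a small matrix with one column per variable
ranges <- sapply(iris[numeric_cols], range)
rownames(ranges) <- c("min", "max")
ranges
```

A column whose minimum and maximum line up with the axis limits in the plot is a strong candidate for the plotted variable.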