Obtain the homework template from your RStudio Cloud class project by running the following code in the R Console:
library(ds4b.materials) # Load the class library
launch_homework(9) # Launch Homework 9
You must set an RMarkdown theme and code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
Make sure your Rmd knits without errors before submitting. If it does not produce an HTML output, this means it does not knit. DO NOT SKIP THIS STEP! Ensuring code runs without errors is MORE IMPORTANT than writing code in the first place.
As always, you are encouraged to work together and use the class Slack to help each other out, but you must submit YOUR OWN CODE.
For all plots, you need to print/reveal them in the final knitted RMarkdown, but you do NOT need to export them to a file with ggsave(). Enjoy!
When pivoting, we often need to provide a bunch of columns as an argument to a pivot function. It is often easier to list these columns using some select() magic. We have previously seen the magic of everything() (select everything else!). This magic actually comes to us from tidy-select, functionality in dplyr for conveniently specifying a set of columns.
For this set of questions, practice some of this new tidy-select magic, and use introverse::get_help("tidyselect") to get help!
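As a warm-up, here is a minimal sketch of a few tidy-select helpers on a small made-up tibble. The tibble pets and all of its column names are hypothetical and exist only for illustration; they are not part of the homework data.

```r
library(dplyr)

# Hypothetical toy tibble; these column names are invented for illustration.
pets <- tibble(
  pet_id      = 1:3,
  coat_color  = c("black", "white", "brown"),
  coat_length = c("short", "long", "short"),
  weight_kg   = c(4.2, 3.8, 5.1)
)

# Columns whose names begin with "coat"
coat_cols <- pets %>% select(starts_with("coat"))

# Columns whose names contain "length" anywhere in the name
length_cols <- pets %>% select(contains("length"))

# Put weight_kg first, then keep everything else in its original order
weight_first <- pets %>% select(weight_kg, everything())

names(coat_cols)
names(length_cols)
names(weight_first)
```

Each helper goes inside select() exactly where you would otherwise type column names, and helpers can be mixed with plain column names in the same call.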
1.0. Make a manageable version of the wine_version1 dataset to use to learn and practice tidy-select skills. To do this, we will create a new tibble called wine5 that contains only the first five rows of data. To accomplish this, you can either filter() on wine_index or use the dplyr function slice(). Your code, your call!
1.1. Use starts_with() to select all columns in wine5 that start with the lowercase letter “c”.
1.2. Use contains() to select all columns in wine5 that contain the word “sulfur” (make sure your spelling is right!).
1.3. Select all columns in wine5 that contain the word “sulfur” as well as the columns sulphates and chlorides.
1.4. You can use colons to specify a range of columns in order (just like making vectors of numbers in order, e.g. 1:10 makes the vector of numbers 1-10). Use colons to select all columns in wine5 from citric_acid through pH.
1.5. The function last_col() allows you to quickly select the last column in a tibble, based on the order columns appear. Select the last column only from wine5.
1.6. Combine your colon and last_col() skills to select all columns from citric_acid through the last column, without ever using the name of the last column.
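The colon and last_col() ideas can be sketched on a small made-up tibble. The tibble scores and its column names are hypothetical, chosen only to show the pattern.

```r
library(dplyr)

# Hypothetical toy tibble; column names are invented for illustration.
scores <- tibble(
  student = c("a", "b"),
  quiz1   = c(8, 9),
  quiz2   = c(7, 10),
  quiz3   = c(9, 6),
  final   = c(88, 92)
)

# A colon selects a run of adjacent columns, in the order they appear
quiz_cols <- scores %>% select(quiz1:quiz3)

# last_col() picks whichever column happens to be last
last_only <- scores %>% select(last_col())

# The two combine: from quiz2 through the end, without naming the last column
quiz2_to_end <- scores %>% select(quiz2:last_col())

names(quiz_cols)
names(last_only)
names(quiz2_to_end)
```

Because a range depends on column order, it is worth checking names() on your tibble before relying on one.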
1.7. Use your skills to select the columns wine_index, type, and density through sulphates from wine5.
1.8. Combine all your skills to select the following columns from wine5: wine_index, type, quality, free_sulfur_dioxide, total_sulfur_dioxide, and alcohol. Your code must incorporate THREE of the new tidy-select skills used in this homework section.
2.1. Write code to convert the untidy wine_version1 tibble to a tidy tibble. Its final columns should be the same columns that appear in wine_version2, in the same order.
2.2. Write code to convert the tidy wine_version2 tibble to an untidy tibble. Its final columns should be the same columns that appear in wine_version1, in the same order.
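As a reminder of how the two pivot functions undo each other, here is a minimal sketch of a round trip on a made-up tibble. The tibble plants and the names measurement and value are hypothetical; the real wine tibbles have their own columns, which you should inspect first.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tibble; column names are invented for illustration.
plants <- tibble(
  plant_id = 1:2,
  height   = c(10.5, 12.0),
  width    = c(3.2, 4.1)
)

# Gather the two measurement columns into name/value pairs
plants_long <- plants %>%
  pivot_longer(
    cols      = height:width,
    names_to  = "measurement",
    values_to = "value"
  )

# Spread them back out into one column per measurement
plants_wide <- plants_long %>%
  pivot_wider(names_from = measurement, values_from = value)

plants_long
plants_wide
```

Note that pivot_longer() takes the new column names as quoted strings, while pivot_wider() refers to existing columns, so they are unquoted.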
2.3. Wrangle the wine_version1 tibble to create a tidy tibble that contains the following columns in this order (hint: in this circumstance, quality must be considered one of the attributes - it should not still be a column in the data!).
wine_index
type
attribute
value
2.4. Ask and answer one exploratory question about this dataset, including a figure and any necessary wrangling. No underscores in your figure, but style it any other way you wish. You can use any structural version of this wine dataset in your code, up to you!
Consider the ds4b.materials dataset seals, which contains information about 10 seals (column seal). Researchers had each seal dive underwater twice: once to feed (feeding is “Yes”) and once just to dive without any feeding (feeding is “No”). They hypothesized that seals used more oxygen (oxygen_use) while going on feeding dives compared to non-feeding dives. For this section of the homework you will make two plots that each answer the question: Do feeding dives use more oxygen than non-feeding dives, on average?
3.1. Let’s get back to our fundamentals: Make a jitter plot showing the distributions of oxygen_use across feeding categories. Specifically…
Color the points by feeding category using a non-default palette. Style the rest any way you like, as long as there are no underscores!
Use stat_summary() in a clearly-visible color and size to see the means of each distribution.
3.2. The goal for this question is to make a scatterplot that contains one point for each seal. The x-axis should show the oxygen use when NOT feeding, and the y-axis should show the oxygen use WHEN FEEDING.
BUT! It is not actually possible to make the plot with the data in its current structure! To achieve this plot, we need to have these values in separate columns….
Use pivot_wider() to create a new version of the data that contains these three columns: seal (the seal ID number), oxygen_feeding (oxygen use while feeding), and oxygen_nonfeeding (oxygen use while not feeding).
Make the scatterplot from this new version of the data with ggplot().
Use geom_abline() (and you can style with a color if you want!) to create a Y=X line. This is not a trendline - it’s literally a line with slope of 1 and y-intercept of 0. Use this line to interpret your plot so that you can answer in 1-2 sentences why or why not this figure shows evidence that feeding dives use more oxygen than do non-feeding dives.
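The general pattern (one row per subject, one column per condition, then a Y=X reference line) might look like the sketch below. The tibble runs, its columns, and its values are all made up for illustration and are unrelated to the seals data; names_prefix is just one way to get descriptive column names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data: four runners, each timed in old and new shoes.
runs <- tibble(
  runner   = rep(1:4, each = 2),
  shoes    = rep(c("old", "new"), times = 4),
  time_sec = c(61, 58, 70, 71, 65, 60, 59, 57)
)

# One row per runner, with one column per shoe condition
runs_wide <- runs %>%
  pivot_wider(
    names_from   = shoes,
    values_from  = time_sec,
    names_prefix = "time_"
  )

# Scatterplot of the paired values with a Y=X reference line
paired_plot <- ggplot(runs_wide, aes(x = time_old, y = time_new)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "firebrick") +
  labs(x = "Time in old shoes (sec)", y = "Time in new shoes (sec)")

paired_plot
```

Points above the line have a larger y value than x value, and points below it have a smaller one, which is what makes the Y=X line useful for comparing paired measurements.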
Consider your favorite dataset on coffee ratings. The goal for this final part of the homework is to make a figure of the relationship between total_cup_points and each of these other columns (notice how they are all adjacent in the tibble? Use a colon : in your code!!)
aroma
flavor
aftertaste
acidity
body
balance
uniformity
clean_cup
sweetness
cupper_points
moisture
This figure should be faceted, where total_cup_points is always on the Y-axis, and each of those variables is on each panel of the X-axis. However, this cannot be done with the data in its current format. We have to wrangle and pivot the data a little bit to be able to plot it. Let’s go!
4.1. Wrangle the coffee_ratings data in a single pipeline.
Create a new column coffee_id which contains values 1-[number of rows]. This allows each observation of coffee to have a uniquely identifying value, so that when we pivot we still know what values belong together. Surprise, this part is written for you!! Run and engage with it to see the dplyr helper function n() in action (and for more, get_help("n")).
Keep only the columns coffee_id, total_cup_points, and the other attributes listed above.
Pivot the data so that its columns are coffee_id, total_cup_points, attribute (a new character column), and value (a new numeric column).
Save the result as coffee_part4 and print the tibble out.
4.2. Make a figure from coffee_part4 as described. Drawing this out will help you!! You want to make a faceted scatterplot with facet_wrap(), where total_cup_points is plotted across each other attribute. Each panel in the plot should have its own linear line-of-best-fit. You should ensure free axis ranges as well (see lecture notes from 10/6!), and use any other styling you wish. Show the figure in your knitted Rmd, but you do not need to export with ggsave().
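The overall shape of this workflow (add a row ID with n(), pivot the predictor columns longer, then facet) can be sketched on a made-up tibble. The tibble houses and all of its columns are hypothetical and unrelated to the coffee data; this is one possible arrangement, not the only one.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data; column names and values are invented for illustration.
houses <- tibble(
  price = c(200, 250, 310, 180, 400, 275),
  area  = c(80, 95, 120, 70, 150, 100),
  rooms = c(3, 3, 4, 2, 5, 4),
  age   = c(30, 12, 5, 45, 2, 20)
)

houses_long <- houses %>%
  mutate(house_id = 1:n()) %>%            # n() is the number of rows
  select(house_id, price, area:age) %>%   # the colon grabs adjacent columns
  pivot_longer(
    cols      = area:age,
    names_to  = "attribute",
    values_to = "value"
  )

# One panel per attribute, each with its own axis ranges and linear fit
facet_plot <- ggplot(houses_long, aes(x = value, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(vars(attribute), scales = "free")

facet_plot
```

Setting scales = "free" lets every panel choose its own axis ranges, which matters when the attributes are measured on very different scales.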