Obtaining and setting up the homework

Instructions

For all plots, you need to print/reveal them in the final knitted Rmarkdown, but you do NOT need to export them to a file with ggsave(). Enjoy!

Part 1: Select magic

When pivoting, we often need to provide a bunch of columns as an argument to a pivot function. It is often easier to list these columns some select() magic. We have previously seen the magic of everything() (select everything else!). This magic actually comes to us from tidy-select, functionality in dplyr for conveniently specifying a set of columns.

For this set of questions, practice some of this new tidy-select magic, and use introverse::get_help("tidyselect") to get help!


1.0. Make a managable version of the wine_version1 dataset to use to learn and practice tidy-select skills. To do this, we will create a new tibble called wine5 that contains only the first five rows of data. To accomplish this, you can either filter() on wine_index or use the dplyr function slice(). Your code, your call!

1.1. Use starts_with() to select all columns in wine5 that start with the lowercase letter “c”.


1.2. Use contains() to select all columns in wine5 that contain the word “sulfur” (make sure you’re spelling is right!).


1.3. Select all columns in wine5 that contain the word “sulfur” as well as the columns sulphates and chlorides.


1.4 You can use colons to specify a range of columns in order (just like making arrays of numbers in order, i.e. 1:10 makes the array of numbers 1-10). Use colons to select all columns in wine5 from citric_acid through pH.


1.5. The function last_col() allows you to quickly select the last column in a tibble, based on the order columns appear. Select the last column only from wine5.


1.6. Combine your colon and last_col() skills to select all columns from citric_acid through the last column, without ever using the name of the last column.


1.7. Use your skills to select the columns wine_index, type, and density through sulphates from wine5.


1.8. Combine all your skills to select the following columns from wine5: wine_index, type, quality, free_sulfur_dioxide, total_sulfur_dioxide, and alcohol. Your code must incorporate THREE of the new tidy-select skills used in this homework section.



Part 2: Pivoting the wine versions

2.1. Write code to convert the untidy wine_version1 tibble to a tidy tibble. Its final columns should be the same columns that appear in wine_version2, in the same order.


2.2. Write code to convert tidy wine_version2 tibble to an untidy tibble. Its final columns should be the same columns that appear in wine_version1, in the same order.


2.3. Wrangle the wine_version1 tibble to create a tidy tibble that contains the following columns in this order (hint: in this circumstance, quality must be considered one of the attributes - it should not still be a column in the data!).

  • wine_index
  • type
  • attribute
  • value


2.3. Ask and answer one exploratory question about this dataset, including a figure and any necessary wrangling. No underscores in your figure, but style any other way you wish. You can use any structural version of this wine dataset in your code, up to you!



Part 3: Plotting some seals

Consider the ds4b.materials dataset seals, which contains information about 10 seals (column seal). Researchers had each seal dive underwater twice: once to feed (feeding is “Yes”) and once just to dive without any feeding (feeding is “No”). They hypothesized that seals used more oxygen (oxygen_use) while going on feeding dives compared to non-feeding dives. For this section of the homework you will make two plots that each answer the question: Do feeding dives use more oxygen than non-feeding dives, on average?


3.1. Let’s get back to our fundamentals: Make a jitter plot showing the distributions of oxygen_use across feeding categories. Specifically…

  • Color (or fill if you use a different point shape!) points based on feeding category using a non-default palette. Style the rest any way you like, as long as no underscores!
  • Add on stat_summary() in a clearly-visible color and size to see the means of each distribution.
  • In 1-2 sentences, explain why or why not this figure shows evidence that feeding dives use more oxygen than do non-feeding dives.
  • Note: One reason a jitter plot is good here is there are only 10 points for each distribution! With such few data points, we are really able to see them all at once.


3.2. The goal for this question is to make a scatterplot that contains one point for each seal. The x-axis should show the oxygen use when NOT feeding, and the y-axis should show the oxygen use WHEN FEEDING.

BUT! It is not actually possible to make the plot with the data in its current structure! To achieve this plot, we need to have these values in separate columns….

  • Wrangle the data using pivot_wider() to create a new version of the data that contains these three columns: seal (the seal ID number), oxygen_feeding (oxygen use while feeding), and oxygen_nonfeeding (oxygen use while not feeding).
  • Then, use this data to make your scatterplot, styled in any way of your choosing as long as there are no underscores. You can do this in two steps (make the dataset and then plot it) or as a single pipeline from wrangling directly into ggplot().
  • As part of your plot, add this geom: geom_abline() (and you can style with a color if you want!) to create a Y=X line. This is not a trendline - it’s literally a line with slope of 1 and y-intercept of 0. Use this line to interpret your plot so that you can answer in 1-2 sentences why or why not this figure shows evidence that feeding dives use more oxygen than do non-feeding dives.
    • Hint: If points fall on the line, that means an EQUAL amount of oxygen is used for each dive type. If most points fall on the side of the line towards “feeding” that means feeding dives use more oxygen, and vice versa.



Part 4: Let’s not forget about coffee!

Consider your favorite dataset on coffee ratings. The goal for this final part of the homework is to make a figure of the relationship between total_cup_points and each of these other columns (notice how they are all adjacent in the tibble? Use a colon : in your code!!)

  • aroma
  • flavor
  • aftertaste
  • acidity
  • body
  • balance
  • uniformity
  • clean_cup
  • sweetness
  • cupper_points
  • moisture

This figure should be faceted, where total_cup_points is always on the Y-axis, and each of those variables is on each panel of the X-axis. However, this cannot be done with the data in its current format. We have to wrangle and pivot the data a little bit to be able to plot it. Let’s go!


4.1. Wrangle the coffee_ratings data in a single pipeline.

  • Create a new column coffee_id which contains values 1-[number of rows]. This allows each observation of coffee to have a uniquely identifying value, so that when we pivot we still know what values belong together. Surprise, this part is written for you!! Run and engage with it to see the dplyr helper function n() in action (and for more, get_help("n"))
  • Tack onto the pipeline code: select only the columns you are interested in using, which include coffee_id, total_cup_points, and the other attributes listed above.
  • Use a pivot function to make the data longer and end up with the following four columns: coffee_id, total_cup_points, attribute (a new character column), and value (a new numeric column).
  • Finally, save to a new tibble called coffee_part4 and print the tibble out.


4.2. Make a figure from coffee_part4 as described. Drawing this out will help you!! You want to make a faceted scatterplot with facet_wrap(), where total_cup_points is plotted across each other attribute. Each panel in the plot should have its own linear line-of-best-fit. You should ensure free axis (see lecture notes from 10/6!) ranges as well, and use any other styling you wish. Show the figure in your knitted Rmd, but you do not need to export with ggsave().

  • HINT: Scatterplots look at relationships between NUMBERS. Things won’t work too well if you place a categorical variable on an axis!
  • For this plot, there will be no deductions if your facet labels (“strips”) contain underscores. But, I will give you bonus points if your code (in either part of question 4) is able to clean that data up so the final plot has no underscores whatsoever. Either way, axis labels should be informative and without underscores.