Obtaining and setting up the homework



The actual instructions

Part 1: Reading in the data

In the top chunk named read_data, first…

  • Read in the coffee_ratings dataset directly from the URL where it lives on the internet (not from the file). This is done for you.
  • Then, since we know there is an error in this dataset where 1 row has a point score of 0, let’s remove it before proceeding. Filter the data to keep only rows where total_cup_points != 0, and save the result to clean_coffee_ratings. You should use this tibble, clean_coffee_ratings, for the whole homework. Feel free to change the variable name if you like, as long as it remains informative and is not the same as the raw data name, coffee_ratings.
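
As a rough sketch of the filter-and-save pattern described above, here is the same idea on a tiny made-up tibble. The name toy_ratings and its values are invented for illustration and are not part of the real coffee_ratings data.

```r
library(dplyr)

# Hypothetical toy data standing in for the raw dataset (assumption, not the real data)
toy_ratings <- tibble(
  species          = c("Arabica", "Arabica", "Robusta", "Arabica"),
  total_cup_points = c(87.5, 0, 82.1, 84.3)
)

# Keep every row except the one with the erroneous score of 0,
# and save the result under a new, informative name
clean_toy_ratings <- toy_ratings %>%
  filter(total_cup_points != 0)

clean_toy_ratings
```

Note that filter() also drops rows where the condition evaluates to NA, so it is worth checking the row count before and after.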

Part 2: Go for it!!

You will be conducting exploratory data analysis on the coffee ratings dataset by answering five questions about the coffee data using a combination of wrangling and visualization. Three of these questions are asked for you, and you ask the other two questions. Each question needs at least some wrangling with one or more dplyr and/or tidyr functions, its own plot (although question 3 has two plots!), and a brief answer in 1-3 sentences. Please follow the given template to organize your code (as you did in Homework #5). You can style your plots however you want as long as you ensure professional labeling (no underscores!) of all axes and legend titles.

Importantly, for every written answer you give there MUST BE corresponding code. For the first three questions templated for you, you must conduct calculations as described.

Finally, YOU MUST SPELL CHECK. Seriously.


Specific tasks you have to do as part of questions 1-3

Question 1

  • Your code should make sure to accomplish the following goals:
    • Use wrangling to reveal a tibble of the standard deviation of each processing method’s uniformity. This tibble should have two columns named processing_method and sd_uniformity.
    • Create a figure to visualize the spread of the uniformity distributions across non-NA processing methods
  • Tips
    • Your wrangling is totally separate from your plotting. Wrangling needs three dplyr verbs.
    • Do not make a bar plot! You should plot the distributions of uniformity across processing methods (overlapping density, faceted histogram, boxplot, jitter, violin, etc.). The idea here is that the wrangling and the plot are complementary ways to think about data spread. The wrangling tells us precisely what the spread is, and the plot gives us a visual sense of how spread out the data is.
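
To illustrate how a grouped summary and a distribution plot complement each other, here is a minimal sketch on invented data. The tibble toy_plants and its columns treatment and height are hypothetical, and jitter is just one of the allowed plot types.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with one missing category (assumption, not the coffee data)
toy_plants <- tibble(
  treatment = c("A", "A", "A", "B", "B", "B", NA),
  height    = c(10, 12, 14, 20, 20, 20, 15)
)

# Wrangling: three dplyr verbs give the exact spread of each group
toy_sd <- toy_plants %>%
  filter(!is.na(treatment)) %>%
  group_by(treatment) %>%
  summarize(sd_height = sd(height))

toy_sd

# Plotting, done separately: show the raw values so the spread is visible
toy_spread_plot <- toy_plants %>%
  filter(!is.na(treatment)) %>%
  ggplot(aes(x = treatment, y = height)) +
  geom_jitter(width = 0.1, height = 0) +
  labs(x = "Treatment", y = "Height")
```

The summary reports a number per group, while the plot lets you see at a glance that one group is tightly clustered and the other is not.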


Question 2

  • Your code should make sure to accomplish the following goals:
    • Reveal a tibble of how many Robusta coffees are grown in countries that grow Robusta coffees (i.e. no rows with “0”). This tibble should have two columns named country_of_origin and number_robusta
    • Create a figure to visualize these counts.
      • For some bonus points, make a waffle plot! If you want to pursue this bonus option, you first need to install the package ggwaffle with this code (copy/paste into Console): remotes::install_github("liamgilbey/ggwaffle").
  • Tips
    • Your wrangling needs two dplyr verbs.
    • You can pipe the wrangling into the plot if you want, but this is not strictly required.
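
One possible shape for "count one category per group, with no zero rows" is sketched below on made-up data. The tibble toy_pets and the column number_cats are hypothetical, and other verb combinations would work equally well.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (assumption, not the coffee data)
toy_pets <- tibble(
  city = c("Oslo", "Oslo", "Lima", "Lima", "Lima", "Cairo"),
  pet  = c("cat", "dog", "cat", "cat", "dog", "dog")
)

# Two dplyr verbs: keep only cats, then count rows per city.
# Cities with no cats never appear, so there are no rows with 0.
toy_cat_counts <- toy_pets %>%
  filter(pet == "cat") %>%
  count(city, name = "number_cats")

toy_cat_counts

# The wrangled tibble can be piped straight into a plot
toy_count_plot <- toy_cat_counts %>%
  ggplot(aes(x = city, y = number_cats)) +
  geom_col() +
  labs(x = "City", y = "Number of cats")
```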


Question 3

  • Your code should make sure to accomplish the following goals:
    • Use wrangling to reveal a tibble of the mean of each color’s moisture, excluding unknown (NA) colors. This tibble should have two columns named color and mean_moisture.
    • Create TWO FIGURES for this question: The first should show the literal mean value of moisture for each non-NA color (hint: bar plot!), and the second should show the moisture distributions across non-NA colors. You can make both figures in the same chunk, or create an additional (named!) chunk as needed.
  • Your answer should specifically address which figure, in your opinion, is most helpful for answering the question and why.
  • Tips
    • Your wrangling needs three dplyr verbs.
    • The first plot you have to make here is a bar plot (using geom_col()! Think about why this geom!).
    • The second plot you have to make here is literally only asking you to plot moisture across (non-NA) categories of color. It has nothing to do with the means that were necessary to calculate for making the first plot.
    • The purpose of this question is to see the difference between looking at numeric distributions as “just the mean of all those numbers” vs “the full distributions of the numbers themselves.” So, your answer should discuss which is more informative for understanding GENERALLY how moisture differs across color categories.
    • I highly recommend not making a boxplot for your second plot. It might be a neat adventure to take along the way though so you can see why I recommend no boxplots here (it will also teach you something about pitfalls of boxplots!).
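
To see why a plot of means and a plot of full distributions can tell different stories, consider this small sketch on invented data. The tibble toy_scores and the group labels are hypothetical and were chosen so that both groups share the same mean.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: same mean in both groups, very different spread
toy_scores <- tibble(
  group = c(rep("steady", 4), rep("varied", 4)),
  score = c(5, 5, 5, 5, 1, 1, 9, 9)
)

# Mean of each group
toy_means <- toy_scores %>%
  group_by(group) %>%
  summarize(mean_score = mean(score))

toy_means

# Plot of the means only: the two bars have identical heights
toy_mean_plot <- ggplot(toy_means, aes(x = group, y = mean_score)) +
  geom_col() +
  labs(x = "Group", y = "Mean score")

# Plot of the raw values: the difference between the groups becomes visible
toy_dist_plot <- ggplot(toy_scores, aes(x = group, y = score)) +
  geom_jitter(width = 0.1, height = 0) +
  labs(x = "Group", y = "Score")
```

Here the means alone would suggest the groups are the same, which is exactly the kind of comparison your written answer should weigh.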