Obtaining and setting up the homework

Obtain the homework template from your RStudio Cloud class project by running the following code in the R Console:
```
library(ds4b.materials) # Load the class library
launch_homework(6)      # Launch Homework 6
```
You must set an RMarkdown theme and code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
- Choose your favorite theme among the pre-packaged themes (ignore everything below “Even More Themes”) shown at this link
- Choose your favorite syntax highlighting among these options at this link
Make sure your Rmd knits without errors before submitting. If it does not produce an HTML output, this means it does not knit. DO NOT SKIP THIS STEP! Ensuring code runs without errors is MORE IMPORTANT than writing code in the first place.
- If there are errors in your code, you should comment out the code so that it does not actually run. This is BETTER than keeping the buggy code in there without commenting out - it shows me you attempted the code, but understood that it didn’t work properly. Partial credit will come to you! But, if you leave buggy code in, then the Rmd will not knit and there will be deductions.
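As a small illustration of commenting out code that errors (a made-up example, not taken from the homework; the variable names are hypothetical), the misspelled line below is kept as a comment so the chunk can still run:

```r
values <- c(2, 4, 6)

# Attempted code that errors because `valeus` is misspelled, so it is commented out:
# mean(valeus)

# Working code that the knit can actually run
mean_of_values <- mean(values)
mean_of_values
```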
As always, you are encouraged to work together and use the class Slack to help each other out, but you must submit YOUR OWN CODE.

Overview

For this homework, you will be…

Practicing reading datasets into R and saving figures to files.
Beginning to engage with dplyr and tidyr packages for data wrangling. These packages are part of the core tidyverse and are loaded when you run library(tidyverse).
- Practicing using the pipe %>% operator a whole bunch (get_help("pipe"))
- Want more practice and help with some of these functions? Check out corresponding exercises here.
Creating your own Rmarkdown organization!! For each question, you should create a level-4 header with the question number and named code chunk to provide the answer in. Make sure to knit frequently to make sure the format is as you intend! Question 1’s formatting is started for you.
- Remember these short videos to help navigate RMarkdown, as well as your activities/rmarkdown_demonstration.Rmd document! Don’t have the document? Load the ds4b.materials library and run launch_activity("rmarkdown") and see class recording from Thursday 9/16/21.

The actual instructions

Part 1: Setting up directories and obtaining the dataset

On the class website, there is a CSV file you can download called coffee_ratings.csv which contains the dataset documented here. Take a few minutes to read about this dataset before you begin.

To begin the homework, you need to take the following steps:

In RStudio Cloud, use the “New Folder” button in the Files Pane to create two new directories (folders) in your RStudio Cloud account inside the homeworks/ directory called figures/ and datasets/. In other words you are creating directories that will end up having these full paths:
- /cloud/project/homeworks/figures/
- /cloud/project/homeworks/datasets/
Download the CSV file to your computer from this link
- To force the file to download rather than appearing in a new tab, right click the link and select “Save Link As” to download it.
- CAUTION: When you download this file, your computer may save it as coffee_ratings.txt instead of coffee_ratings.csv. THAT’S OK!! You should use whatever extension it downloaded as when you are referring to the file name! It will not affect the file or the procedure to read it in. Just make sure to refer to the file name that yours downloaded as. If you are feeling fancy, you can always rename it to coffee_ratings.csv once it’s in the cloud (keep reading!) using the “Rename” button.
Upload the CSV file to your RStudio Cloud, and save it in the new directory /cloud/project/homeworks/datasets/

Part 2: Defining paths you will need for the homework

In the Homework 6 template, you will see a chunk at the top whose name is paths_and_data. Here, we want to define paths to certain directories and files this homework uses. However, there’s a major catch: When we are interactively working in Console, the default working directory in RStudio Cloud is /cloud/project/. But, when you click the knit button for an Rmd file, the knitting engine assumes the working directory is wherever the Rmd file is saved, which in this case is /cloud/project/homeworks/. This leads to very frustrating conflicts that bother pretty much everyone working in R!!

The solution arrives to us from the amazing {here} package, which provides help clarifying paths in these annoying situations. The package has a function called here() (same name!) which is typically called as here::here() (following syntax package::function()) that automatically finds the “top-level directory” of an R project on whichever machine it is being run on!! It’s super helpful.
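Here is a minimal sketch of how here::here() can combine with file.path(), assuming the {here} package is installed. The directory name example_directory is hypothetical and is not one of the homework's directories:

```r
# Find the top-level project directory on whichever machine this runs on
project_root <- here::here()

# Build a path below the project root (example_directory is a made-up name)
path_to_example <- file.path(project_root, "homeworks", "example_directory")
path_to_example
```

Because the path starts from the project root, it should come out the same whether the code runs in Console or during knitting.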

Returning to the paths_and_data chunk, you will need to create three variables using the file.path() function:

path_to_datasets, which should be the path to the homeworks/datasets/ directory (this one has been done for you!)
path_to_figures, which should be the path to the homeworks/figures/ directory
coffee_ratings_file, which should be the path to the coffee_ratings.csv (or coffee_ratings.txt if that’s how yours saved) file you saved in the new homeworks/datasets/ directory. This variable definition must use the path_to_datasets variable.

Then, in this chunk, you should read in the coffee ratings dataset with read_csv() and save it to a variable coffee_ratings for use in the assignment.
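To see the general pattern of building a file path from a directory variable and then reading the file, try this sketch in Console. It uses a temporary directory and a copy of iris as stand-ins, so the names path_to_demo and demo_file are hypothetical and not the homework's variables:

```r
library(tidyverse)

# Stand-in directory (a temporary folder) so this runs anywhere
path_to_demo <- tempdir()

# Define the file path USING the directory variable
demo_file <- file.path(path_to_demo, "iris_demo.csv")

# Write a small CSV there, then read it back in with read_csv()
write_csv(iris, demo_file)
iris_from_file <- read_csv(demo_file)
iris_from_file
```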

Part 3: Questions

All code should be formatted as pipelines with %>%!! This is REQUIRED.

Use filter() to subset the data to contain only “Arabica” (column species) coffees in the dataset. Your code should print the resulting tibble from this operation.

Use filter() to subset the data to contain only coffee from Ethiopia (column country_of_origin) which has a total_cup_points score greater than 87. For this question, you’ll want to provide two arguments to filter() to achieve both conditions. Your code should print the resulting tibble from this operation.

Using a similar strategy to the previous question, answer the question: How many coffee farms in Guatemala have a total_cup_points score less than 70? The code should reveal a single number telling you how many rows there are in the subsetted data, aka how many farms. To achieve this, you should pipe into the function nrow() after filtering.
- Hint: Instead of nrow() you can use the dplyr function tally() if you want! Look it up with get_help("tally"). If you use this approach, your code will print a tibble containing the number answering the question - that’s totally fine!!
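For a feel of how filtering on two conditions and then counting rows fits together, here is a sketch using iris instead of the coffee data (the variable names are made up for the demo). Both approaches should report the same count:

```r
library(tidyverse)

# Two conditions given as two arguments to filter(), then count rows with nrow()
n_with_nrow <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5) %>%
  nrow()
n_with_nrow

# Same filtering, but tally() returns a table with a column called n
n_with_tally <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5) %>%
  tally()
n_with_tally
```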

Using the same strategy from the previous question, write code that reveals a number answering the question: How many different Bourbon variety coffees (variety column) are from Brazil?

Again using the same strategy, answer: How many Robusta (hint, this is a species!) coffees are grown on the farm named “finca medina”? (Hint: there may not be that many……….)

Make a plot that shows the relationship between total_cup_points and balance, where points are colored by country_of_origin and balance is plotted across total_cup_points.
- For this question, and all other ggplot questions, you must pipe the dataset into the ggplot() function rather than directly providing the argument. In other words…
```
  # NO!!
  ggplot(coffee_ratings) + ...

  # YES!
  coffee_ratings %>%
    ggplot() + ...
    ...
```
- Hint: Yes, this plot is trash. Don’t put too much effort into de-trashing it. It’s just trash.

You may have noticed that, indeed, the previous plot is trash. There are two reasons for this:
- The dataset contains an outlier point where total_cup_points (and other point scores) are all 0! This is an error in the dataset where something wasn’t entered in properly. By exploring the data visually, we were able to detect this problem.
- TOO MANY COUNTRIES!! Including some NAs!! Having NA values isn’t necessarily something wrong with the data, but it’s not always helpful to include those data points when visualizing.
For this question, use the filter() function to remove rows whose total_cup_points equals 0 (aka, keep only rows where total_cup_points does not equal 0 - see get_help("logical") for help remembering the operator to use), and remove all rows whose country_of_origin is NA using the function drop_na(). Save this tibble to clean_coffee_ratings, and print it out. Note that you will use this tibble for the rest of the assignment!!
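Here is a sketch of the same two cleaning steps on a tiny made-up tibble (demo_scores and its columns are hypothetical, not the coffee data). Giving drop_na() a column name removes only rows where that column is NA:

```r
library(tidyverse)

# A tiny made-up dataset with one zero score and one missing group
demo_scores <- tibble(
  score = c(0, 81, 85, 90),
  group = c("a", NA, "b", "c")
)

clean_demo_scores <- demo_scores %>%
  filter(score != 0) %>%   # keep rows where score does NOT equal 0
  drop_na(group)           # remove rows where group is NA
clean_demo_scores
```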

Now, let’s remake the plot from question 6 to contain only TWO countries of your choosing. In a single pipeline, filter clean_coffee_ratings to these two countries of your choosing, and then pipe into ggplot() where you should again visualize balance across total_cup_points. Follow these specific instructions for making this plot:
- Use a colorblind-friendly non-default palette for the color mapping and ensure more professional (i.e., no underscores) axis labels.
- Set a non-default theme of your choosing
- Save the plot to a variable, and reveal the plot within the Rmarkdown.
- Then, save the plot to a PNG figure file inside the newly-created figures directory. The name of the file should be yourlastname_question8.png (e.g., mine would be spielman_question8.png). To make this happen, you’ll need to specify an argument to ggsave() with the path to the file. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code.
- You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
- Hint: The most convenient way to accomplish this filtering is with the %in% logical operator (get_help("logical"))! Explore this demo code (copy/paste into Console!) to see how this could be accomplished:
```
# This works!
iris %>%
  filter(Species %in% c("setosa", "versicolor")) # keep only rows where Species is in that array

# THIS WILL NOT WORK PROPERLY with ==!! It will look like it works, but it will not actually work!
# It doesn't work since == compares row by row and recycles the array (row 1 vs "setosa", row 2 vs "versicolor", ...),
# so many rows that should be kept get silently dropped.
iris %>%
  filter(Species == c("setosa", "versicolor")) 
```
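The overall shape of "filter, plot, save to a variable, reveal, then ggsave()" can be sketched with iris as below. This is only an illustration: path_to_demo_figures is a hypothetical stand-in (a temporary folder) for your own figures path, and the viridis palette and theme_bw() are just one possible choice of palette and theme.

```r
library(tidyverse)

# Hypothetical stand-in for a figures directory so this runs anywhere
path_to_demo_figures <- tempdir()

# Filter, then pipe into ggplot(), and save the plot to a variable
iris_plot <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_viridis_d() +   # one non-default palette option
  labs(x = "Sepal length (cm)", y = "Sepal width (cm)", color = "Species") +
  theme_bw()

# Reveal the plot
iris_plot

# Build the file path with file.path(), then save with explicit width and height
demo_figure_file <- file.path(path_to_demo_figures, "demo_scatterplot.png")
ggsave(demo_figure_file, iris_plot, width = 6, height = 4)
```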

How many unique combinations of species and variety are in this dataset? We’ll answer this question using some data wrangling. Take the following steps to create a pipeline from the clean_coffee_ratings dataset (not coffee_ratings!!):
- Use select() to keep only the columns variety and species.
- Use distinct() (a new dplyr function!) to retain only unique rows. You don’t need to give this function any arguments here - it’s just distinct()!
- Remove all NA values with drop_na().
- The number of remaining rows is the answer - pipe into either nrow() or tally() to arrive at the final product
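The steps above can be tried out on a different dataset first. This sketch counts unique combinations of cyl and gear in the built-in mtcars data (n_combinations is a made-up variable name):

```r
library(tidyverse)

n_combinations <- mtcars %>%
  select(cyl, gear) %>%   # keep only the two columns of interest
  distinct() %>%          # retain only unique rows
  drop_na() %>%           # remove rows with NA (mtcars has none, shown for the pattern)
  nrow()                  # the number of remaining rows is the answer
n_combinations
```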

Time to learn a new geom: geom_col(). Previously, we have learned that geom_bar() will plot (automatically!) the counts for a categorical variable. What if we literally just want to draw bars of a certain height where both X and Y-axes are mapped to a variable? We can use the other function geom_col() for this.

To see how this might work, we’ll also learn the dplyr function count(), which will count all categories in a categorical variable. Copy and paste this code as the first part of your answer’s pipeline:
```
clean_coffee_ratings %>%
  count(variety)
```
You’ll see a tibble with two columns: variety (all those categories) and a new column n which contains their count: How many times does that variety appear in the dataset?

Then continue building a pipeline with the next steps:
- Use filter() to keep only varieties that appear at least 50 times, aka whose n value is >= 50, and remove all NA values.
- Pipe everything into ggplot() where you should specify variety on the X-axis, n on the Y-axis, and use the geom_col() geom.
- This time, style this plot in ANY WAY you choose as long as axis labels look professional (the n is not professional)!!
- But, yet again, you must save the plot to a variable, and reveal the plot within the Rmarkdown, and then save the plot to a PNG figure file inside the newly-created figures directory. The name of the file should be yourlastname_question10.png. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
- For bonus points: Learn and incorporate the forcats function fct_reorder() into your code so bars are shown in order of counts. See get_help("fct_reorder") or here.
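To explore how count(), geom_col(), and fct_reorder() fit together before writing your own answer, here is a sketch using the mpg dataset that comes with ggplot2. The cutoff of 20 and the label text are arbitrary choices for the demo:

```r
library(tidyverse)

class_count_plot <- mpg %>%
  count(class) %>%          # one row per class, with its count in column n
  filter(n >= 20) %>%       # keep only classes that appear at least 20 times
  drop_na() %>%             # remove NA rows (none here, shown for the pattern)
  ggplot(aes(x = fct_reorder(class, n), y = n)) +   # order bars by their counts
  geom_col() +
  labs(x = "Vehicle class", y = "Number of vehicles")

class_count_plot
```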

Again starting from the dataset clean_coffee_ratings, use mutate() to make a new column in the dataset called moisture_decimal which contains the moisture score divided by 100. Then use select() to keep only the columns moisture and moisture_decimal.

Use mutate() to create a new column in the clean_coffee_ratings dataset called cup_rank where coffees with < 82.5 total_cup_points have the value “low”, and coffees with >= 82.5 (aka, otherwise) total_cup_points have the value “high”. Then use select() to keep only the columns total_cup_points, cup_rank, and aroma. Save this tibble to a new variable called coffee_ranks and print the tibble.
- This is the dramatic and exciting return of the ifelse() function! Note that dplyr has its own version of this function, if_else(), which you can use instead (see the difference with get_help("if_else")). Either of those functions is fine here.
- Engage with (copy/paste into Console and explore it!!) this example to see how you can approach this question. This is a fantastic approach for deriving a categorical variable from a numeric one!
```
iris %>%
  mutate(size_sepal_width = if_else(Sepal.Width > 3, "big", "small"))
```

Make a faceted histogram from this new dataset coffee_ranks that shows aroma distributions faceted by cup_rank. You will want to use facet_wrap() for this question (get_help("facet_wrap")!). Again pipe the dataset into ggplot(). Style this plot in any way of your choosing, but again…
- Save the plot to a variable, and reveal the plot within the Rmarkdown, and then save the plot to a PNG figure file inside the newly-created figures directory. The name of the file should be yourlastname_question13.png. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
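If you want to see facet_wrap() in action first, this sketch makes a faceted histogram from iris (the number of bins and the labels are arbitrary demo choices):

```r
library(tidyverse)

iris_histogram <- iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(bins = 15) +
  facet_wrap(vars(Species)) +   # one panel per species
  labs(x = "Sepal length (cm)", y = "Count")

iris_histogram
```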

Make a pipeline that takes the following steps…
- Filter clean_coffee_ratings to only “Arabica” species.
- Filter to keep only these three countries (again, use %in%!!): Mexico, Colombia, and Guatemala.
- Use select() to keep only the columns country_of_origin, flavor and acidity.
- Use mutate() to create a new column called flavor_acidity_ratio that contains flavor divided by acidity.
- Save this tibble to a new variable called my_coffees and print it out.

Using your new my_coffees dataset, make two plots (one for this question, one for the next question) that show the relationship between flavor and acidity, where you are again piping the data into ggplot().

For this question, make a scatterplot to show the relationship between the variables flavor and acidity. For this plot….
- Plot flavor across acidity
- Facet by country
- Color points by country, and use a non-default color palette
- Include a separate linear trendline for each country.
- Ensure professional labels!
- And, rinse/repeat: Save the plot to a variable, and reveal the plot within the Rmarkdown, and then save the plot to a PNG figure file inside the newly-created figures directory. The name of the file should be yourlastname_question15.png. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
In 1-2 sentences of markdown below the revealed plot, interpret this figure: what is the relationship between flavor and acidity, and is it different/same across the three countries?
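As a reminder of how a faceted scatterplot with a separate linear trendline per group can be built, here is a sketch with iris. The Dark2 palette and the labels are just one possible choice:

```r
library(tidyverse)

iris_trend_plot <- iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +   # linear trendline for each color group
  facet_wrap(vars(Species)) +                # one panel per species
  scale_color_brewer(palette = "Dark2") +    # a non-default color palette
  labs(x = "Sepal length (cm)", y = "Petal length (cm)", color = "Species")

iris_trend_plot
```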

You have arrived at the last question! Congratulations! Now, make a plot that visualizes the three distributions of flavor_acidity_ratio variable for each country. You can make any plot of your choosing as long as it allows you to compare among the three countries, and you can style any way you want as long as it looks professional. Pipe the data into ggplot(), and again….
- Save the plot to a variable, and reveal the plot within the Rmarkdown, and then save the plot to a PNG figure file inside the newly-created figures directory. The name of the file should be yourlastname_question16.png. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
In 1-2 sentences of markdown below the revealed plot, interpret this figure: in this view of the data, how do you interpret the relationship between flavor and acidity across the countries? HINT: Since the ratio is flavor divided by acidity, values generally greater than 1 mean the coffees have more flavor than acidity. Values generally less than 1 mean the coffees have more acidity than flavor. Values roughly equal to 1 mean the coffees have roughly the same acidity and flavor levels.

No pressure but maybe you will want… A couple ggplot() things you might want when making this plot, just in case (not at all required to use these, but sometimes students ask about how to do this so here it is!):

Are you interested in adding a horizontal line or vertical line?

```
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point() + 
  # makes a horizontal line at y = 3
  geom_hline(yintercept = 3, color = "red")

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point() + 
  # makes a vertical line at x = 6
  geom_vline(xintercept = 6, color = "red")
```

Do you remember stat_summary() to show mean +/- SE? It’s a really powerful and flexible function, actually.

```
ggplot(iris, aes(x = Species, y = Sepal.Length)) + 
  geom_violin(fill = "grey80") + 
  # let's have fun!
  stat_summary(geom = "point", # specifies to just make the summary a point
               fun = "mean",   # calculate and show mean (NOT default mean +/- se)
               shape = 21,     # when using this shape, ......
               size = 5,       # size specifies size of the point itself
               stroke = 2,     # stroke specifies size of the point outline  
               color = "black", 
               fill = "orange")
```

Instructions: Homework #6

Data Science for Biologists, Fall 2021