The actual instructions

Part 1: Defining paths you will need for the homework

In the Homework 7 template, you will see a chunk at the top whose name is paths_and_data. Here, you should do FOUR things (two of which are done for you):

Define a path to the directory /cloud/project/homeworks/figures/. Your code should incorporate the functions here::here() and file.path().
Define a path to the directory /cloud/project/homeworks/datasets/. Your code should incorporate the functions here::here() and file.path().
Read in the coffee_ratings dataset directly from the URL where it lives on the internet (not from the file). This is done for you! Pretty convenient!
Since we know there is an error in this dataset where 1 row has a point score of 0, let’s remove it before proceeding. This is also done for you! Pretty convenient! You should use this tibble, clean_coffee_ratings, for the whole homework. Feel free to change the variable name if you like as long as it remains informative and not the same as the raw data coffee_ratings.
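
As a rough sketch of how here::here() and file.path() work together, here is a path built to a made-up directory. The directory names analysis and reports and the variable path_to_reports are hypothetical and only for illustration, and the sketch assumes the here package is installed.

```r
# A minimal sketch, assuming the here package is installed.
# "analysis" and "reports" are hypothetical directory names.
path_to_reports <- file.path(here::here(), "analysis", "reports")

# Reveal the path. file.path() only builds the string;
# it does not check whether the directory exists.
path_to_reports
```

here::here() supplies the top-level project directory, so the same code should work no matter which subdirectory your Rmd lives in.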

Part 2: Questions

All code should be formatted as pipelines with %>%!! This is REQUIRED. In addition, you should always pipe data into ggplot() when plotting. In other words…

```
# NO!!
ggplot(coffee_ratings) + ...

# YES!
coffee_ratings %>%
  ggplot() + ...
  ...
```

You can style your plots however you want as long as you ensure professional labeling!!

In addition, you must create your own Rmarkdown organization!! For each question, you should create a level-4 header with the question number and a named code chunk (no spaces in the chunk name) to provide the answer in. Make sure to knit frequently to make sure the format is as you intend! Question 1’s formatting is started for you.

Use summarize() to calculate the average flavor score (column flavor) for all coffees. Your code should yield a tibble with a column called mean_flavor giving this information.
- Hint: use the dplyr function summarize().

Modify your code from the previous question to instead calculate the mean flavor of each variety (column variety) separately. Your code should yield a tibble with only these two columns: variety and mean_flavor.
- Hint: use the dplyr function group_by() before using summarize().

Modify your code from the previous question to again calculate mean flavors across varieties, but now arrange the final tibble in ascending order of mean flavor.
- Hint: include the dplyr function arrange().
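
The group_by() then summarize() then arrange() pattern from the last three questions can be sketched on a different dataset. This example uses iris, which is built into R, as a stand-in; the variable iris_means and the column mean_sepal_length are made up for illustration.

```r
library(dplyr)

# iris is a dataset built into R; it stands in for the coffee data here.
iris_means <- iris %>%
  # one group per species
  group_by(Species) %>%
  # collapse each group to a single row holding its mean
  summarize(mean_sepal_length = mean(Sepal.Length)) %>%
  # ascending order is the default for arrange()
  arrange(mean_sepal_length)

# Reveal: one row per species, two columns
iris_means
```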

Yet again, modify your code from the previous question to again calculate mean flavors across varieties where the final output is arranged in order of mean_flavor. But this time, only calculate for varieties grown in Mexico that are not NA. Save the resulting tibble to a new variable called mexico_mean_flavor, and print the tibble out.
- Hint: you will need to use filter() somewhere in this pipeline! You’ll need to carefully consider where in the order of your pipeline the filtering should go - there’s only one place it “can” go for the code to work!
- Another hint: You’ll need to remove NA values at some point as well with drop_na(), but you only want to remove rows where the variety is not known to avoid losing too much information.
- Final hint: If done correctly, the resulting tibble will have ten rows.
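
To see what it means to remove NAs from only one column, here is a small sketch on a made-up tibble. The tibble snacks and its columns kind and score are hypothetical; the point is only that drop_na() accepts column names.

```r
library(dplyr)
library(tidyr)

# A made-up tibble: one NA in `kind`, one NA in `score`
snacks <- tribble(
  ~kind,    ~score,
  "chip",   7,
  NA,       9,
  "cookie", NA,
  "chip",   8
)

# Naming a column in drop_na() removes only the rows where THAT column is NA
snacks_known_kind <- snacks %>%
  drop_na(kind)

# The "cookie" row survives even though its score is NA
snacks_known_kind
```

Calling drop_na() with no column names would instead remove every row that has an NA anywhere, which usually throws away more data than you intend.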

Using the mexico_mean_flavor tibble you created in the previous question, make a bar plot of mean flavor across variety. Each bar’s height should be the mean flavor of the given Mexican coffee variety. More instructions…
- Remember that there are two functions in ggplot2 for barplots: geom_bar(), which counts rows of a categorical variable, and geom_col(), which plots bars of a specific height mapped to a variable - literal bars, no counting! Only one of those is suitable for use here!
- Again, you can style the plot any way you like as long as the plot adheres to best practices, including professional labeling without underscores.
- For some bonus points, make a horizontal barplot instead of a vertical one.
- For even more bonus points, learn and use the fct_reorder() (get_help("fct_reorder")) function as part of your code so the bars are automatically arranged in order of mean flavor.
- You must save the plot to a variable, reveal the plot in the Rmarkdown, AND save it to a file. The final file should be inside the /cloud/project/homeworks/figures directory and saved as hw7_yourlastname_question5.png (i.e. mine would be hw7_spielman_question5.png). To make this happen, you’ll need to specify an argument to ggsave() with the path to the file. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. In addition, ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
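
As a sketch of the save-reveal-write workflow, the example below plots made-up data and writes the figure to a temporary directory. Everything named here (pet_weights, pet_plot, path_to_output, and the file name) is hypothetical, and tempdir() is used only so the example runs anywhere; it is not where the homework figure belongs.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Made-up summary data: one row per category, heights already calculated
pet_weights <- tibble(
  pet = c("cat", "dog", "rabbit"),
  mean_weight = c(4.5, 20, 2)
)

pet_plot <- pet_weights %>%
  # fct_reorder() orders the pets by mean_weight;
  # mapping the category to y makes the bars horizontal
  ggplot(aes(x = mean_weight, y = fct_reorder(pet, mean_weight))) +
  geom_col() +
  labs(x = "Mean weight (kg)", y = "Pet")

# Reveal the plot
pet_plot

# Hypothetical output location: a temporary directory
path_to_output <- tempdir()
ggsave(file.path(path_to_output, "pet_weights.png"), pet_plot, width = 6, height = 4)
```

By default ggsave() reads width and height in inches, so 6 by 4 gives a landscape figure.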

Use wrangling to find all distinct combinations of species and country in this dataset. To do this…
- First, select only the relevant columns which are species and country_of_origin (makes it easier to see and do the next steps)
- Second, keep only distinct rows
- Remove any NAs that remain in the data
- Arrange rows alphabetically based on country for nicer viewing

This question is prompting you to think about how important it is to treat your dplyr toolkit as a set of tools to be used in the right order when you need them. Sometimes order matters, and sometimes it does not! For this question, re-write your pipeline from the last question BUT SWITCH THE ORDER of select() and distinct() - first find all distinct() rows, then select() your columns of interest, and finally arrange based on country. Look at the result from the previous question (question 6) and from this question (question 7), and answer in 1-2 markdown sentences: Which answer correctly identifies all unique combinations of species and country - question 6 or question 7 - and WHY? The “why” must consider how the order of your pipeline code affected the result.
- Hint: You may/will want to run each pipeline one line at a time to compare!!

Use your wrangling skills to answer this question: How many coffee bags (number_of_bags) of Caturra variety coffee does each country make? Your code should yield a tibble with two columns: country_of_origin and total_bags (this is a variable your pipeline will create). To accomplish this…
- First, filter to the rows that are needed for calculations (Caturra varieties only)
- Use group_by() and summarize() to determine the total number of bags of coffee each country produces, and find which country makes the most bags. After grouping on country, you will need to add up the total bags for each country - not count rows per country, but find the sum of the number_of_bags per country! You’ll therefore need to summarize with sum() to create the column total_bags.

Modify code from the previous question to answer this question: Which five countries produce the most Caturra variety coffee bags? Your code should yield a tibble with two columns: country_of_origin and total_bags, and only five rows representing the five countries making the most Caturra coffee bags. To do this…
- Add a step onto the pipeline to arrange total_bags in descending order using arrange().
- Then, use the new and exciting!! dplyr function slice() to save only the top five rows. Teach yourself this function with get_help("slice"), and work through all examples! Some of the examples contain code exactly pertaining to what you have to do here!
- You should save the result to a tibble called top_five_bags and print it out for this question.
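
The difference between summing a column and counting rows, followed by keeping only the top rows, can be sketched on made-up data. The tibble orders, its columns, and the result top_two_shops are all hypothetical.

```r
library(dplyr)

# Made-up data: several orders per shop, each with a number of boxes
orders <- tribble(
  ~shop, ~boxes,
  "A",   10,
  "A",   5,
  "B",   30,
  "C",   1,
  "C",   2
)

top_two_shops <- orders %>%
  group_by(shop) %>%
  # sum() adds up the boxes within each shop; it does not count rows
  summarize(total_boxes = sum(boxes)) %>%
  # desc() puts the largest totals first
  arrange(desc(total_boxes)) %>%
  # keep rows 1 and 2 only
  slice(1:2)

# Shop B (30 boxes) comes first, then shop A (15 boxes)
top_two_shops
```

Note that slice() keeps rows by position, so it only gives the "top" rows if you arrange first.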

For this question, teach yourself how to write datasets to CSV files. You should save the dataset top_five_bags to a CSV file that should live in /cloud/project/homeworks/datasets (your defined variable path_to_datasets!). The file name should be yourlastname-top_five_bags.csv (i.e. mine is spielman-top_five_bags.csv; note the dash vs underscore! This is a quick lesson in helpful strategies for writing file names!).
- To accomplish this, you should use the readr function write_csv() which takes two arguments: first, the tibble you want to save to a file (top_five_bags). Second, the path to the file where the data should be saved. Your code must use the path_to_datasets variable and the file.path() function. You should check to make sure the file was created and contains the right data after you’ve run the code!!!

Now that you’ve written the CSV to a file in question 10, let’s remind ourselves how to read in data. Even though this dataset is already inside R as top_five_bags, for this question you should read in this dataset from the location where you saved it and re-create the dataset top_five_bags for practice! Your code should use the readr function read_csv() and incorporate the variable path_to_datasets as well as the file.path() function to read the CSV contents of the file and save them to top_five_bags.
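
A write-then-read round trip can be sketched as follows. The tibble toy_data, the file name, and path_to_output are hypothetical, and a temporary directory stands in for a real datasets directory so the example runs anywhere.

```r
library(readr)
library(tibble)

# A made-up tibble to save
toy_data <- tibble(letter = c("a", "b"), number = c(1, 2))

# Hypothetical location: a temporary directory
path_to_output <- tempdir()
toy_file <- file.path(path_to_output, "lastname-toy_data.csv")

# First argument: the tibble. Second argument: the path to the file.
write_csv(toy_data, toy_file)

# Check the file was created, then read it back in
file.exists(toy_file)
toy_data_again <- read_csv(toy_file)
toy_data_again
```

Building the full file path once and saving it to a variable means the write and read steps are guaranteed to point at the same file.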

After that reading/writing file detour, let’s get back on track to plotting. For this question, use the top_five_bags data to make a barplot showing the number of bags for the top five countries (VERY similar to question 5!!). As usual, save this plot to a file inside the /cloud/project/homeworks/figures directory as hw7_yourlastname_question12.png (i.e. mine would be hw7_spielman_question12.png), using the previously defined variable path_to_figures and the function file.path() and setting an appropriate figure width and height.
- Alternatively, for BONUS POINTS, don’t make a barplot but make a lollipop plot instead. This is basically a barplot except it looks like a lollipop instead of a bar! You will need to combine the geoms geom_point() and geom_segment(), and I strongly recommend layering geom_point() ON TOP OF geom_segment(). Have fun exploring!! Note that get_help("geom_segment") contains an example lollipop plot (if you get an error that this doc doesn’t exist, it means you still need to update your materials.)
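
One possible way to layer the two geoms is sketched below on made-up data; it is an illustration of the general idea, not necessarily the same as the example in the course help docs. The tibble pet_weights and the variable lollipop_plot are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Made-up data: one value per category
pet_weights <- tibble(
  pet = c("cat", "dog", "rabbit"),
  mean_weight = c(4.5, 20, 2)
)

lollipop_plot <- pet_weights %>%
  ggplot(aes(x = pet, y = mean_weight)) +
  # the "stick": a segment from 0 up to the value
  geom_segment(aes(xend = pet, y = 0, yend = mean_weight)) +
  # the "candy": added after the segment, so it is drawn on top of it
  geom_point(size = 4) +
  labs(x = "Pet", y = "Mean weight (kg)")

# Reveal the plot
lollipop_plot
```

Layers are drawn in the order they are added, which is why geom_point() comes second.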

Consider only countries in Central America (Guatemala, Nicaragua, Panama, Honduras, Costa Rica, and El Salvador). How many coffees in each of those countries have a uniformity score of 10? Your code should yield a tibble with two columns (country_of_origin and number_of_coffees), and one row for each of those six countries. The tibble should further be arranged in order of number_of_coffees (ascending or descending is your choice!). To accomplish this…
- First, create an array called central_countries that contains the 6 countries in Central America
- Then, write a pipeline to filter to keep only these countries, where your code uses that central_countries variable, as well as coffees with a uniformity score of 10.
- At this stage in the pipeline, you have found all the relevant rows. Now you need to figure out how many rows there are per country, which you can accomplish with the dplyr function count()! The column created from count() should be called number_of_coffees (not the default n - see examples in get_help("count")!!)
- Finally, now that everything is calculated, arrange in order of number_of_coffees.
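
Filtering with %in% and then counting rows with a custom column name can be sketched like this. The tibble orders, the array shops_of_interest, and the column number_of_orders are made up for illustration.

```r
library(dplyr)

# Made-up data: one row per order
orders <- tribble(
  ~shop, ~rating,
  "A",   10,
  "A",   10,
  "B",   10,
  "B",   7,
  "C",   10
)

# The values we want to keep
shops_of_interest <- c("A", "B")

perfect_orders <- orders %>%
  # a comma inside filter() means AND: both conditions must be TRUE
  filter(shop %in% shops_of_interest, rating == 10) %>%
  # count rows per shop; `name` replaces the default column name n
  count(shop, name = "number_of_orders") %>%
  arrange(number_of_orders)

# Shop B has 1 perfect order, shop A has 2; shop C was filtered out
perfect_orders
```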

Compare all Central American Arabica coffees to South American Arabica (Brazil, Colombia, Peru, and Ecuador) coffees to answer the question: Whose Arabica coffees have a higher median balance, Central or South America? To accomplish this…
- Create an array for the south american countries called south_countries, and remember that you already have an array for your central american countries!
- Create another array called south_central that contains all values in both of those arrays, south_countries and central_countries. Simply combine arrays with c() to achieve this. For example:
```
array1 <- c("a", "b", "c")
array2 <- c("d", "e", "f")

#combine!
array_both <- c(array1, array2)

# reveal to see we have one array with all values:
array_both
```
```
## [1] "a" "b" "c" "d" "e" "f"
```
- Now, you are ready for the pipeline:
  - Keep only countries that appear in your south_central array and only coffees that are Arabica species.
  - Create a new column called america_region whose value will be “central” if the country is central american and “south” otherwise. (Hint: use mutate(), if_else(), and %in%!!!)
  - Find the median balance across america_region groupings. The final answer should be a tibble with two rows and two columns named america_region and median_balance.
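
The mutate(), if_else(), and %in% combination can be sketched on made-up data. The tibble foods, the array fruits, and the columns food_type and median_price are all hypothetical.

```r
library(dplyr)

# An array of the values that belong to the first category
fruits <- c("apple", "pear")

# Made-up data
foods <- tribble(
  ~food,    ~price,
  "apple",  1,
  "pear",   3,
  "carrot", 2,
  "leek",   4,
  "kale",   6
)

median_prices <- foods %>%
  # if_else(condition, value if TRUE, value if FALSE)
  mutate(food_type = if_else(food %in% fruits, "fruit", "vegetable")) %>%
  group_by(food_type) %>%
  summarize(median_price = median(price))

# fruit: median of 1 and 3 is 2; vegetable: median of 2, 4, and 6 is 4
median_prices
```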

For this question, write code that is very similar to the previous question 14. You want to visualize the full distributions of balance values for each region in America. This shows another way we can get a sense of balance values - rather than just looking at their medians, how about their full distributions? To accomplish this…
- Follow the same steps to filter and mutate. Since you are not going to proceed to summarize, no need to group.
- Then, directly pipe into ggplot to make one of a boxplot, jitter plot, violin plot, overlapping density plot, or faceted histogram to see the two balance distributions.
  - For bonus points, make a sina plot using the ggforce library with the geom geom_sina(). You will need to load the ggforce library in the setup chunk (not in this chunk!!!) for your final submission.
  - For more bonus points, layer two geoms on top of each other in a way that nicely conveys the data. For example, a jitter on top of a boxplot? A sina on top of a violin? Enjoy!
- Again, save this plot to a file in the usual way, in the /cloud/project/homeworks/figures directory named hw7_yourlastname_question15.png.
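
Layering two geoms to show full distributions might look something like the sketch below, which again uses the built-in iris dataset as a stand-in. The variable iris_plot is hypothetical, and the specific width and alpha values are just one reasonable choice.

```r
library(dplyr)
library(ggplot2)

iris_plot <- iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  # hide the boxplot's own outlier points so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # jittered raw points, drawn on top of the boxes
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(x = "Species", y = "Sepal length (cm)")

# Reveal the plot
iris_plot
```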

Congratulations, you’ve reached the last question! The goal for this question is to find the minimum total_cup_points score for each of the three regions of the United States. In the dataset, these are recorded as three separate countries: "United States", "United States (Puerto Rico)", and "United States (Hawaii)". Your final result should be a tibble with two columns country_of_origin and minimum_points (containing the minimum cup points) and three rows, one for each region.
- Hint: You’ll need to filter to countries of interest, group by country, and summarize the minimum value of points. min() is a summary statistics function (get_help("min")) just like mean() or median(). We can summarize with it!
- You MUST define an array containing all the USA country names called usa_countries when filtering with %in%, very similar to the previous questions with central and south america arrays.
  - Alternatively for optional bonus points, do not use the %in% logical operator when filtering and do not define that array. Instead, teach yourself the function str_detect() (get_help("str_detect")), which comes from the tidyverse library stringr. This library is used to work with strings. You should use this function as part of your filter() code instead of %in%. Again this is only optional bonus!
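
As a small sketch of how str_detect() behaves inside filter(), here is an example on made-up place names. The tibble places and the result springfields are hypothetical.

```r
library(dplyr)
library(stringr)

# Made-up data
places <- tibble(
  place = c("Springfield (North)", "Springfield", "Shelbyville")
)

springfields <- places %>%
  # str_detect() returns TRUE for each string that contains the pattern
  filter(str_detect(place, "Springfield"))

# Both Springfield rows are kept; Shelbyville is removed
springfields
```

The pattern is treated as a regular expression, so characters like parentheses have special meaning if you include them in the pattern itself.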

Optional bonus question!

If you want to tackle this, add this to the template with the header #### Bonus question and chunk name bonus_question. Code attempts may get partial credit as long as there are no errors!

Reconsider the data from question #4 and question #5 where you created a barplot of mean flavors in Mexican coffee varieties. For a bonus question, teach yourself to use another geom geom_text() to create a modified version of that barplot. This geom needs x, y, and label specifications (which can either be aes() mappings or “just” values for writing labels). The goal is to again make a barplot showing the mean flavors in Mexico, but we also want to include a label above each bar giving the literal mean value.

Hint: Learn about geom_text() with get_help("geom_text")!! You may need to update your materials for this to work as we have seen in class. If you prefer to use geom_label() (as is seen in the introverse help), please feel free!
Another hint: You may also want to tweak the size of the geom_text() labels to look professional.
Again, you must save the plot to a variable, reveal the plot in the Rmarkdown, AND save it to a file. The final file should be inside the /cloud/project/homeworks/figures directory and saved as hw7_yourlastname_bonus.png (i.e. mine would be hw7_spielman_bonus.png). To make this happen, you’ll need to specify an argument to ggsave() with the path to the file. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. In addition, ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
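
To get a feel for geom_text(), here is a sketch on made-up data that writes each bar's value just above the bar. The tibble pet_weights and the variable labeled_plot are hypothetical, and the vjust and size values are one choice among many.

```r
library(dplyr)
library(ggplot2)

# Made-up summary data
pet_weights <- tibble(
  pet = c("cat", "dog", "rabbit"),
  mean_weight = c(4.5, 20, 2)
)

labeled_plot <- pet_weights %>%
  ggplot(aes(x = pet, y = mean_weight)) +
  geom_col() +
  # x and y are inherited from ggplot(); label is mapped to the value itself.
  # A negative vjust nudges the text above the top of each bar.
  geom_text(aes(label = mean_weight), vjust = -0.5, size = 4) +
  labs(x = "Pet", y = "Mean weight (kg)")

# Reveal the plot
labeled_plot
```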

Instructions: Homework #7

Data Science for Biologists, Fall 2021

Complete the template Rmd and submit to Canvas on Friday 10/29/21 by 11:59 PM
