Obtain the homework template from your RStudio Cloud class project by running the following code in the R Console:
library(ds4b.materials) # Load the class library
launch_homework(7) # Launch Homework 7

You must set an RMarkdown theme and a code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
Make sure your Rmd knits without errors before submitting. If it does not produce an HTML output, this means it does not knit. DO NOT SKIP THIS STEP! Ensuring code runs without errors is MORE IMPORTANT than writing code in the first place.
As always, you are encouraged to work together and use the class Slack to help each other out, but you must submit YOUR OWN CODE.
In the Homework 7 template, you will see a chunk at the top whose name is paths_and_data. Here, you should do FOUR things (two of which are done for you):

- Define the variable path_to_figures as the path to /cloud/project/homeworks/figures/. Your code should incorporate the functions here::here() and file.path().
- Define the variable path_to_datasets as the path to /cloud/project/homeworks/datasets/. Your code should incorporate the functions here::here() and file.path().
- Read in the coffee_ratings dataset directly from the URL where it lives on the internet (not from the file). This is done for you! Pretty convenient!
- Use the dataset clean_coffee_ratings for the whole homework. Feel free to change the variable name if you like, as long as it remains informative and is not the same as the raw data coffee_ratings.

All code should be formatted as pipelines with %>%!! This is REQUIRED. In addition, you should always pipe data into ggplot() when plotting. In other words…
# NO!!
ggplot(coffee_ratings) + ...
# YES!
coffee_ratings %>%
  ggplot() + ...
  ...

You can style your plots however you want as long as you ensure professional labeling!!
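The path variables requested in the paths_and_data chunk rely on file.path(), which joins directory names with the correct separator. Here is a minimal sketch of that idea, not the homework answer: toy_root and toy_figures_path are hypothetical names, and tempdir() is only a stand-in for the project root that here::here() would supply in the class project.

```r
# Stand-in base directory so this sketch runs anywhere.
# ASSUMPTION: in the class project, here::here() would play this role instead.
toy_root <- tempdir()

# file.path() joins the pieces with "/" so separators are never pasted by hand
toy_figures_path <- file.path(toy_root, "homeworks", "figures")

# Print the assembled path to check it looks as intended
toy_figures_path
```

Building paths this way keeps the directory in one variable, so every later ggsave() or write_csv() call can reuse it with file.path() and a file name.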
In addition, you must create your own RMarkdown organization!! For each question, you should create a level-4 header with the question number and a named code chunk (no spaces in the name) in which to provide the answer. Knit frequently to make sure the format is as you intend! Question 1’s formatting is started for you.
Use summarize() to calculate the average flavor score (column flavor) for all coffees. Your code should yield a tibble with a column called mean_flavor giving this information.

- Use the dplyr function summarize().

Calculate the average flavor score for each coffee variety (column variety) separately. Your code should yield a tibble with only these two columns: variety and mean_flavor.

- Use the dplyr function group_by() before using summarize().
- Use the dplyr function arrange().

Calculate mean_flavor for each variety again, but this time only for varieties grown in Mexico that are not NA. Save the resulting tibble to a new variable called mexico_mean_flavor, and print the tibble out.
- You will need to use filter() somewhere in this pipeline! Carefully consider where in the order of your pipeline the filtering should go: there’s only one place it “can” go for the code to work!
- You will need to remove NA values at some point as well with drop_na(), but you only want to remove rows where the variety is not known, to avoid losing too much information.

Using the mexico_mean_flavor tibble you created in the previous question, make a bar plot of mean flavor across variety. Each bar’s height should be the mean flavor of the given Mexican coffee variety. More instructions…

- There are two geoms in ggplot2 for barplots: geom_bar(), which counts rows of a categorical variable, and geom_col(), which plots bars of a specific height mapped to a variable (literal bars, no counting!). Only one of those is suitable for use here!
- Use the fct_reorder() (get_help("fct_reorder")) function as part of your code so the bars are automatically arranged in order of mean flavor.
- The plot should be saved in the /cloud/project/homeworks/figures directory as hw7_yourlastname_question5.png (e.g., mine would be hw7_spielman_question5.png). To make this happen, you’ll need to specify an argument to ggsave() with the path to the file. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. In addition, ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.

Identify all unique combinations of species and country of origin in the data.

- Keep only the columns species and country_of_origin (this makes it easier to see and do the next steps).

Think of the dplyr toolkit as a set of tools to be used in the right order when you need them. Sometimes order matters, and sometimes it does not! For this question, re-write your pipeline from the last question BUT SWITCH THE ORDER of select() and distinct(): first find all distinct() rows, then select() your columns of interest, and finally arrange based on country. Look at the result from the previous question (question 6) and from this question, and answer in 1-2 markdown sentences: Which answer correctly identifies all unique combinations of species and country, question 6 or question 7, and WHY? The “why” must consider how the order of your pipeline code affected the result.
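The bar-plot hints given for question 5 (a bar geom, fct_reorder(), and ggsave() with file.path()) can be sketched on made-up data. This is an illustration under stated assumptions, not the homework solution: toy_means, toy_figures_dir, and toy_barplot.png are hypothetical names, the values are invented, and tempdir() stands in for the directory stored in path_to_figures.

```r
library(dplyr)    # tibble(), mutate(), and the %>% pipe
library(forcats)  # fct_reorder()
library(ggplot2)

# Made-up data (NOT the coffee data): one row per category, value already computed
toy_means <- tibble(
  fruit = c("apple", "banana", "cherry"),
  mean_score = c(7.2, 8.1, 6.5)
)

# ASSUMPTION: tempdir() stands in for a figures directory such as path_to_figures
toy_figures_dir <- tempdir()

toy_plot <- toy_means %>%
  # Reorder the categories by their value so the bars appear sorted
  mutate(fruit = fct_reorder(fruit, mean_score)) %>%
  # Data is piped in, so the first argument given here is the aesthetic mapping
  ggplot(aes(x = fruit, y = mean_score)) +
  # geom_col() draws each bar at the height stored in the y variable
  geom_col() +
  labs(x = "Fruit", y = "Mean score")

# Build the output path with file.path() and set explicit dimensions
toy_file <- file.path(toy_figures_dir, "toy_barplot.png")
ggsave(toy_file, toy_plot, width = 6, height = 4)
```

Setting width and height explicitly means the saved figure no longer depends on the size of whatever plotting window happens to be open.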
How many bags (column number_of_bags) of Caturra variety coffee does each country make? Your code should yield a tibble with two columns: country_of_origin and total_bags (this is a variable your pipeline will create). To accomplish this…

- Use group_by() and summarize() to determine the total number of bags of coffee each country produces, and find which country makes the most bags. After grouping on country, you will need to add up the total bags for each country: do not count rows per country, but find the sum of number_of_bags per country! You’ll therefore need to summarize with sum() to create the column total_bags.

Create a tibble with the two columns country_of_origin and total_bags and only five rows representing the five countries making the most Caturra coffee bags. To do this…

- Arrange total_bags in descending order using arrange().
- Use the dplyr function slice() to keep only the top five rows. Teach yourself this function with get_help("slice"), and work through all examples! Some of the examples contain code exactly pertaining to what you have to do here!
- Save the result to a variable called top_five_bags and print it out for this question.

Save top_five_bags to a CSV file that should live in /cloud/project/homeworks/datasets (your defined variable path_to_datasets!). The file name should be yourlastname-top_five_bags.csv (e.g., mine is spielman-top_five_bags.csv; note the dash vs underscore! This is a quick lesson in helpful strategies for writing file names!).
- Use the readr function write_csv(), which takes two arguments: first, the tibble you want to save to a file (top_five_bags); second, the path to the file where the data should be saved. Your code must use the path_to_datasets variable and the file.path() function. You should check to make sure the file was created and contains the right data after you’ve run the code!!!

Even though you already have top_five_bags, for this question you should read in this dataset from the final location where you saved it and re-create the dataset top_five_bags for practice! Your code should use the readr function read_csv() and incorporate the variable path_to_datasets as well as the file.path() function to read the CSV contents of the file and save them to top_five_bags.

Use the top_five_bags data to make a barplot showing the number of bags for the top five countries (VERY similar to question 5!!). As usual, save this plot to a file inside the /cloud/project/homeworks/figures directory as hw7_yourlastname_question12.png (e.g., mine would be hw7_spielman_question12.png), using the previously defined variable path_to_figures and the function file.path(), and setting an appropriate figure width and height.

- A lollipop plot uses geom_point() and geom_segment(), and I strongly recommend layering geom_point() ON TOP OF geom_segment(). Have fun exploring!! Note that get_help("geom_segment") contains an example lollipop plot (if you get an error that this doc doesn’t exist, it means you still need to update your materials).

For each of the six countries in Central America, how many coffees have a uniformity score of 10? Your code should yield a tibble with two columns (country_of_origin and number_of_coffees), and one row for each of those six countries. The tibble should further be arranged in order of number_of_coffees (ascending or descending is your choice!). To accomplish this…

- Define an array called central_countries that contains the 6 countries in Central America.
- Filter the data to keep only countries in the central_countries variable, as well as coffees with a uniformity score of 10.
- Use the dplyr function count()! The column created from count() should be called number_of_coffees (not the default n; see examples in get_help("count")!!).
- Arrange the result by number_of_coffees.

Create an array for the South American countries called south_countries, and remember that you already have an array for your Central American countries!
Create another array called south_central that contains all values in both of those arrays, south_countries and central_countries. Simply combine arrays with c() to achieve this. For example:
array1 <- c("a", "b", "c")
array2 <- c("d", "e", "f")
#combine!
array_both <- c(array1, array2)
# reveal to see we have one array with all values:
array_both
## [1] "a" "b" "c" "d" "e" "f"

Now, you are ready for the pipeline:
- Filter the data to keep only countries in the south_central array and only coffees that are Arabica species.
- Create a new column called america_region whose value will be “central” if the country is Central American and “south” otherwise. (Hint: use mutate(), if_else(), and %in%!!!)
- Calculate the median balance for each of the america_region groupings. The final answer should be a tibble with two rows and two columns named america_region and median_balance.

Plot the balance distributions.

- Use the ggforce library with the geom geom_sina(). You will need to load the ggforce library in the setup chunk (not in this chunk!!!) for your final submission.
- Save the plot in the /cloud/project/homeworks/figures directory, named hw7_yourlastname_question15.png.

Find the minimum total_cup_points score for each of the three regions of the United States. In the dataset, these are recorded as three separate countries: "United States", "United States (Puerto Rico)", and "United States (Hawaii)". Your final result should be a tibble with two columns, country_of_origin and minimum_points (containing the minimum cup points), and three rows, one for each region.
- min() is a summary statistics function (get_help("min")) just like mean() or median(). We can summarize with it!
- Define an array called usa_countries to use when filtering with %in%, very similar to the previous questions with the Central and South America arrays.
- Optional bonus: do not use the %in% logical operator when filtering and do not define that array. Instead, teach yourself the function str_detect() (get_help("str_detect")), which comes from the tidyverse library stringr. This library is used to work with strings. You should use this function as part of your filter() code instead of %in%. Again, this is only an optional bonus!

If you want to tackle this, add this to the template with the header
#### Bonus question
and chunk name bonus_question. Code attempts may get partial credit as long as there are no errors!
Reconsider the data from question #4 and question #5 where you created a barplot of mean flavors in Mexican coffee varieties. For a bonus question, teach yourself to use another geom geom_text() to create a modified version of that barplot. This geom needs x, y, and label specifications (which can either be aes() mappings or “just” values for writing labels). The goal is to again make a barplot showing the mean flavors in Mexico, but we also want to include a label above each bar giving the literal mean value.
- Learn about geom_text() with get_help("geom_text")!! You may need to update your materials for this to work, as we have seen in class. If you prefer to use geom_label() (as is seen in the introverse help), please feel free!
- Style the geom_text() labels to look professional.
- The plot should be saved in the /cloud/project/homeworks/figures directory as hw7_yourlastname_bonus.png (e.g., mine would be hw7_spielman_bonus.png). To make this happen, you’ll need to specify an argument to ggsave() with the path to the file. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. In addition, ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
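To see how the x, y, and label specifications of geom_text() fit together, here is a small sketch on made-up data. It is only an illustration, not the bonus answer: toy_means and toy_labeled_plot are hypothetical names, the values are invented, and the vjust nudge is one possible styling choice among many.

```r
library(dplyr)    # tibble() and the %>% pipe
library(ggplot2)

# Made-up data (NOT the coffee data)
toy_means <- tibble(
  fruit = c("apple", "banana", "cherry"),
  mean_score = c(7.2, 8.1, 6.5)
)

toy_labeled_plot <- toy_means %>%
  ggplot(aes(x = fruit, y = mean_score)) +
  geom_col() +
  # x and y are inherited from ggplot(); label says which text to draw.
  # A negative vjust moves the text upward, placing it just above each bar.
  geom_text(aes(label = mean_score), vjust = -0.5) +
  labs(x = "Fruit", y = "Mean score")
```

Because the label is mapped inside aes(), each bar gets its own value; a fixed string placed outside aes() would instead print the same text on every bar.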