Obtain the homework template from your RStudio Cloud class project by running the following code in the R Console:
library(ds4b.materials) # Load the class library
launch_homework(7) # Launch Homework 7
You must set an RMarkdown theme and code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
Make sure your Rmd knits without errors before submitting. If it does not produce an HTML output, this means it does not knit. DO NOT SKIP THIS STEP! Ensuring code runs without errors is MORE IMPORTANT than writing code in the first place.
As always, you are encouraged to work together and use the class Slack to help each other out, but you must submit YOUR OWN CODE.
In the Homework 7 template, you will see a chunk at the top named paths_and_data. Here, you should do FOUR things (two of which are done for you):
- Define the variable path_to_figures as the path /cloud/project/homeworks/figures/. Your code should incorporate the functions here::here() and file.path().
- Define the variable path_to_datasets as the path /cloud/project/homeworks/datasets/. Your code should incorporate the functions here::here() and file.path().
- Read in the coffee_ratings dataset directly from the URL where it lives on the internet (not from the file). This is done for you! Pretty convenient!
- Use the cleaned dataset, clean_coffee_ratings, for the whole homework. Feel free to change the variable name if you like, as long as it remains informative and is not the same as the raw data coffee_ratings.

All code should be formatted as pipelines with %>%!! This is REQUIRED. In addition, you should always pipe data into ggplot() when plotting. In other words…
# NO!!
ggplot(coffee_ratings) + ...
# YES!
coffee_ratings %>%
  ggplot() + ...
...
You can style your plots however you want as long as you ensure professional labeling!!
In addition, you must create your own RMarkdown organization!! For each question, create a level-4 header with the question number and a named code chunk (no spaces in the name) to hold your answer. Knit frequently to check that the format is as you intend! Question 1’s formatting is started for you.
Use summarize() to calculate the average flavor score (column flavor) for all coffees. Your code should yield a tibble with a column called mean_flavor giving this information.
- Use the dplyr function summarize().

Now calculate the average flavor score for each coffee variety (column variety) separately. Your code should yield a tibble with only these two columns: variety and mean_flavor.
- Use the dplyr function group_by() before using summarize().

Use the dplyr function arrange() to order the resulting tibble.

Repeat the previous calculation of mean_flavor. But this time, only calculate for varieties grown in Mexico that are not NA. Save the resulting tibble to a new variable called mexico_mean_flavor, and print the tibble out.
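As a rough illustration of the kind of pipeline these first questions build up, here is a minimal sketch on a made-up tibble. The data, the column names (shop, drink, rating), and the variable names are all hypothetical and are not part of coffee_ratings; the sketch only shows how filter(), drop_na(), group_by(), summarize(), and arrange() can be chained.

```r
library(dplyr)
library(tidyr)   # provides drop_na()

# Hypothetical toy data (NOT the homework dataset)
toy_ratings <- tibble(
  shop   = c("north", "north", "south", "south", "south", "north"),
  drink  = c("latte", NA, "latte", "mocha", "mocha", "mocha"),
  rating = c(8, 7, 6, 9, 7, 5)
)

toy_summary <- toy_ratings %>%
  filter(shop == "south") %>%               # keep only rows of interest
  drop_na(drink) %>%                        # drop rows only where `drink` is NA
  group_by(drink) %>%                       # one group per drink
  summarize(mean_rating = mean(rating)) %>% # one row per group
  arrange(mean_rating)                      # order rows by the new column

toy_summary
```

Naming a column inside drop_na() limits the NA check to that column, so rows with missing values elsewhere are kept.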
- Use filter() somewhere in this pipeline! You’ll need to carefully consider where in the order of your pipeline the filtering should go: there’s only one place it “can” go for the code to work!
- You will need to remove NA values at some point as well with drop_na(), but you only want to remove rows where the variety is not known, to avoid losing too much information.

Using the mexico_mean_flavor tibble you created in the previous question, make a bar plot of mean flavor across variety. Each bar’s height should be the mean flavor of the given Mexican coffee variety. More instructions…
- There are two geoms in ggplot2 for barplots: geom_bar(), which counts rows of a categorical variable, and geom_col(), which plots bars of a specific height mapped to a variable (literal bars, no counting!). Only one of those is suitable for use here!
- Use the fct_reorder() function (get_help("fct_reorder")) as part of your code so the bars are automatically arranged in order of mean flavor.
- The figure should be saved in the /cloud/project/homeworks/figures directory as hw7_yourlastname_question5.png (i.e. mine would be hw7_spielman_question5.png). To make this happen, you’ll need to specify an argument to ggsave() with the path to the file. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. In addition, ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.

Find all unique combinations of species and country.
- First, select only the columns species and country_of_origin (makes it easier to see and do the next steps).

Think of the dplyr toolkit as a set of tools to be used in the right order when you need them. Sometimes order matters, and sometimes it does not! For this question, re-write your pipeline from the last question BUT SWITCH THE ORDER of select() and distinct(): first find all distinct() rows, then select() your columns of interest, and finally arrange based on country. Look at the result from the previous question (question 6) and from this question, and answer in 1-2 markdown sentences: Which answer correctly identifies all unique combinations of species and country, question 6 or question 7, and WHY? The “why” must consider how the order of your pipeline code affected the result.
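To see that the order of select() and distinct() can change a result, here is a small sketch on a made-up tibble. The data and the names (toy, id, colour, shape, order_a, order_b) are hypothetical; the toy tibble simply has an extra column that makes every full row unique.

```r
library(dplyr)

# Hypothetical toy data: `id` makes every full row different
toy <- tibble(
  id     = 1:4,
  colour = c("red", "red", "blue", "blue"),
  shape  = c("dot", "dot", "dot", "bar")
)

# Order A: pick the columns first, then remove duplicate rows
order_a <- toy %>%
  select(colour, shape) %>%
  distinct() %>%
  arrange(colour)

# Order B: remove duplicate rows first, then pick the columns
order_b <- toy %>%
  distinct() %>%
  select(colour, shape) %>%
  arrange(colour)

nrow(order_a)
nrow(order_b)
```

Comparing the two row counts, and printing both tibbles, shows which columns distinct() was looking at in each case.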
How many bags (column number_of_bags) of Caturra variety coffee does each country make? Your code should yield a tibble with two columns: country_of_origin and total_bags (this is a variable your pipeline will create). To accomplish this…
- Use group_by() and summarize() to determine the total number of bags of coffee each country produces, and find which country makes the most bags. After grouping on country, you will need to add up the total bags for each country: do not count rows per country, but find the sum of number_of_bags per country! You’ll therefore need to summarize with sum() to create the column total_bags.

Next, produce a tibble with the columns country_of_origin and total_bags and only five rows representing the five countries making the most Caturra coffee bags. To do this…
- Arrange by total_bags in descending order using arrange().
- Use the dplyr function slice() to save only the top five rows. Teach yourself this function with get_help("slice"), and work through all examples! Some of the examples contain code exactly pertaining to what you have to do here!
- Save the resulting tibble to a variable called top_five_bags and print it out for this question.

Save top_five_bags to a CSV file that should live in /cloud/project/homeworks/datasets (your defined variable path_to_datasets!). The file name should be yourlastname-top_five_bags.csv (i.e. mine is spielman-top_five_bags.csv; note the dash vs underscore! This is a quick lesson in helpful strategies for writing file names!).
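The steps above (sum within groups, sort descending, keep the top rows, write to a CSV) can be sketched on made-up data as follows. Everything here is hypothetical: the tibble, the column names (region, boxes, total_boxes), the file name, and the use of tempdir() as a stand-in for a real output directory.

```r
library(dplyr)
library(readr)

# Hypothetical toy data (NOT the homework dataset)
toy_orders <- tibble(
  region = c("a", "a", "b", "c", "c", "d"),
  boxes  = c(10, 5, 20, 1, 2, 8)
)

top_two <- toy_orders %>%
  group_by(region) %>%
  summarize(total_boxes = sum(boxes)) %>%  # add up values, not count rows
  arrange(desc(total_boxes)) %>%           # largest totals first
  slice(1:2)                               # keep the first two rows

# A temporary directory stands in for a real output directory
toy_dir  <- tempdir()
toy_file <- file.path(toy_dir, "example-top_two.csv")

write_csv(top_two, toy_file)

# Check that the file exists, then read it back to inspect its contents
file.exists(toy_file)
top_two_again <- read_csv(toy_file)
top_two_again
```

Building the path with file.path() keeps the directory and the file name separate, so the same directory variable can be reused for other files.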
- Use the readr function write_csv(), which takes two arguments: first, the tibble you want to save to a file (top_five_bags); second, the path to the file where the data should be saved. Your code must use the path_to_datasets variable and the file.path() function. You should check to make sure the file was created and contains the right data after you’ve run the code!!!

Even though you already have top_five_bags, for this question you should read in this dataset from the final location where you saved it and simply re-create the dataset top_five_bags for practice! Your code should use the readr function read_csv() and incorporate the variable path_to_datasets as well as the file.path() function to read and save the CSV contents of the file to top_five_bags.

Use the top_five_bags data to make a barplot showing the number of bags for the top five countries (VERY similar to question 5!!). As usual, save this plot to a file inside the /cloud/project/homeworks/figures directory as hw7_yourlastname_question12.png (i.e. mine would be hw7_spielman_question12.png), using the previously defined variable path_to_figures and the function file.path() and setting an appropriate figure width and height.
- To make a lollipop plot, use geom_point() and geom_segment(), and I strongly recommend layering geom_point() ON TOP OF geom_segment(). Have fun exploring!! Note that get_help("geom_segment") contains an example lollipop plot (if you get an error that this doc doesn’t exist, it means you still need to update your materials).

For each of the six countries in Central America, how many coffees have a uniformity score of 10? Your code should yield a tibble with two columns (country_of_origin and number_of_coffees), and one row for each of those six countries. The tibble should further be arranged in order of number_of_coffees (ascending or descending is your choice!). To accomplish this…
- Define an array called central_countries that contains the 6 countries in Central America.
- Filter the data to keep only coffees from countries in the central_countries variable, as well as coffees with a uniformity score of 10.
- Use the dplyr function count()! The column created from count() should be called number_of_coffees (not the default n; see examples in get_help("count")!!)
- Arrange the tibble on number_of_coffees.

Create an array for the South American countries called south_countries, and remember that you already have an array for your Central American countries!
Create another array called south_central that contains all values in both of those arrays, south_countries and central_countries. Simply combine arrays with c() to achieve this. For example:
array1 <- c("a", "b", "c")
array2 <- c("d", "e", "f")
# combine!
array_both <- c(array1, array2)
# reveal to see we have one array with all values:
array_both
## [1] "a" "b" "c" "d" "e" "f"
Now, you are ready for the pipeline:
- Filter the data to keep only countries in the south_central array and only coffees that are Arabica species.
- Create a new column called america_region whose value will be “central” if the country is Central American and “south” otherwise. (Hint: use mutate(), if_else(), and %in%!!!)
- Calculate the median balance for the america_region groupings. The final answer should be a tibble with two rows and two columns named america_region and median_balance.

Plot the balance distributions.
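The filter, mutate, and summarize steps of the pipeline listed above follow a common pattern, sketched here on made-up data. All names in this example (group_one, group_two, both_groups, toy_food, item, score, kind, median_score) are hypothetical and unrelated to the coffee data.

```r
library(dplyr)

# Hypothetical arrays, combined with c()
group_one   <- c("apple", "pear")
group_two   <- c("kale", "leek")
both_groups <- c(group_one, group_two)

# Hypothetical toy data (NOT the homework dataset)
toy_food <- tibble(
  item  = c("apple", "pear", "kale", "leek", "rice", "apple"),
  score = c(4, 6, 1, 3, 9, 8)
)

toy_medians <- toy_food %>%
  filter(item %in% both_groups) %>%   # keep rows whose item is in the array
  mutate(kind = if_else(item %in% group_one, "fruit", "vegetable")) %>%
  group_by(kind) %>%
  summarize(median_score = median(score))

toy_medians
```

if_else() takes a logical condition, the value to use when it is TRUE, and the value to use when it is FALSE, so a two-way label only needs to test membership in one of the arrays.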
- Use the ggforce library with the geom geom_sina(). You will need to load the ggforce library in the setup chunk (not in this chunk!!!) for your final submission.
- Save the figure to the /cloud/project/homeworks/figures directory, named hw7_yourlastname_question15.png.

Find the minimum total_cup_points score for each of the three regions of the United States. In the dataset, these are recorded as three separate countries: "United States", "United States (Puerto Rico)", and "United States (Hawaii)". Your final result should be a tibble with two columns, country_of_origin and minimum_points (containing the minimum cup points), and three rows, one for each region.
- min() is a summary statistics function (get_help("min")) just like mean() or median(). We can summarize with it!
- Define an array called usa_countries to use when filtering with %in%, very similar to the previous questions with Central and South America arrays.
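As a minimal sketch of these two hints together, the example below filters with %in% against an array and then summarizes with min(). The tibble and all names (toy_scores, team, points, teams_of_interest, lowest_points) are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (NOT the homework dataset)
toy_scores <- tibble(
  team   = c("red", "red", "blue", "blue", "green"),
  points = c(12, 7, 9, 15, 3)
)

# Hypothetical array of values to keep
teams_of_interest <- c("red", "blue")

toy_minimums <- toy_scores %>%
  filter(team %in% teams_of_interest) %>%   # keep rows whose team is in the array
  group_by(team) %>%
  summarize(lowest_points = min(points))    # smallest value within each group

toy_minimums
```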
Optional bonus: do not use the %in% logical operator when filtering and do not define that array. Instead, teach yourself the function str_detect() (get_help("str_detect")), which comes from the tidyverse library stringr. This library is used to work with strings. You should use this function as part of your filter() code instead of %in%. Again, this is only an optional bonus!

If you want to tackle this, add this to the template with the header
#### Bonus question
and chunk name bonus_question. Code attempts may get partial credit as long as there are no errors!
Reconsider the data from question #4 and question #5 where you created a barplot of mean flavors in Mexican coffee varieties. For a bonus question, teach yourself to use another geom, geom_text(), to create a modified version of that barplot. This geom needs x, y, and label specifications (which can either be aes() mappings or “just” values for writing labels). The goal is to again make a barplot showing the mean flavors in Mexico, but we also want to include a label above each bar giving the literal mean value.
- Learn about geom_text() with get_help("geom_text")!! You may need to update your materials for this to work as we have seen in class. If you prefer to use geom_label() (as is seen in the introverse help), please feel free!
- Style the geom_text() labels to look professional.
- The figure should be saved in the /cloud/project/homeworks/figures directory as hw7_yourlastname_bonus.png (i.e. mine would be hw7_spielman_bonus.png). To make this happen, you’ll need to specify an argument to ggsave() with the path to the file. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. In addition, ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
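For orientation, here is one possible sketch of a labeled bar plot saved with an explicit path, width, and height. It uses a made-up tibble, and every name in it (toy_means, group, mean_value, toy_plot, toy_path, the file name) is hypothetical; tempdir() stands in for a real figures directory, and the label placement is just one styling choice among many.

```r
library(dplyr)
library(ggplot2)
library(forcats)   # provides fct_reorder()

# Hypothetical toy data (NOT the homework dataset)
toy_means <- tibble(
  group      = c("a", "b", "c"),
  mean_value = c(2.5, 4.1, 3.3)
)

toy_plot <- toy_means %>%
  ggplot(aes(x = fct_reorder(group, mean_value), y = mean_value)) +
  geom_col() +                                        # bar height comes from mean_value
  geom_text(aes(label = mean_value), vjust = -0.5) +  # negative vjust nudges labels above the bars
  labs(x = "Group", y = "Mean value")

# Build the output path from a directory and a file name
toy_path <- file.path(tempdir(), "example_labeled_bars.png")

# Set width and height explicitly (in inches by default)
ggsave(toy_path, toy_plot, width = 6, height = 4)
```

geom_text() inherits the x and y mappings from ggplot(), so only the label mapping needs to be added for each label to sit at the top of its bar.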