Obtain the homework template from your RStudio Cloud class project by running the following code in the R Console:
library(ds4b.materials) # Load the class library
launch_homework(6) # Launch Homework 6
You must set an RMarkdown theme and code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
Make sure your Rmd knits without errors before submitting. If it does not produce an HTML output, this means it does not knit. DO NOT SKIP THIS STEP! Ensuring code runs without errors is MORE IMPORTANT than writing code in the first place.
As always, you are encouraged to work together and use the class Slack to help each other out, but you must submit YOUR OWN CODE.
For this homework, you will be…
Using the dplyr and tidyr packages for data wrangling. These packages are part of the core tidyverse and are loaded when you run library(tidyverse).
Using the %>% operator a whole bunch (get_help("pipe")).
… activities/rmarkdown_demonstration.Rmd document! Don’t have the document? Load the ds4b.materials library and run launch_activity("rmarkdown"), and see the class recording from Thursday 9/16/21.
On the class website, there is a CSV file you can download called coffee_ratings.csv, which contains the dataset documented here. Take a few minutes to read about this dataset before you begin.
To begin the homework, you need to take the following steps:
Create two new directories inside the homeworks/ directory called figures/ and datasets/. In other words, you are creating directories that will end up having these full paths:
/cloud/project/homeworks/figures/
/cloud/project/homeworks/datasets/
Download the dataset file from the class website. It may download as coffee_ratings.txt instead of coffee_ratings.csv. THAT’S OK!! Use whatever extension it downloaded with when you refer to the file name! It will not affect the file or the procedure to read it in. If you are feeling fancy, you can always rename it to coffee_ratings.csv once it’s in the cloud (keep reading!) using the “Rename” button.
Upload the file into /cloud/project/homeworks/datasets/
In the Homework 6 template, you will see a chunk at the top whose name is paths_and_data. Here, we want to define paths to certain directories and files this homework uses. However, there’s a major catch: When we are working interactively in Console, the default working directory in RStudio Cloud is /cloud/project/. But, when you click the knit button for an Rmd file, the knitting engine assumes the working directory is wherever the Rmd file is saved, which in this case is /cloud/project/homeworks/. This leads to very frustrating conflicts that bother pretty much everyone working in R!!
The solution arrives to us from the amazing {here} package, which provides help clarifying paths in these annoying situations. The package has a function called here() (same name!) which is typically called as here::here() (following syntax package::function()) that automatically finds the “top-level directory” of an R project on whichever machine it is being run on!! It’s super helpful.
Returning to the paths_and_data chunk, you will need to create three variables using the file.path() function:
path_to_datasets, which should be the path to the homeworks/datasets/ directory (this one has been done for you!)
path_to_figures, which should be the path to the homeworks/figures/ directory
coffee_ratings_file, which should be the path to the coffee_ratings.csv (or coffee_ratings.txt if that’s how yours saved) file you saved in the new homeworks/datasets/ directory. This variable definition must use the path_to_datasets variable.
Then, in this chunk, you should read in the coffee ratings dataset with read_csv() and save it to a variable coffee_ratings for use in the assignment.
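To see how file.path() builds one path from another, here is a small sketch you can paste into Console. It is not the homework answer. The names project_root, path_to_demo_dir, and demo_file are made up for illustration, and tempdir() stands in for the project's top-level directory (the role here::here() plays in the homework). It assumes the readr package is installed.

```r
library(readr)

# Stand-in for the project's top-level directory (assumption for this demo;
# in the homework, here::here() plays this role)
project_root <- tempdir()

# Build a directory path piece by piece; file.path() inserts the separators
path_to_demo_dir <- file.path(project_root, "homeworks", "demo_datasets")

# Build the file path FROM the directory variable instead of retyping it
demo_file <- file.path(path_to_demo_dir, "demo.csv")

# Create the directory and write a tiny CSV so there is something to read
dir.create(path_to_demo_dir, recursive = TRUE, showWarnings = FALSE)
write_csv(head(iris), demo_file)

# Read the file back in using the path variable
demo_data <- read_csv(demo_file)
demo_data
```

Because demo_file is defined from path_to_demo_dir, changing the directory in one place updates every path that depends on it.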
All code should be formatted as pipelines with %>%!! This is REQUIRED.
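As a reminder of what a pipeline looks like, here is a minimal sketch using the built-in iris dataset (not the coffee data, and not an answer to any question). It keeps rows that meet two conditions at once and then counts them; the variable name n_big_setosa is just for this demo.

```r
library(dplyr)

# Each step goes on its own line, connected with %>%
n_big_setosa <- iris %>%
  # two arguments to filter(): BOTH conditions must be TRUE for a row to stay
  filter(Species == "setosa", Sepal.Length > 5) %>%
  # count how many rows are left
  nrow()

n_big_setosa
```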
Use filter() to subset the data to contain only “Arabica” (column species) coffees in the dataset. Your code should print the resulting tibble from this operation.
Use filter() to subset the data to contain only coffee from Ethiopia (column country_of_origin) which has a total_cup_points score greater than 87. For this question, you’ll want to provide two arguments to filter() to achieve both conditions. Your code should print the resulting tibble from this operation.
Using a similar strategy to the previous question, answer the question: How many coffee farms in Guatemala have a total_cup_points score less than 70? The code should reveal a single number telling you how many rows there are in the subsetted data, aka how many farms. To achieve this, you should pipe into the function nrow() after filtering.
Instead of nrow(), you can use the dplyr function tally() if you want! Look it up with get_help("tally"). If you use this approach, your code will print a tibble containing the number answering the question - that’s totally fine!!
… variety column) are from Brazil?
… species!) coffees are grown on the farm named “finca medina”? (Hint: there may not be that many……….)
Make a plot that shows the relationship between total_cup_points and balance, where points are colored by country_of_origin and balance is plotted across total_cup_points.
For this question, and all other ggplot questions, you must pipe the dataset into the ggplot() function rather than directly providing the argument. In other words…
# NO!!
ggplot(coffee_ratings) + ...
# YES!
coffee_ratings %>%
  ggplot() + ...
...
Hint: Yes, this plot is trash. Don’t put too much effort into de-trashing it. It’s just trash.
… total_cup_points (and other point scores) are all 0! This is an error in the dataset where something wasn’t entered in properly. By exploring the data visually, we were able to detect this problem.
… NAs!! Having NA values isn’t necessarily something wrong with the data, but it’s not always helpful to include those data points when visualizing.
Use the filter() function to remove rows whose total_cup_points equals 0 (aka, keep only rows where total_cup_points does not equal 0 - see get_help("logical") for help remembering the operator to use), and remove all rows whose country_of_origin is NA using the function drop_na(). Save this tibble to clean_coffee_ratings, and print it out. Note that you will use this tibble for the rest of the assignment!!
… clean_coffee_ratings to these two countries of your choosing, and then pipe into ggplot(), where you should again visualize balance across total_cup_points. Follow these specific instructions for making this plot:
Use a colorblind-friendly non-default palette for the color mapping and ensure more professional (i.e., no underscores) axis labels.
Set a non-default theme of your choosing
Save the plot to a variable, and reveal the plot within the Rmarkdown.
Then, save the plot to a PNG figure file inside the newly-created figures directory. The name of the file should be yourlastname_question8.png (e.g., mine would be spielman_question8.png). To make this happen, you’ll need to specify an argument to ggsave() with the path to the file. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code.
You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
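If the ggsave() mechanics feel fuzzy, here is a hedged sketch with the built-in iris dataset. The directory demo_figures_dir and the file name demo_plot.png are invented for this demo and live in a temporary directory, so nothing in your project is touched. The width and height values (in inches, the ggsave() default unit) are just one reasonable choice.

```r
library(dplyr)
library(ggplot2)

# Hypothetical output directory for this demo only
demo_figures_dir <- file.path(tempdir(), "demo_figures")
dir.create(demo_figures_dir, recursive = TRUE, showWarnings = FALSE)

# Pipe the data into ggplot() and save the plot to a variable
demo_plot <- iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Reveal the plot
demo_plot

# Build the output file path from the directory variable with file.path()
demo_plot_file <- file.path(demo_figures_dir, "demo_plot.png")

# Save with explicit width and height rather than relying on defaults
ggsave(demo_plot_file, demo_plot, width = 6, height = 4)
```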
Hint: The most convenient way to accomplish this filtering is with the %in% logical operator (get_help("logical"))! Explore this demo code (copy/paste into Console!) to see how this could be accomplished:
# This works!
iris %>%
  filter(Species %in% c("setosa", "versicolor")) # keep only rows where Species is in that vector
# THIS WILL NOT WORK PROPERLY with ==!! It will look like it works, but it will not actually work!
# It doesn't work since Species never equals THAT VECTOR. Species is only one word.
# (== recycles the vector and compares element by element, so matching rows get silently dropped.)
iris %>%
  filter(Species == c("setosa", "versicolor"))
… species and variety are in this dataset? We’ll answer this question using some data wrangling. Take the following steps to create a pipeline from the clean_coffee_ratings dataset (not coffee_ratings!!):
Use select() to keep only columns variety and species.
Use distinct() (a new dplyr function!) to retain only unique rows. Here it doesn’t need any arguments - it’s just distinct()!
Remove NA values with drop_na().
Use nrow() or tally() to arrive at the final product.
Time to learn a new geom: geom_col(). Previously, we have learned that geom_bar() will plot (automatically!) the counts for a categorical variable. What if we literally just want to draw bars of a certain height where both X and Y-axes are mapped to a variable? We can use the other function geom_col() for this.
To see how this might work, we’ll also learn the dplyr function count(), which will count all categories in a categorical variable. Copy and paste this code as the first part of your answer’s pipeline:
clean_coffee_ratings %>%
  count(variety)
You’ll see a tibble with two columns: variety (all those categories) and a new column n which contains their count: How many times does that variety appear in the dataset?
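To see count() and geom_col() working together on a different dataset, here is a small sketch using mpg (which ships with ggplot2). It is only an illustration, not the answer to this question; the variable name class_counts_plot and the axis label wording are my own choices.

```r
library(dplyr)
library(ggplot2)

class_counts_plot <- mpg %>%
  # one row per vehicle class, with the count in a new column n
  count(class) %>%
  # map the category to X and the precomputed count to Y
  ggplot(aes(x = class, y = n)) +
  # geom_col() draws bars exactly as tall as the y values it is given
  geom_col() +
  labs(x = "Vehicle class", y = "Number of cars")

class_counts_plot
```

Compare with geom_bar(), which would do the counting for you and only needs an x aesthetic.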
Then continue building a pipeline with the next steps:
Use filter() to keep only varieties that appear at least 50 times, aka whose n value is >= 50, and remove all NA values.
Pipe into ggplot(), where you should specify variety on the X-axis, n on the Y-axis, and use the geom_col() geom.
… n is not professional)!!
… figures directory. The name of the file should be yourlastname_question10.png. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
… forcats function fct_reorder() into your code so bars are shown in order of counts. See get_help("fct_reorder") or here.
… clean_coffee_ratings, use mutate() to make a new column in the dataset called moisture_decimal which contains the moisture score divided by 100. Then use select() to keep only the columns moisture and moisture_decimal.
Use mutate() to create a new column in the clean_coffee_ratings dataset called cup_rank where coffees with < 82.5 total_cup_points have the value “low”, and coffees with >= 82.5 (aka, otherwise) total_cup_points have the value “high”. Then use select() to keep only the columns total_cup_points, cup_rank, and aroma. Save this tibble to a new variable called coffee_ranks and print the tibble.
This is the dramatic and exciting return of the ifelse() function! Note that dplyr has its own version of this function, if_else(), which you can use instead (see the difference with get_help("if_else")). Either of those functions is fine here.
Engage with (copy/paste into Console and explore it!!) this example to see how you can approach this question. This is a fantastic approach for deriving a categorical variable from a numeric one!
iris %>%
  mutate(size_sepal_width = if_else(Sepal.Width > 3, "big", "small"))
… coffee_ranks that shows aroma distributions faceted by cup_rank. You will want to use facet_wrap() for this question (get_help("facet_wrap")!). Again pipe the dataset into ggplot(). Style this plot in any way of your choosing, but again…
… figures directory. The name of the file should be yourlastname_question13.png. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
… clean_coffee_ratings to only “Arabica” species.
… %in%!!): Mexico, Colombia, and Guatemala.
Use select() to keep only the columns country_of_origin, flavor and acidity.
Use mutate() to create a new column called flavor_acidity_ratio that contains flavor divided by acidity.
… my_coffees and print it out.
Using your new my_coffees dataset, make two plots (one for this question, one for the next question) that show the relationship between flavor and acidity, where you are again piping the data into ggplot().
For this question, make a scatterplot to show the relationship between the variables flavor and acidity. For this plot….
… flavor across acidity
… figures directory. The name of the file should be yourlastname_question15.png. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
In 1-2 sentences of markdown below the revealed plot, provide a brief interpretation of this figure - what is the relationship between flavor and acidity, and is it different/same across the three countries?
… flavor_acidity_ratio variable for each country. You can make any plot of your choosing as long as it allows you to compare among the three countries, and you can style any way you want as long as it looks professional. Pipe the data into ggplot(), and again….
… figures directory. The name of the file should be yourlastname_question16.png. You MUST USE the previously defined variable path_to_figures and the function file.path() in your code. You must ensure the figure is saved at a reasonable aspect ratio by specifying width and height arguments when using ggsave(). The defaults are not reliable.
No pressure but maybe you will want… A couple ggplot() things you might want when making this plot, just in case (not at all required to use these, but sometimes students ask about how to do this so here it is!):
Are you interested in adding a horizontal line or vertical line?
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  # makes a horizontal line at y = 3
  geom_hline(yintercept = 3, color = "red")
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  # makes a vertical line at x = 6
  geom_vline(xintercept = 6, color = "red")
Do you remember stat_summary() to show mean +/- SE? It’s a really powerful and flexible function, actually.
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(fill = "grey80") +
  # let's have fun!
  stat_summary(geom = "point",   # specifies to just make the summary a point
               fun = "mean",     # calculate and show mean (NOT default mean +/- se)
               shape = 21,       # when using this shape, ......
               size = 5,         # size specifies size of the point itself
               stroke = 2,       # stroke specifies size of the point outline
               color = "black",
               fill = "orange")
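For comparison, here is a hedged sketch of the mean +/- SE version mentioned above, again with iris. It uses ggplot2's mean_se() helper with a "pointrange" geom; the variable name mean_se_plot is just for this demo, and this is only one of several ways to show a summary.

```r
library(ggplot2)

mean_se_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(fill = "grey80") +
  # fun.data expects a function returning y, ymin, and ymax;
  # mean_se() returns the mean and the mean minus/plus one standard error
  stat_summary(fun.data = mean_se,
               geom = "pointrange",  # a point with a vertical line through it
               color = "black")

mean_se_plot
```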