dplyr
To do these exercises, you will need the tidyverse library. Create a script so you can save your work for future reference, and start with this code:
library(tidyverse)
library(introverse) # if you want :)
These exercises use a pre-loaded dataset called msleep (sound familiar?), which provides different physical and behavioral characteristics of mammals, including how much they sleep.
Take some time to familiarize yourself with the dataset before you proceed to work with it, using functions like head(), names(), etc. In particular, there is a great dplyr function glimpse() which can reveal a lot of helpful information about a data frame. It’s similar to the str() function, but the output is much nicer to look at. Run it to see!
glimpse(msleep)
dplyr verbs
filter()
Use filter() to subset msleep to only herbivores.
Hint: The vore column will tell you if a mammal is an herbivore. You want to find all rows where vore == "herbi".
Hint: Did you write vore == herbi? … "herbi" is a STRING, not a variable!
msleep %>%
  filter(vore == "herbi")
Use filter() to subset msleep to only animals who are awake for at least 12 hours of the day.
Hint: Use the >= operator.
msleep %>%
  filter(awake >= 12)
Use filter() to subset msleep to only herbivores who are awake for at least 12 hours of the day.
Hint: dplyr makes it easy to supply “and” conditions to filter() simply with commas: filter(statement1, statement2). This is “the same” as writing: filter(statement1 & statement2).
msleep %>%
  filter(vore == "herbi", awake >= 12)
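To convince yourself that the comma and & really behave the same way, here is a minimal sketch. The names with_comma and with_and are made up for illustration and are not part of the exercises.

```r
library(tidyverse)

# Two ways to write the same "and" condition
with_comma <- msleep %>%
  filter(vore == "herbi", awake >= 12)

with_and <- msleep %>%
  filter(vore == "herbi" & awake >= 12)

# Compare the two results; this is TRUE when both approaches keep the same rows
identical(with_comma, with_and)
```

Either style is fine. Commas tend to be easier to read when you have several conditions.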
The code below uses filter() to subset msleep to include only herbivores and insectivores, using the %in% logical operator to help craft the statement. Engage with and understand the code, and modify it to subset the data to keep only herbivores and carnivores.
Hint: Use the %in% operator as in: filter(column %in% c(thing_i_want, other_thing_i_want)). In this case you want the vore to be something in this array: c("herbi", "insecti").
Hint: Alternatively, use the | symbol to ask if vore equals “herbi” or if vore equals “insecti”.
msleep %>%
  filter(vore %in% c("herbi", "insecti"))
# Modified to keep herbivores and carnivores:
msleep %>%
  filter(vore %in% c("herbi", "carni"))
# OR:
msleep %>%
  filter(vore == "herbi" | vore == "carni")
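The two versions above should keep exactly the same rows. Here is a small sketch to check that; the names keep_vores, with_in, and with_or are hypothetical and only used for this illustration.

```r
library(tidyverse)

# The values we want to keep, stored once so they are easy to change
keep_vores <- c("herbi", "carni")

with_in <- msleep %>%
  filter(vore %in% keep_vores)

with_or <- msleep %>%
  filter(vore == "herbi" | vore == "carni")

# Compare the two results; this is TRUE when both approaches keep the same rows
identical(with_in, with_or)
```

The %in% version tends to scale better: to keep a third group, you only add one more string to the vector instead of writing another | condition.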
Use filter() to subset msleep to include only herbivores and carnivores who sleep at least 12 hours a day.
Hint: You need two conditions in filter(): subset based on vore AND subset based on sleep_total.
msleep %>%
  filter(vore %in% c("herbi", "carni"),
         sleep_total >= 12)
# OR:
msleep %>%
  filter((vore == "herbi" | vore == "carni"),
         sleep_total >= 12)
Use filter() to subset the data to only carnivores who weigh more than 50 kg.
msleep %>%
  filter(vore == "carni", bodywt > 50)
Repeat the previous exercise, but this time use variables in filter().
Use variables in the filter() function instead of the actual string “carni” and the value 50.
Hint: Define two variables, for example target_vore and target_weight.
target_vore <- "carni"
target_weight <- 50
msleep %>%
  filter(vore == target_vore, bodywt > target_weight)
To filter on different values, you do not need to change the filter() code; you only need to change the variable definitions.
# the variables change, but not the filtering
target_vore <- "herbi"
target_weight <- 25
# see, this next code stays the same:
msleep %>%
  filter(vore == target_vore, bodywt > target_weight)
select()
Use select() to keep only the columns name, awake, sleep_total, sleep_rem, and sleep_cycle.
Hint: List the columns separated by commas. You do not need to wrap them in c().
msleep %>%
  select(name, awake, sleep_total, sleep_rem, sleep_cycle)
Use select() to remove the columns genus and order.
msleep %>%
  select(-genus, -order)
We can also use select() to re-order columns. This is often useful, for example, when viewing datasets that have a lot of columns, and you want to move some columns to the front. The code below moves the column vore to the front, followed by “everything else”, which is represented by the extremely cool and convenient code everything(). Engage with and understand this code, and then modify the code below to reorder the columns as: bodywt, brainwt, then everything else.
Hint: There are three arguments to select(): The column you want to appear first, the column you want to appear second, and finally everything else.
msleep %>%
  select(vore, everything())
# Modified to place bodywt and brainwt first:
msleep %>%
  select(bodywt, brainwt, everything())
Often we need to know “how many rows are in this wrangled data frame?” There are broadly two ways to do this:
The nrow() function, which you already know! Because nrow() takes a data frame as its argument, we can pipe dplyr pipelines into it! But it returns a NUMBER, NOT another data frame, so we can’t pipe out of it into another dplyr function.
nrow(msleep)
# or..
msleep %>%
  nrow()
The dplyr function tally(), which gives you a tibble of the row count. We will learn more interesting uses of this function later, but for now, you should know that it will count your rows and return a tibble:
msleep %>%
  tally()
First, use filter() to get only the rows of interest (mammals which weigh more than 2000). Then, add tally() to the pipeline.
msleep %>%
  filter(bodywt > 2000) %>%
  tally()
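To see the difference between the two counting approaches in practice, here is a minimal sketch. The names n_as_number and n_as_tibble are made up for illustration.

```r
library(tidyverse)

# nrow() ends the pipeline with a plain number
n_as_number <- msleep %>%
  filter(bodywt > 2000) %>%
  nrow()

# tally() ends the pipeline with a one-row tibble containing a column called n
n_as_tibble <- msleep %>%
  filter(bodywt > 2000) %>%
  tally()

class(n_as_number)  # a number (integer), not a data frame
class(n_as_tibble)  # a tibble, so we could keep piping into dplyr verbs

# If you need the number back out of the tibble, pull() the column
n_as_tibble %>%
  pull(n)
```

Both approaches give the same count. Which one to use depends on whether you want to keep piping into more dplyr functions afterwards.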
mutate()
Use mutate() to create a new column called class which literally just contains the string “Mammalia”. Indeed, these are all mammals!
Hint: Not sure how to use mutate()? Don’t forget: get_help("mutate")!
msleep %>%
  mutate(class = "Mammalia")
By default when you create a new column, the column is placed at the END of the data frame. It can be pretty annoying to scroll through the whole dataset to check that your new column was made correctly. It is very helpful to use the select() function to rearrange or subset columns to make sure your code worked properly. As we will see more in depth very soon, the beauty of the pipe %>% is that you can chain more and more dplyr commands together.
select() can be used to help you reorganize columns so that you can more easily check that your answers are right.
msleep %>%
  # first, create the column class
  mutate(class = "Mammalia") %>%
  # second, keep only the column class to more easily make sure it worked
  select(class)
# Or, reorganize columns using everything() when calling select to place `class` first to make sure it worked
# This rearranges columns: place `class` first, and then have "everything else"
msleep %>%
  # first, create the column class
  mutate(class = "Mammalia") %>%
  # then, make class the first-appearing column
  select(class, everything())
When writing multiple pipes, always build it up ONE line at a time! There is no race to the finish line. For example, if your first command doesn’t work properly, there is no chance your second one will work properly. You have to check with your own personal eyeballs that each line of code worked BEFORE appending the next.
First, use mutate() to add a new column to msleep called class that contains the string “Mammalia”. Look at the output to make sure it worked.
Next, use filter() to REMOVE all mammals in the order “Rodentia” (hint: remember the != logical operator!).
Finally, use select() to only keep columns in this order: class, order, genus, name.
msleep %>%
  mutate(class = "Mammalia") %>%
  filter(order != "Rodentia") %>%
  select(class, order, genus, name)
The code below uses mutate() to create a new column called bodywt_g which contains the body weight in grams instead of kg, as is recorded in the existing bodywt column. Engage with this code, and then modify it to instead create a new variable called bodywt_lbs which contains the body weight in pounds (1 kg = 2.2 lbs).
Hint: Why not add select() to the end of your pipeline to make sure your code worked as intended?
msleep %>%
  mutate(bodywt_g = bodywt * 1000) # multiply kg by 1000 to get grams
msleep %>%
  mutate(bodywt_lbs = bodywt * 2.2) # multiply kg by 2.2 to get pounds
# To check your answer, I recommend:
msleep %>%
  mutate(bodywt_lbs = bodywt * 2.2) %>%
  # selecting both of these columns will help you confirm that bodywt_lbs=2.2*bodywt
  select(bodywt_lbs, bodywt)
Use mutate() to create a new column called percent_day_awake that gives the percentage of the day that each species spends awake, and use select() at the end to make sure your calculations worked.
Hint: The column awake says how many hours a day (on average) that species is awake. So, (awake / 24) * 100 is the percent awake!
msleep %>%
  mutate(percent_day_awake = (awake / 24) * 100) %>%
  # Select the column we created to ensure it worked
  select(percent_day_awake)
Use mutate() to create a new column called log_bodywt that gives the natural logarithm of the body weight, and use select() at the end to make sure your calculations worked.
Hint: The function log() by default calculates the natural logarithm (ln).
msleep %>%
  mutate(log_bodywt = log(bodywt)) %>%
  select(log_bodywt)
Use mutate() to create a new column called sleep_awake_ratio that has the ratio of total time spent asleep to total time spent awake (sleep_total divided by awake), and again use select() to make sure it worked.
msleep %>%
  mutate(sleep_awake_ratio = sleep_total/awake) %>%
  select(sleep_awake_ratio)
For this exercise, we step out of dplyr and into the package tidyr, which is part of the tidyverse. This package (which has been loaded for you!) contains a function drop_na() which removes rows containing NAs from a tibble. Explore the use of this function with get_help("drop_na"), and then use the function to remove all rows from msleep that contain NAs in the following columns:
brainwt
sleep_cycle
Hint: If done correctly, there should only be 30 rows remaining out of the original total 83.
msleep %>%
  drop_na(brainwt, sleep_cycle)
Create two new columns with mutate() containing the mean brainwt and the median bodywt for all mammals. Name these columns mean_brainwt and median_bodywt, respectively, and use select() at the end to make sure it worked.
Hint: To create several columns in one mutate(), just add commas!
Hint: Why remove NAs first? Try it without removing the NAs to figure out the issue!
# With drop_na():
msleep %>%
  drop_na(brainwt, bodywt) %>%
  mutate(mean_brainwt = mean(brainwt),
         median_bodywt = median(bodywt)) %>%
  select(mean_brainwt, median_bodywt)
# Without drop_na():
msleep %>%
  mutate(mean_brainwt = mean(brainwt, na.rm = TRUE),
         median_bodywt = median(bodywt, na.rm = TRUE)) %>%
  select(mean_brainwt, median_bodywt)
Recall the function ifelse(), which can define a value based on a condition. Engage with the code below to refresh your memory:
animal <- "goat"
#                      T/F condition     use if TRUE   use if FALSE
is_it_a_goat <- ifelse(animal == "goat", "totes goat", "goatless")
dplyr actually has its own version of this function called if_else() (it has an underscore). This version of the function is technically faster and “safer” to use in your dplyr code, but either ifelse() or if_else() will be fine for the purposes of our class. We’ll use if_else() here to get in the habit! (Come to office hours to learn more about how ifelse() differs from if_else()!)
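As one small illustration of what “safer” can mean (this is just one of the differences, and the variable names here are made up), base ifelse() can silently drop the class of its output, while if_else() keeps it. Dates are a common place where this bites:

```r
library(dplyr)

today <- Sys.Date()

# Base R ifelse() strips the Date class, leaving a bare number
base_result <- ifelse(TRUE, today, today)

# dplyr if_else() preserves the Date class
dplyr_result <- if_else(TRUE, today, today)

class(base_result)
class(dplyr_result)
```

For the character columns used in these exercises, the two functions give the same answers, which is why either one is fine here.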
We can use if_else() inside mutate() to create new columns whose value is conditioned on something else. For example, the code below creates a new column using if_else() to record if a mammal is, or is not, a carnivore. Engage with this code to understand it, and then modify the code to instead make a new column called are_you_herbi. This column should contain the value “herbivore” if yes, and “not_an_herbivore” if no.
Hint: You can refer directly to the column vore even when using if_else(), because it is all part of the mutate() code.
Hint: if_else() is part of the mutate() code. It is not a stand-alone dplyr verb whose first argument is a data frame!!
Hint: Use select() to make sure you did it all correctly! We’ll want to select vore and are_you_herbi to make sure “herbivores” match up with the right value, etc.
msleep %>%
  mutate(are_you_carni = if_else(vore == "carni", "carnivore", "not_a_carnivore"))
# Helpful to pipe into select() to keep `vore` and `are_you_carni` FOR THE PURPOSES OF CHECKING IF THE MUTATE WORKED, eg:
msleep %>%
  mutate(are_you_carni = if_else(vore == "carni", "carnivore", "not_a_carnivore")) %>%
  select(vore, are_you_carni)
# Modified to record herbivores instead:
msleep %>%
  mutate(are_you_herbi = if_else(vore == "herbi", "herbivore", "not_an_herbivore"))
Hint: Define a variable to use in your mutate() code.
weight_threshold <- 100
msleep %>%
  mutate(weight_class = if_else(bodywt >= weight_threshold, "heavy", "light")) %>%
  # and check with select:
  select(bodywt, weight_class)
Create a new column in msleep called needs_more_caffeine where mammals who are awake (awake) more than 16 hours a day have the value “definitely” and other mammals have “nope”.
awake_level <- 16
msleep %>%
  mutate(needs_more_caffeine = if_else(awake > awake_level, "definitely", "nope")) %>%
  # and check:
  select(awake, needs_more_caffeine)
rename() and arrange()
Use rename() to change the name of the column conservation to conservation_status.
Hint: The rename() syntax is: rename(newname = oldname). You do NOT need to use quotes.
Hint: Remember, the order is NEWname = OLDname.
msleep %>%
  rename(conservation_status = conservation)
Use arrange() to sort the dataset in ascending order of bodywt.
Hint: arrange() sorts in ascending order by default.
msleep %>%
  arrange(bodywt)
Use arrange() to sort the dataset in descending order of bodywt.
Hint: Use desc() to sort by descending order of a column instead of just writing the column name, like: arrange( desc(COLUMN) ).
msleep %>%
  arrange(desc(bodywt))
Use rename() to change the name of the column vore to food_preference.
msleep %>%
  rename(food_preference = vore)
There is another useful function in dplyr called slice(), which will keep/remove rows based on which row it is (similar to, but different from, filter(), which subsets rows based on TRUE or FALSE). The code below keeps only the first two rows of msleep, for example. Engage with this code to make sure you understand:
Hint: Print the msleep tibble so you can convince yourself that indeed these are the first two rows (and 5th, 7th, and 16th later!).
msleep %>%
  # Keep rows 1-2
  slice(1, 2)
# Or:
msleep %>%
  # Keep rows 1-2. They are contiguous so I can use : also
  slice(1:2)
# One more example: keep rows 5, 7, and 16
msleep %>%
  slice(5, 7, 16)
Importantly, the slice() function is really conveniently used along with arrange(): Imagine we want to only keep the top 10 values of a certain variable. We can arrange on that variable and then slice the top 10 rows (i.e. rows 1-10, which in R is 1:10).
Engage with the code below, and then modify it to keep the rows with the top 5 values of bodywt.
msleep %>%
  # Arrange in *descending order* (we want top values!) of sleep_cycle
  arrange(desc(sleep_cycle)) %>%
  # Keep top 5 sleep cycles
  slice(1:5)
# Modified to keep the top 5 body weights:
msleep %>%
  arrange(desc(bodywt)) %>%
  slice(1:5)
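As an aside, and assuming you have dplyr version 1.0.0 or later installed, there is also a helper called slice_max() that combines the arrange-then-slice pattern into one step. This is a sketch of an alternative, not the approach the exercise asks for, and the variable names are made up for illustration.

```r
library(tidyverse)

# The arrange() + slice() approach from the exercise
top5_arrange <- msleep %>%
  arrange(desc(bodywt)) %>%
  slice(1:5)

# The same idea with slice_max(): keep the 5 rows with the largest bodywt
top5_slice_max <- msleep %>%
  slice_max(bodywt, n = 5)

# Compare which mammals each approach kept
top5_arrange$name
top5_slice_max$name
```

One thing to be aware of: by default slice_max() keeps ties, so it can return more than n rows when several rows share the cutoff value.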
In many circumstances, we are interested in subsetting data to only keep unique rows and therefore remove duplicates. We simply use the dplyr function distinct() for this, with no arguments! The code below shows you how to use the function, but it’s not very interesting yet since there are no duplicate rows in msleep!
msleep %>%
  distinct()
Modify the code below to keep only the unique rows. Can you also remove the rows that contain NA?
msleep %>%
  select(vore, conservation)
# Modified to keep unique rows and drop NAs:
msleep %>%
  select(vore, conservation) %>%
  distinct() %>%
  drop_na()
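Although distinct() is used with no arguments above, it can also take column names directly, in which case it keeps only those columns and their unique combinations. A minimal sketch, with made-up variable names, comparing the two routes:

```r
library(tidyverse)

# select() first, then distinct() with no arguments
two_steps <- msleep %>%
  select(vore, conservation) %>%
  distinct() %>%
  drop_na()

# Name the columns inside distinct() instead
one_step <- msleep %>%
  distinct(vore, conservation) %>%
  drop_na()

# Compare the two results
all.equal(two_steps, one_step)
```

Both routes are fine. Writing select() first can make it more obvious to a reader which columns survive.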
dplyr verbs
Remember: You can always use functions like select() to check your code, even if select() is not actually part of the solution.
Use filter() to subset msleep to only herbivores, and then use arrange() to order the data by name.
msleep %>%
  filter(vore == "herbi") %>%
  arrange(name)
Use filter() to subset msleep to only species whose conservation status is least concern (“lc”), and then use select() to remove the conservation column, and finally remove all NAs with drop_na(). Save the final output of your piped commands to a new data frame called msleep_lc, and then print the new data frame to confirm your work was successful.
Hint: You can use -> to “send” the final output into the variable name msleep_lc.
msleep %>%
  filter(conservation == "lc") %>%
  select(-conservation) %>%
  drop_na() -> msleep_lc
msleep_lc
Subset msleep to only mammals whose amount of hours spent awake (awake) is greater than the amount of hours spent asleep (sleep_total).
Hint: There are many ways to do this! For example, you can check whether awake divided by sleep_total is greater than 1.
# One approach:
msleep %>%
  filter(awake > sleep_total)
# Another approach: the difference must be positive
msleep %>%
  filter(awake - sleep_total > 0)
# And another approach
msleep %>%
  filter(awake/sleep_total > 1)
# Or make a variable along the way, why not! However you want to do it, as long as you use dplyr code
msleep %>%
  mutate(difference = awake - sleep_total) %>%
  filter(difference > 0)
Use filter() to subset msleep to only primate species (order is “Primates”) whose conservation status is least concern (“lc”) (two things to filter!!), and then use rename() to change the column vore to be called diet.
Hint: There are two conditions, so use a comma in filter() to specify them both.
msleep %>%
  filter(order == "Primates", conservation == "lc") %>%
  rename(diet = vore)
Subset msleep to only carnivores who weigh more than 50 kg, arrange the rows by bodywt, and keep only the columns name, bodywt, brainwt in that order.
Added challenge: Also rename the column name to common_name. You can do this in select() using rename() syntax.
msleep %>%
  filter(vore == "carni", bodywt > 50) %>%
  arrange(bodywt) %>%
  select(name, bodywt, brainwt)
# Added challenge
msleep %>%
  filter(vore == "carni", bodywt > 50) %>%
  arrange(bodywt) %>%
  select(common_name = name, bodywt, brainwt)
Now switch the order of arrange() and select(). Does this change the output? Understand why or why not.
msleep %>%
  filter(vore == "carni", bodywt > 50) %>%
  select(name, bodywt, brainwt) %>%
  arrange(brainwt)
What if select() comes before filter()? The code below gives an error: after select(), there is no longer a vore column, so you can’t possibly filter on it.
msleep %>%
  select(name, bodywt, brainwt) %>%
  filter(vore == "carni", bodywt > 50) %>%
  arrange(brainwt)
The code below attempts to get the bodywt and name columns for all mammals whose brain weight is less than 2. Alas, the code has a bug! Can you figure out WHY the code has a bug and fix the code?
msleep %>%
  select(bodywt, name) %>%
  filter(brainwt < 2)
# Need to reverse the order! The original "logic" was flawed:
# after select(), the brainwt column no longer exists, so filter() cannot use it
msleep %>%
  filter(brainwt < 2) %>%
  select(bodywt, name)
The code below attempts to get the sleep_rem column for all carnivores, arranged by genus. Alas, the code has a bug yet again! Can you fix the code?
msleep %>%
  filter(vore == "carni") %>%
  select(sleep_rem) %>%
  arrange(genus)
# Arrange by genus BEFORE select() removes the genus column:
msleep %>%
  arrange(genus) %>%
  filter(vore == "carni") %>%
  select(sleep_rem)
# Works just as well!
msleep %>%
  filter(vore == "carni") %>%
  arrange(genus) %>%
  select(sleep_rem)
Here is a case where distinct() is a helpful function. We want to answer this question using a dplyr framework: What are the unique vores in the dataset? To address this question, we need to first subset the data to only contain the vore column, and then use distinct():
msleep %>%
  select(vore) %>%
  distinct()
Wrangle msleep to arrive at a tibble that contains just the column vore and shows only the unique vores that mammals of the order "Carnivora" belong to. In other words, what do carnivores eat? (The answer should make some sense…)
msleep %>%
  filter(order == "Carnivora") %>%
  select(vore) %>%
  distinct()
dplyr
Use summarize() to create a summarized data frame with a column mean_awake that contains the mean number of hours spent awake.
msleep %>%
  summarize(mean_awake = mean(awake))
Repeat the previous exercise using mutate() rather than summarize(). The goal of this question is so you can see clearly how mutate and summarize differ.
msleep %>%
  # Now we have a whole column where all rows have the same value.
  mutate(mean_awake = mean(awake))
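One quick way to see the difference between the two verbs is to compare the shape of what each returns. A minimal sketch, with made-up variable names:

```r
library(tidyverse)

# summarize() collapses the whole data frame into a single summary row
summarized <- msleep %>%
  summarize(mean_awake = mean(awake))

# mutate() keeps every row and every column, and adds one new column
mutated <- msleep %>%
  mutate(mean_awake = mean(awake))

dim(summarized)  # 1 row, 1 column
dim(mutated)     # 83 rows, 12 columns (the original 11 plus mean_awake)
```

So reach for summarize() when you want one answer per dataset (or per group), and mutate() when you want a value on every row.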
Let’s keep practicing mutate() vs summarize(). We want to know: How does each mammal’s time spent asleep compare to the average amount of time mammals sleep? In this case, we’ll calculate the difference between each species’ sleep and the average sleep, and we want a row for each mammal showing the difference.
Your final tibble should contain only the columns name and awake_difference (representing the difference between the species and the average).
Hint: Think about whether mutate() or summarize() is more appropriate.
msleep %>%
  mutate(mean_awake = mean(awake)) %>%
  mutate(awake_difference = awake - mean_awake) %>%
  select(name, awake_difference)
# Done another way:
# You can use a single call to mutate() to make several columns at once, and new lines know about the previous ones!
msleep %>%
  mutate(mean_awake = mean(awake),
         awake_difference = awake - mean_awake) %>%
  select(name, awake_difference)
# Done another way: No need to even make mean_awake!
msleep %>%
  mutate(awake_difference = awake - mean(awake)) %>%
  select(name, awake_difference)
Use summarize() and mean() to determine the average amount of time spent in REM sleep (column sleep_rem) by all mammals in the dataset msleep.
Hint: There are NAs in this column, so you need to tell the function mean() to ignore NAs with the extra argument na.rm = TRUE. Remember that?! It’s an argument to mean(), NOT to summarize()!!
msleep %>%
  summarize(mean_rem = mean(sleep_rem, na.rm = TRUE))
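If you are wondering what happens without na.rm = TRUE, here is a small sketch. The vector toy_hours contains made-up values purely for illustration, and the names rem_summary and mean_rem_with_na are hypothetical.

```r
library(tidyverse)

# A toy vector with one missing value (made-up numbers)
toy_hours <- c(1, NA, 3)

mean(toy_hours)                # NA: a single NA "poisons" the mean
mean(toy_hours, na.rm = TRUE)  # 2: the NA is ignored, so (1 + 3) / 2

# The same thing happens inside summarize()
rem_summary <- msleep %>%
  summarize(mean_rem_with_na = mean(sleep_rem),
            mean_rem = mean(sleep_rem, na.rm = TRUE))

rem_summary
```

Notice that na.rm = TRUE sits inside the parentheses of mean() in both cases.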
Using group_by(), calculate the median (with summarize() and median()!) body weight (bodywt) of each vore group.
Hint: First use group_by(COLUMN-TO-GROUP-BY) (in this case, vore), and then pipe into summarize().
msleep %>%
  group_by(vore) %>%
  summarize(med_bodywt = median(bodywt))
Using group_by(), calculate the maximum (with summarize() and max()!) brain weight (brainwt) of each vore group. Then, sort the data according to maximum brain weight (your new well-named column!) with arrange().
msleep %>%
  drop_na(brainwt) %>%
  group_by(vore) %>%
  summarize(max_brainwt = max(brainwt)) %>%
  arrange(max_brainwt)
Using group_by(), calculate the mean body weight (bodywt) of each combination of vore and conservation groups.
Hint: You can group by multiple columns in group_by() just by listing the columns.
Hint: Can you also remove the rows with NA?
msleep %>%
  group_by(vore, conservation) %>%
  summarize(mean_bodywt = mean(bodywt)) %>%
  drop_na()
Note that after summarize(), the output is still grouped by vore. The preserved grouping can lead to unintended behavior. For example, let’s say I want to add up the mean_bodywt values: I should end up with 1 number (the sum), but I don’t!
To avoid this, use ungroup() before proceeding:
It is good practice to use ungroup() to be safe right after your grouped calculations.
msleep %>%
  group_by(vore, conservation) %>%
  summarize(mean_bodywt = mean(bodywt)) %>%
  # REMOVE ANY PREVIOUS GROUPINGS AFTER DONE WITH GROUPED CALCULATIONS:
  ungroup() %>%
  drop_na() %>%
  # Add up the mean_bodywt
  summarize(sum_bw = sum(mean_bodywt))
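To see the leftover grouping for yourself, here is a minimal sketch comparing the pipeline with and without ungroup(). The variable names are made up for illustration, and group_vars() is a dplyr helper that reports which columns a tibble is currently grouped by.

```r
library(tidyverse)

grouped_means <- msleep %>%
  group_by(vore, conservation) %>%
  summarize(mean_bodywt = mean(bodywt)) %>%
  drop_na()

# Which grouping is left over after summarize()?
group_vars(grouped_means)

# Without ungroup(): one sum PER vore, not one overall sum
without_ungroup <- grouped_means %>%
  summarize(sum_bw = sum(mean_bodywt))

# With ungroup(): a single overall sum
with_ungroup <- grouped_means %>%
  ungroup() %>%
  summarize(sum_bw = sum(mean_bodywt))

nrow(without_ungroup)
nrow(with_ungroup)
```

You may also see a message from summarize() about how the output is grouped. That message is dplyr reminding you of exactly this behavior.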
Use count() to count how many mammals of each taxonomic order (column order) are in the dataset msleep. Rename the new column this creates to be called order_count, and then sort the output in descending order of order_count.
Hint: count(COLUMN) is an awesome shortcut for counting all observations in a group COLUMN. Need help? Use get_help("count")
Hint: Look at the output of count() to find the name of the column it creates. That is the column to rename and arrange() on!!!
msleep %>%
  count(order) %>%
  rename(order_count = n) %>%
  arrange(desc(order_count))
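As an optional shortcut, count() itself has arguments called name and sort that can replace the separate rename() and arrange() steps. This is a sketch of an alternative rather than the intended solution, and the variable names long_way and short_way are made up.

```r
library(tidyverse)

# The three-step version from the exercise
long_way <- msleep %>%
  count(order) %>%
  rename(order_count = n) %>%
  arrange(desc(order_count))

# The same idea in one call: name the count column and sort descending
short_way <- msleep %>%
  count(order, name = "order_count", sort = TRUE)

# Compare the two results
all.equal(long_way, short_way)
```

The longer version is still great practice for chaining verbs, and it makes each step easy to check one line at a time.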
# There are MANY WAYS to arrive at this solution! Below is one good option:
msleep %>%
  filter(vore %in% c("herbi", "insecti")) %>%
group_by(vore) %>%
summarize(mean_bodywt = mean(bodywt)) %>%
arrange(desc(mean_bodywt)) # answer: herbivores!
Hint: if_else() will be useful here!
Hint: You will need to remove NAs. How do I know? I LOOKED AT THE DATA!
msleep %>%
  # I personally find it much easier to see what's going on by only keeping these 3 columns
  select(conservation, bodywt, brainwt) %>%
  # remove rows where at least one of our variables of interest is NA. How do I know to do this? I first ran the code without this line, and the results had tons of NA! So, I should remove them
  drop_na(conservation, bodywt, brainwt) %>%
  mutate(cons_type = if_else(conservation == "domesticated", "dom", "notdom")) %>%
  group_by(cons_type) %>%
  mutate(ratio = bodywt/brainwt) %>%
  summarize(mean_ratio = mean(ratio)) # answer: domesticated