dplyr
To do these exercises, you will need to load the tidyverse library. Create a script so you can save your work for future reference, and start with this code:
library(tidyverse)
library(introverse) # if you want :)
These exercises use a pre-loaded dataset called msleep (sound familiar?), which provides different physical and behavioral characteristics of mammals, including how much they sleep.
Take some time to familiarize yourself with the dataset before you proceed to work with it, using functions like head(), names(), etc. In particular, there is a great dplyr function glimpse() which can reveal a lot of helpful information about a data frame. It’s similar to the str() function, but the output is much nicer to look at. Run it to see!
glimpse(msleep)
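If you want a feel for these exploration functions before running them on msleep, here is a minimal sketch on a tiny made-up tibble. The tibble pets and its columns are hypothetical, invented only for illustration.

```r
library(dplyr)

# Hypothetical toy data, not part of msleep
pets <- tibble(
  name = c("Rex", "Tom", "Bubbles"),
  species = c("dog", "cat", "fish"),
  weight_kg = c(30, 4.5, 0.1)
)

col_names <- names(pets)     # character vector of the column names
first_rows <- head(pets, 2)  # the first two rows, still a tibble
n_rows <- nrow(pets)         # a single number

# Prints one line per column: its name, its type, and its first values
glimpse(pets)
```

The same calls work on msleep; only the data frame name changes.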
dplyr verbs
filter()
Use filter() to subset msleep to only herbivores.
The vore column will tell you if a mammal is an herbivore. You want to find all rows where vore == "herbi".
vore == herbi? … "herbi" is a STRING, not a variable!
msleep %>%
filter(vore == "herbi")
Use filter() to subset msleep to only animals who are awake for at least 12 hours of the day.
Use the >= operator for “at least”.
msleep %>%
filter(awake >= 12)
Use filter() to subset msleep to only herbivores who are awake for at least 12 hours of the day.
dplyr makes it easy to supply “and” conditions to filter() simply with commas: filter(statement1, statement2). This is “the same” as writing: filter(statement1 & statement2).
msleep %>%
filter(vore == "herbi", awake >= 12)
The code below uses filter() to subset msleep to include only herbivores and insectivores, using the %in% logical operator to help craft the statement. Engage with and understand the code, and modify it to subset the data to keep only herbivores and carnivores.
Use the %in% operator as in: filter(column %in% c(thing_i_want, other_thing_i_want)). In this case you want the vore to be something in this array: c("herbi", "insecti").
Alternatively, use the | symbol to ask if vore equals “herbi” or if vore equals “insecti”.
msleep %>%
filter(vore %in% c("herbi", "insecti"))
msleep %>%
filter(vore %in% c("herbi", "carni"))
# OR:
msleep %>%
filter(vore == "herbi" | vore == "carni")
Use filter() to subset msleep to include only herbivores and carnivores who sleep at least 12 hours a day.
Provide two conditions to filter(): subset based on vore AND subset based on sleep_total.
msleep %>%
filter(vore %in% c("herbi", "carni"),
sleep_total >= 12)
# OR:
msleep %>%
filter((vore == "herbi" | vore == "carni"),
sleep_total >= 12)
Use filter() to subset the data to only carnivores who weigh more than 50 kg.
msleep %>%
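filter(vore == "carni", bodywt > 50)
Repeat the previous exercise, but first define variables to use inside filter().
Use these variables in the filter() function instead of the actual string “carni” and the value 50.
Name the variables target_vore and target_weight.
target_vore <- "carni"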
target_weight <- 50
msleep %>%
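filter(vore == target_vore, bodywt > target_weight)
Now subset the data to only herbivores who weigh more than 25 kg. You should not need to change the filter() code; you only need to change the variable definitions.
# the variables change, but not the filtering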
target_vore <- "herbi"
target_weight <- 25
# see, this next code stays the same:
msleep %>%
filter(vore == target_vore, bodywt > target_weight)
select()
Use select() to keep only the columns name, awake, sleep_total, sleep_rem, and sleep_cycle.
List the column names separated by commas; they do not need to be wrapped in c().
msleep %>%
select(name, awake, sleep_total, sleep_rem, sleep_cycle)
Use select() to remove the columns genus and order.
msleep %>%
select(-genus, -order)
We can also use select() to re-order columns. This is often useful, for example, when viewing datasets that have a lot of columns and you want to move some columns to the front. The code below moves the column vore to the front, followed by “everything else”, which is represented by the extremely cool and convenient code everything(). Engage with and understand this code, and then modify the code below to reorder the columns as: bodywt, brainwt, then everything else.
Provide three arguments to select(): the column you want to appear first, the column you want to appear second, and finally everything else.
msleep %>%
select(vore, everything())
msleep %>%
select(bodywt, brainwt, everything())
Often we need to know “how many rows are in this wrangled data frame?” There are broadly two ways to do this:
The nrow() function, which you already know! Because nrow() takes a data frame as its argument, we can pipe dplyr pipelines into it! But it returns a NUMBER, NOT another data frame, so we can’t pipe out of it into another dplyr function.
nrow(msleep)
# or..
msleep %>%
nrow()
The dplyr function tally(), which gives you a tibble of the row count. We will learn more interesting uses of this function later, but for now, you should know that it will count your rows and return a tibble:
msleep %>%
tally()
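To make the difference between the two approaches concrete, here is a minimal sketch on a made-up tibble (toy and its columns are hypothetical). It shows that nrow() hands back a plain number, while tally() hands back a tibble that can keep flowing through a pipeline.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = 1:4, weight = c(10, 2500, 30, 4000))

n_number <- toy %>% nrow()   # an integer, not a data frame
n_tibble <- toy %>% tally()  # a 1 x 1 tibble with a column named n

# Because tally() returns a tibble, more dplyr verbs can follow it
n_renamed <- toy %>%
  tally() %>%
  rename(row_count = n)
```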
How many mammals weigh more than 2000 kg?
First, use filter() to get only the rows of interest (mammals which weigh more than 2000).
Then, count the rows (add tally() to the pipeline).
msleep %>%
filter(bodywt > 2000) %>%
tally()
mutate()
Use mutate() to create a new column called class which literally just contains the string “Mammalia”. Indeed, these are all mammals!
Need a refresher on mutate()? Don’t forget: get_help("mutate")!
msleep %>%
mutate(class = "Mammalia")
By default, when you create a new column, the column is placed at the END of the data frame. It can be pretty annoying to scroll through the whole dataset to check that your new column was made correctly. It is very helpful to use the select() function to rearrange or subset columns to make sure your code worked properly. As we will see more in depth very soon, the beauty of the pipe %>% is that you can chain more and more dplyr commands together.
select() can be used to help you reorganize columns so that you can more easily check that your answers are right.
msleep %>%
# first, create the column class
mutate(class = "Mammalia") %>%
# second, keep only the column class to more easily make sure it worked
select(class)
# Or, reorganize columns using everything() when calling select to place `class` first to make sure it worked
# This rearranges columns: place `class` first, and then have "everything else"
msleep %>%
# first, create the column class
mutate(class = "Mammalia") %>%
# then, make class the first-appearing column
select(class, everything())
When writing multiple pipes, always build them up ONE line at a time! There is no race to the finish line. For example, if your first command doesn’t work properly, there is no chance your second one will work properly. You have to check with your own personal eyeballs that each line of code worked BEFORE appending the next.
First, use mutate() to add a new column to msleep called class that contains the string “Mammalia”. Look at the output to make sure it worked.
Then, use filter() to REMOVE all mammals in the order “Rodentia” (hint: remember the != logical operator!).
Finally, use select() to only keep columns in this order: class, order, genus, name.
msleep %>%
mutate(class = "Mammalia") %>%
filter(order != "Rodentia") %>%
select(class, order, genus, name)
The code below uses mutate() to create a new column called bodywt_g which contains the body weight in grams instead of kg, as is recorded in the existing bodywt column. Engage with this code, and then modify it to instead create a new variable called bodywt_lbs which contains the body weight in pounds (1 kg is about 2.2 lbs).
Can you add select() to the end of your pipeline to make sure your code worked as intended?
msleep %>%
mutate(bodywt_g = bodywt * 1000) # multiply kg by 1000 to get grams
msleep %>%
mutate(bodywt_lbs = bodywt * 2.2) # multiply kg by 2.2 to get pounds
# To check your answer, I recommend:
msleep %>%
mutate(bodywt_lbs = bodywt * 2.2) %>%
# selecting both of these columns will help you confirm that bodywt_lbs=2.2*bodywt
select(bodywt_lbs, bodywt)
Use mutate() to create a new column called percent_day_awake that gives the percentage of the day that each species spends awake, and use select() at the end to make sure your calculations worked.
The column awake says how many hours a day (on average) that species is awake. So, (awake / 24) * 100 is the percent awake!
msleep %>%
mutate(percent_day_awake = (awake / 24) * 100) %>%
# Select the column we created to ensure it worked
select(percent_day_awake)
Use mutate() to create a new column called log_bodywt that gives the natural logarithm of the body weight, and use select() at the end to make sure your calculations worked.
In R, log() by default calculates the natural logarithm (ln).
msleep %>%
mutate(log_bodywt = log(bodywt)) %>%
select(log_bodywt)
Use mutate() to create a new column called sleep_awake_ratio that has the ratio of total time spent asleep to total time spent awake (sleep_total divided by awake), and again use select() to make sure it worked.
msleep %>%
mutate(sleep_awake_ratio = sleep_total/awake) %>%
select(sleep_awake_ratio)
For this exercise, we step outside of dplyr and into the package tidyr, which is part of the tidyverse. This package (which has been loaded for you!) contains a function drop_na() which removes NA’s from a tibble. Explore the use of this function with get_help("drop_na"), and then use the function to remove all rows from msleep that contain NAs in the following columns:
brainwt
sleep_cycle
Hint: If done correctly, there should only be 30 rows remaining out of the original total 83.
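As a warm-up, here is a minimal sketch of how drop_na() behaves with and without column names, using a made-up tibble (toy and its columns x, y, z are hypothetical). Naming columns restricts the NA check to those columns; naming none checks every column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with NAs scattered across columns
toy <- tibble(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d"),
  z = c(NA, 10, 20, 30)
)

only_x <- toy %>% drop_na(x)      # drops rows where x is NA
x_and_y <- toy %>% drop_na(x, y)  # drops rows where x OR y is NA
all_cols <- toy %>% drop_na()     # drops rows with an NA in ANY column
```

Comparing nrow() of each result is a quick way to see how many rows each version removed.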
msleep %>%
drop_na(brainwt, sleep_cycle)
Create two new columns with a single call to mutate() containing the mean brainwt and then median bodywt for all mammals. Name these columns mean_brainwt and median_bodywt, respectively, and use select() at the end to make sure it worked.
To create multiple columns in one call to mutate(), just add commas!
Why might you need to remove NAs first? Try it without removing the NAs to figure out the issue!
# With drop_na():
msleep %>%
drop_na(brainwt, bodywt) %>%
mutate(mean_brainwt = mean(brainwt),
median_bodywt = median(bodywt)) %>%
select(mean_brainwt, median_bodywt)
# Without drop_na():
msleep %>%
mutate(mean_brainwt = mean(brainwt, na.rm = TRUE),
median_bodywt = median(bodywt, na.rm = TRUE)) %>%
select(mean_brainwt, median_bodywt)
Recall the base R function ifelse(), which can define a value based on a condition. Engage with the code below to refresh your memory:
animal <- "goat"
#                      T/F condition     use if TRUE   use if FALSE
is_it_a_goat <- ifelse(animal == "goat", "totes goat", "goatless")
dplyr actually has its own version of this function called if_else() (it has an underscore). This version of the function is technically faster and “safer” to use in your dplyr code, but either ifelse() or if_else() will be fine for the purposes of our class. We’ll use if_else() here to get in the habit! (Come to office hours to learn more about how ifelse() differs from if_else()!)
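For the curious, here is a minimal sketch of two ways the functions differ, using a made-up numeric vector x (hypothetical). It illustrates the “safer” part: if_else() refuses to mix incompatible types and lets you say what an NA condition should become, whereas ifelse() silently coerces.

```r
library(dplyr)

x <- c(5, NA, 20)  # hypothetical values

# Base R: the character "big" and the number 0 are silently coerced to character
base_result <- ifelse(x > 10, "big", 0)

# dplyr: mixing character and numeric is an error, which catches mistakes early.
# tryCatch() is used here only so the sketch keeps running after the error.
strict_result <- tryCatch(
  if_else(x > 10, "big", 0),
  error = function(e) "error"
)

# dplyr also lets you choose the value used when the condition is NA
with_missing <- if_else(x > 10, "big", "small", missing = "unknown")
```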
We can use if_else() inside mutate() to create new columns whose value is conditioned on something else. For example, the code below creates a new column using if_else() to record if a mammal is, or is not, a carnivore. Engage with this code to understand it, and then modify the code to instead make a new column called are_you_herbi. This column should contain the value “herbivore” if yes, and “not_an_herbivore” if no.
You can refer directly to the column vore even when using if_else(), because it is all part of the mutate() code.
Remember, if_else() is part of the mutate() code. It is not a stand-alone dplyr verb whose first argument is a data frame!!
Use select() to make sure you did it all correctly! We’ll want to select vore and are_you_herbi to make sure “herbivores” match up with the right value, etc.
msleep %>%
mutate(are_you_carni = if_else(vore == "carni", "carnivore", "not_a_carnivore"))
# Helpful to pipe into select() to keep `vore` and `are_you_carni` FOR THE PURPOSES OF CHECKING IF THE MUTATE WORKED, eg:
msleep %>%
mutate(are_you_carni = if_else(vore == "carni", "carnivore", "not_a_carnivore")) %>%
select(vore, are_you_carni)
msleep %>%
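mutate(are_you_herbi = if_else(vore == "herbi", "herbivore", "not_an_herbivore"))
Create a new column called weight_class that contains “heavy” for mammals whose bodywt is at least 100 and “light” otherwise. Define the threshold as a variable called weight_threshold before your mutate() code.
weight_threshold <- 100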
msleep %>%
mutate(weight_class = if_else(bodywt >= weight_threshold, "heavy", "light")) %>%
# and check with select:
select(bodywt, weight_class)
Create a new column in msleep called needs_more_caffeine where mammals who are awake (awake) more than 16 hours a day have the value “definitely” and other mammals have “nope”.
awake_level <- 16
msleep %>%
mutate(needs_more_caffeine = if_else(awake > awake_level, "definitely", "nope")) %>%
# and check:
select(awake, needs_more_caffeine)
rename() and arrange()
Use rename() to change the name of the column conservation to conservation_status.
The rename() syntax is: rename(newname = oldname). You do NOT need to use quotes.
Remember the order: NEWname = OLDname.
msleep %>%
rename(conservation_status = conservation)
Use arrange() to sort the dataset in ascending order of bodywt.
arrange() sorts in ascending order by default.
msleep %>%
arrange(bodywt)
Use arrange() to sort the dataset in descending order of bodywt.
Use desc() to sort by descending order of a column instead of just writing the column name, like: arrange(desc(COLUMN)).
msleep %>%
arrange(desc(bodywt))
Use rename() to change the name of the column vore to food_preference.
msleep %>%
rename(food_preference = vore)
There is another function in dplyr called slice(), which keeps or removes rows based on their position (similar to, but different from, filter(), which subsets rows based on a TRUE or FALSE condition). The code below keeps only the first two rows of msleep, for example. Engage with this code to make sure you understand:
First, look at the msleep tibble so you can convince yourself that indeed these are the first two rows (and 5th, 7th, and 16th later!).
msleep %>%
# Keep rows 1-2
slice(1,2)
# Or:
msleep %>%
# Keep rows 1-2. They are contiguous so I can use : also
slice(1:2)
# One more example: keep rows 5, 7, and 16
msleep %>%
slice(5, 7, 16)
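The examples above all keep rows. As a minimal sketch of the “remove” side, negative positions drop rows instead. The tibble toy below is hypothetical, made up only so the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data with five rows
toy <- tibble(id = 1:5, letter = c("a", "b", "c", "d", "e"))

kept <- toy %>% slice(2, 4)            # keep rows 2 and 4
dropped <- toy %>% slice(-1)           # remove row 1, keep the rest
dropped_many <- toy %>% slice(-(1:3))  # remove rows 1 through 3
```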
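Importantly, the slice() function is really conveniently used along with arrange(): imagine we want to keep only the top 10 values of a certain variable. We can arrange on that variable and then slice the top 10 rows (i.e. rows 1-10, which in R is 1:10).
The code below keeps the five mammals with the longest sleep cycles. Engage with this code, and then modify it to keep the five mammals with the largest bodywt.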
msleep %>%
# Arrange in *descending order* (we want top values!) of sleep_cycle
arrange(desc(sleep_cycle)) %>%
# Keep top 5 sleep cycles
slice(1:5)
msleep %>%
arrange(desc(bodywt)) %>%
slice(1:5)
In many circumstances, we are interested in subsetting data to keep only unique rows and therefore remove duplicates. We simply use the dplyr function distinct() for this, with no arguments! The code below shows you how to use the function, but it’s not very interesting yet since there are no duplicate rows in msleep!
msleep %>%
distinct()
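Since msleep has no duplicate rows, here is a minimal sketch on a made-up tibble that does (toy and its columns are hypothetical), so you can actually see distinct() remove something.

```r
library(dplyr)

# Hypothetical toy data: rows 1, 2, and 5 are identical
toy <- tibble(
  vore = c("herbi", "herbi", "carni", "carni", "herbi"),
  status = c("lc", "lc", "lc", "en", "lc")
)

# Keeps one copy of each unique row
unique_rows <- toy %>% distinct()

# Selecting fewer columns first changes what counts as a duplicate
unique_vores <- toy %>%
  select(vore) %>%
  distinct()
```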
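Now keep only the columns vore and conservation, and then use distinct() to find the unique combinations of the two. Can you also remove the rows that contain NA?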
msleep %>%
select(vore, conservation)
msleep %>%
select(vore, conservation) %>%
distinct() %>%
drop_na()
dplyr verbs
Remember: You can always use functions like select() to check your code, even if select() is not actually part of the solution.
Use filter() to subset msleep to only herbivores, and then use arrange() to order the data by name.
msleep %>%
filter(vore == "herbi") %>%
arrange(name)
Use filter() to subset msleep to only species whose conservation status is least concern (“lc”), and then use select() to remove the conservation column, and finally remove all NAs with drop_na(). Save the final output of your piped commands to a new data frame called msleep_lc, and then print the new data frame to confirm your work was successful.
You can use the right-pointing assignment arrow -> to “send” the final output into the variable name msleep_lc.
msleep %>%
filter(conservation == "lc") %>%
select(-conservation) %>%
drop_na() -> msleep_lc
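msleep_lc
Use filter() to subset msleep to only mammals whose number of hours spent awake (awake) is greater than the number of hours spent asleep (sleep_total).
There are several ways to write this condition. For example, you can check whether awake divided by sleep_total is greater than 1.
# One approach: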
msleep %>%
filter(awake > sleep_total)
# Another approach:
msleep %>%
filter(awake - sleep_total > 0)
# And another approach
msleep %>%
filter(awake/sleep_total > 1)
# Or make a variable along the way, why not! However you want to do it, as long as you use dplyr code
msleep %>%
mutate(difference = awake - sleep_total) %>%
filter(difference > 0)
Use filter() to subset msleep to only primate species (order is “Primates”) whose conservation status is least concern (“lc”) (two things to filter!!), and then use rename() to change the column vore to be called diet.
Pass two conditions to filter() to specify them both.
msleep %>%
filter(order == "Primates", conservation == "lc") %>%
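rename(diet = vore)
Use filter() to subset msleep to only carnivores who weigh more than 50 kg, use arrange() to sort by bodywt, and then use select() to keep only the columns name, bodywt, brainwt in that order.
Added challenge: rename the column name to common_name. You can do this in select() using rename() syntax.
msleep %>%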
filter(vore == "carni", bodywt > 50) %>%
arrange(bodywt) %>%
select(name, bodywt, brainwt)
# Added challenge
msleep %>%
filter(vore == "carni", bodywt > 50) %>%
arrange(bodywt) %>%
select(common_name = name, bodywt, brainwt)
Repeat the previous exercise, but switch the order of arrange() and select(). Does this change the output? Understand why or why not.
msleep %>%
filter(vore == "carni", bodywt > 50) %>%
select(name, bodywt, brainwt) %>%
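arrange(brainwt)
Be careful with the order, though: if you select() first, the data no longer has a vore column, so you can’t possibly filter on it. The code below fails for that reason:
msleep %>%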
select(name, bodywt, brainwt) %>%
filter(vore == "carni", bodywt > 50) %>%
arrange(brainwt)
The code below attempts to keep only the bodywt and name columns for all mammals whose brain weight is less than 2. Alas, the code has a bug! Can you figure out WHY the code has a bug and fix the code?
msleep %>%
select(bodywt, name) %>%
filter(brainwt < 2)
# Need to reverse the order! select() had already dropped brainwt before filter() could use it
msleep %>%
filter(brainwt < 2) %>%
select(bodywt, name)
The code below attempts to keep only the sleep_rem column for carnivores, sorted by genus. Alas, the code has a bug yet again! Can you fix the code?
msleep %>%
filter(vore == "carni") %>%
select(sleep_rem) %>%
arrange(genus)
msleep %>%
arrange(genus) %>%
filter(vore == "carni") %>%
select(sleep_rem)
# Works just as well!
msleep %>%
filter(vore == "carni") %>%
arrange(genus) %>%
select(sleep_rem)
Here is an example of when distinct() is a helpful function. We want to answer this question using a dplyr framework: What are the unique vores in the dataset? To address this question, we need to first subset the data to only contain the vore column, and then use distinct():
First keep only the column vore.
msleep %>%
select(vore) %>%
distinct()
Wrangle msleep to arrive at a tibble that contains just the column vore and shows only the unique vores that mammals of the order "Carnivora" belong to. In other words, what do carnivores eat? (The answer should make some sense…)
msleep %>%
filter(order == "Carnivora") %>%
select(vore) %>%
distinct()
dplyr
Use summarize() to create a summarized data frame with a column mean_awake that contains the mean number of hours spent awake.
msleep %>%
summarize(mean_awake = mean(awake))
Repeat the previous exercise, but use mutate() rather than summarize(). The goal of this question is for you to see clearly how mutate() and summarize() differ.
msleep %>%
# Now we have a whole column where all rows have the same value.
mutate(mean_awake = mean(awake))
Let’s practice deciding between mutate() vs summarize(). We want to know: How does each mammal’s time spent awake compare to the average amount of time mammals spend awake? In this case, we’ll calculate the difference between each species’ time awake and the average time awake, and we want a row for each mammal showing the difference.
Your final tibble should contain only the columns name and awake_difference (representing the difference between the species and the average).
Think carefully about whether mutate() or summarize() is more appropriate.
msleep %>%
mutate(mean_awake = mean(awake)) %>%
mutate(awake_difference = awake - mean_awake) %>%
select(name, awake_difference)
# Done another way:
# You can use a single call to mutate() to make several columns at once, and new lines know about the previous ones!
msleep %>%
mutate(mean_awake = mean(awake),
awake_difference = awake - mean_awake) %>%
select(name, awake_difference)
# Done another way: No need to even make mean_awake!
msleep %>%
mutate(awake_difference = awake - mean(awake)) %>%
select(name, awake_difference)
Use summarize() and mean() to determine the average amount of time spent in REM sleep (column sleep_rem) by all mammals in the dataset msleep.
There are NA’s in this column, so you need to tell the function mean() to ignore NA’s with the extra argument na.rm = TRUE. Remember that?! It’s an argument to mean(), NOT to summarize()!!
msleep %>%
summarize(mean_rem = mean(sleep_rem, na.rm=TRUE))
Using group_by(), calculate the median (with summarize() and median()!) body weight (bodywt) of each vore group.
First use group_by(COLUMN-TO-GROUP-BY) (in this case, vore), and then pipe into summarize().
msleep %>%
group_by(vore) %>%
summarize(med_bodywt = median(bodywt))
Using group_by(), calculate the maximum (with summarize() and max()!) brain weight (brainwt) of each vore group. Then, sort the data according to maximum brain weight (your new well-named column!) with arrange().
msleep %>%
drop_na(brainwt) %>%
group_by(vore) %>%
summarize(max_brainwt = max(brainwt)) %>%
arrange(max_brainwt)
Using group_by(), calculate the mean body weight (bodywt) of each combination of vore and conservation groups.
You can group by multiple columns in group_by() just by listing the columns.
Can you also remove the rows that contain NA?
msleep %>%
group_by(vore, conservation) %>%
summarize(mean_bodywt = mean(bodywt)) %>%
drop_na()
Notice that the output is still grouped by vore. The preserved grouping can lead to unintended behavior. For example, let’s say I want to add up the mean_bodywt values: I should end up with 1 number (the sum), but I don’t!
To fix this, remove the grouping with ungroup() before proceeding:
It is a good habit to use ungroup() to be safe right after your grouped calculations.
msleep %>%
group_by(vore, conservation) %>%
summarize(mean_bodywt = mean(bodywt)) %>%
# REMOVE ANY PREVIOUS GROUPINGS AFTER DONE WITH GROUPED CALCULATIONS:
ungroup() %>%
drop_na() %>%
# Add up the mean_bodywt values
summarize(sum_bw = sum(mean_bodywt))
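To see the leftover grouping in action, here is a minimal sketch on a made-up tibble (toy and its values are hypothetical). The .groups = "drop_last" argument only spells out what summarize() does by default in recent dplyr versions, so it is an assumption that your installed version accepts it (dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical toy data: two vores, two conservation statuses
toy <- tibble(
  vore = c("herbi", "herbi", "carni", "carni"),
  conservation = c("lc", "en", "lc", "en"),
  bodywt = c(10, 20, 30, 40)
)

# After this step the result is still grouped by vore
means <- toy %>%
  group_by(vore, conservation) %>%
  summarize(mean_bodywt = mean(bodywt), .groups = "drop_last")

# Without ungroup(): one sum PER vore, so two rows
still_grouped <- means %>%
  summarize(sum_bw = sum(mean_bodywt))

# With ungroup(): a single overall sum, so one row
ungrouped <- means %>%
  ungroup() %>%
  summarize(sum_bw = sum(mean_bodywt))
```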
Use count() to count how many mammals belong to each taxonomic order (column order) in the dataset msleep. Rename the new column this creates to be called order_count, and then sort the output in descending order of order_count.
count(COLUMN) is an awesome shortcut for counting all observations in each group of COLUMN. Need help? Use get_help("count").
Look at the output of count() to see the name of the new column. That is the column you want to rename() and then arrange() on!!!
msleep %>%
count(order) %>%
rename(order_count = n) %>%
arrange(desc(order_count))
Which group has the higher mean body weight: herbivores or insectivores?
# There are MANY WAYS to arrive at this solution! Below is one good option:
msleep %>%
filter(vore %in% c("herbi", "insecti")) %>%
group_by(vore) %>%
summarize(mean_bodywt = mean(bodywt)) %>%
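arrange(desc(mean_bodywt)) # answer: herbivores!
Which group has the higher mean ratio of body weight to brain weight: domesticated mammals or all other mammals?
Creating a new column with if_else() will be useful here!
You will need to remove NAs. How do I know? I LOOKED AT THE DATA!
msleep %>%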
# I personally find it much easier to see what's going on by only keeping these 3 columns
select(conservation, bodywt, brainwt) %>%
# Remove rows where at least one of our variables of interest is NA. How do I know to do this? I first ran the code without this line, and the results had tons of NA! So, I should remove them
drop_na(conservation, bodywt, brainwt) %>%
mutate(cons_type = if_else(conservation == "domesticated", "dom", "notdom")) %>%
group_by(cons_type) %>%
mutate(ratio = bodywt/brainwt) %>%
summarize(mean_ratio = mean(ratio)) # answer: domesticated