Instructions: Homework #2

For this homework, you will be writing your very first RMarkdown file and practicing your base R skills.

Instructions

Obtain the homework from your RStudio Cloud class project by running the following code in the R Console:
```
library(ds4b.materials) # Load the class library
launch_homework(2)      # Launch Homework 2
```
Answer each question with appropriate code and comments in the given question’s R chunk. Chunks are named as q1 for question 1, q2 for question 2, etc.
- Always ensure your answers are being produced by code. No credit will be given for answers that are not directly produced by code. Please refer to the examples at the bottom of these instructions for examples of acceptable and unacceptable code.
- You should have one comment per line of code. Comments should be written in “plain english” and give brief descriptions about what the code is doing. Comments should be written above the line of code they refer to. You will not be graded on whether I think your comment is “good enough” - your comments need only be “good enough” for YOU to understand what your code is doing.
You must set an RMarkdown theme and code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
- Choose your favorite theme among the pre-packaged themes (ignore everything below “Even More Themes”) shown at this link
- Choose your favorite syntax highlighting among these options at this link
Don’t forget to watch these short videos as resources for navigating/customizing RStudio, writing RMarkdown documents, and using the introverse!

Homework Questions

Define two variables a and b as values 8 and 64, respectively. Use a logical operator and the function sqrt() to ask if a is equal to the square root of b, and print the output. The code should print TRUE if they are equal and FALSE if not.
- Need help? Run get_help("logical") and get_help("sqrt") in the R Console after loading the introverse library.

Define a variable success_status whose values depends on a logical comparison, using variables a and b from the previous question. If a squared equals b, success_status should be defined with the value “eureka”. Otherwise, if a squared does not equal b, success_status should be defined with the value “wompwomp”. Print the value of success_status after it is defined.
- Need help? Run get_help("ifelse") in the R Console after loading the introverse library.
- Do NOT redefine the a and b variables. You defined them in question #2, and you are still coding in the same Rmarkdown file. We always want to avoid repeating code since it often leads to unexpected bugs.

Define an array variable called mammals_vector to contain these strings: “camel”, “civet”, “mink”, “bat”, “ferret”, and “pangolin” (these are all mammals known to be vectors for either SARS-CoV-2 or one of its close relatives SARS and MERS). Use the logical operator %in% to determine if the string “elephant” is in the array. If “elephant” is in the array, your code should print TRUE, and FALSE otherwise.
- Need help? Run get_help("logical") in the R Console after loading the introverse library.

Define an array variable called dinosaurs to contain the three strings: “eagle”, “ostrich”, and “sparrow” (wait, those are all birds?! That’s right, birds are dinosaurs!). Then, define two other variables called mammals_length and dinosaurs_length. The variable mammals_length should be defined as the result of running length() on the mammals_vector array you defined in the previous question. The variable dinosaurs_length should similar be defined as the result of running length() on the dinosaurs array. Finally, use a logical operator to compare these variables. If mammals_length is larger than dinosaurs_length (in other words, the mammals array is longer than the dinosaurs array), your code should print TRUE. Otherwise, your code should print FALSE.
- Need help? Run get_help("length") in the R Console after loading the introverse library.
- Make sure to have variables dynamically assigned with length()! Scroll to the bottom of these instructions to see what I mean by making sure code produces the length variables!

Use the nchar() function to determine how many characters are in each string in the dinosaurs array, and save the result of this operation to a variable nchar_dinosaurs. Then, use a logical operator to ask if each value is equal to 7. Your code should print a new array containing TRUE or FALSE values representing whether this condition was met for each value.
- Need help? Run get_help("nchar") in the R Console after loading the introverse library.

Use a colon (:) to create a variable called q6 as an array of numbers in order -12 through 17. From this variable q6, calculate the mean, median, standard deviation, and sum (with functions mean(), median(), sd(),and sum(), respectively) of this array. Each calculation should be saved to a variable called, respectively: q6_mean, q6_median, q6_sd, and q6_sum. Finally, print these four variables. You do not need to print q6.
- Need help? Run get_help("mean") in the R Console after loading the introverse library.

Again consider the array q6. Use a logical operator to ask if all values in this array are less than or 0 (aka, is q6 less than or equal to 0?). The result of this code will be an array itself containing TRUE and FALSE values corresponding to whether the $\leq0$ condition was met for each value in q6. Save this resulting array to a variable called q6_leq0 (“leq” = “less than or equal to”), and print the value of q6_leq0.
- Need help? Run get_help("logical") in the R Console after loading the introverse library.

Teach yourself how to use the function all() using the introverse help docs, by running get_help("all") in the R Console, after you’ve loaded the introverse library. Then, use this function to determine whether all values of q6 are less than or equal to 0. Your code should print a single TRUE if all values are less than or equal to 0, and a single FALSE otherwise.
- Hint: The code is almost exactly identical to the code for question 7!

Again consider the array q6. For this question, find the sum of the absolute values of this array. To achieve this, you will need to first get the absolute value of the array q6 (using the abs() function), and then you will need to sum those values (using the sum() function). Your code chunk should reveal the final sum.
- Need help? Run get_help("sum") or get_help("abs") in the R Console after loading the introverse library.
- Hint: You can approach this question in several ways, as long as you use abs() and then sum() on q6 in that order.

There is a super useful function called paste() which allows you to strings (characters) together to create a new string. See how this function works below:

# combine the three strings "string 1", "and", and "string 2" into ONE string
combined_string <- paste("string 1", "and", "string 2")

# reveal value
combined_string

## [1] "string 1 and string 2"

# you can also include variables:
a_string <- "hello there"
combined_string2 <- paste("string 1", a_string, "string 2")

# reveal value
combined_string2

## [1] "string 1 hello there string 2"

# Let's use paste() to directly write a sentence (HINT!!)
# (don't worry about having a period at the end of sentence!)
variable_value <- 10
paste("The value of the variable is", variable_value)

## [1] "The value of the variable is 10"

Use the paste() function and the variable q6_mean you previously defined to print a sentence that says: “The mean of q6 is ….” where …. is the the q6_mean value. Your code should NOT include a number directly (don’t copy/paste the value of the mean). You may ONLY refer to this number value with the q6_mean variable.

Need help? Run get_help("paste") in the R Console after loading the introverse library.

For the remaining questions, you will consider the built-in R dataset iris, a famous dataset with 150 rows and 5 columns that provides physical measurements for 150 iris specimens from three different species.

For this question, write code to determine the following information about iris. You need to use a single function for each task (there will be four lines of code in the end), and all output should be printed from the chunk.
- The number of rows (using nrow())
- The number of columns (using ncol())
- The names of the columns (using names())
- A summary of the distribution of each column in the data (using summary())

The goal of this question is to answer, “do sepal lengths or petal lengths have more variation?” When comparing variation across different variables, we can calculate the _coefficient of variation (COV) which is simply the standard deviation divided by the mean. Calculate the COV for sepal lengths and petal lengths, and save to variables cov_sepal_length and cov_petal_length, respectively. Then, use a logical operator to ask whether sepal lengths have more (>) variation than petal lengths. Your code should print TRUE if sepal lengths have more spread, and FALSE otherwise.

Calculate the mean petal width and save the result to a variable mean_petal_width. Then, use paste() to print a sentence that reads: "The mean petal width in the iris dataset is ...", where ... is replaced with the value in mean_petal_width. Again, make sure your code is only using variables and not values directly.

The iris variable Species is a factor variable, meaning R recognizes it as containing categories. R refers to factor categories with the term levels. By contrast, the variable Sepal.Width is numeric, not a factor, so R doesn’t recognize that it has categories (which is reasonable! it’s numeric!). The R function levels() can be used to quickly see the categories (levels!) in a factor variable. For this question, practice using the levels() by running it twice: on the Species column and on the Sepal.Width column. Your code should simply print the result of running the levels() function on those two columns.
- Hint: The code output will be consistent with whether the variables are factors. Non-factor variables don’t have any levels and the code result will reflect this.

To learn a little more about factors, let’s coerce some columns into and away from factors. To do this, we’ll create a new dataset called iris2 containing some coerced columns. _Copy and paste the following code into the R chunk for this question (this code should be the first code in your answer!!). This code creates a new data frame iris2 that contains the same information as iris, but Species is no longer factor (it is now character), and Sepal.Width is no longer numeric (it is now a factor).
```
# Create a copy of iris in iris2
iris2 <- iris

# Coerce Species into a character. Previously, it was a factor.
iris2$Species <- as.character(iris2$Species)

# Coerce Sepal.Width into a factor. Previously, it was numeric.
iris2$Sepal.Width <- as.factor(iris2$Sepal.Width)
```
Again write code with levels() to see the categories in the iris2 columns Species and Sepal.Width, and notice (on your own - no need to write anything) how the output differs. Finally, write two more line of code in this chunk to calculate the mean of the Sepal.Width column in iris AND in iris2, which should also be printed out. In ~1 sentence of written text below the code chunk, explain why IN YOUR OWN WORDS this calculation didn’t work as well as you might have hoped (hint: the output from levels() is informative!).

What NOT to do vs what TO do

The following code chunks provide examples of how NOT TO and how TO answer a hypothetical homework question using code. Notice how there are many ways to correctly achieve the right result! Welcome to coding :)

Hypothetical homework question: What is the length of an array with values 4, 6, and 10?

NO! This code is not acceptable, largely because it is not code - it is just the number 3. There is no code used at all to answer the question. Even if it is “easy” to do something in your head, you must use code to answer it. It also will not produce any output from a script since it is not formally printed.

## [1] 3

NO! This code is not acceptable, because even though code was used at some point, the code itself does not directly produce the output. This example represents a situation where you might run length(c(4,6,10)) in the R Console, or run length(c(4,6,10)) in the chunk but later delete or comment it out. You might look at the output and say, “the code calculated 3, so the answer is 3.” But, this is not ok, since the code itself must reveal the number 3.

#length(c(4,6,10))
3

## [1] 3

NO! This is getting much closer, since we now have the fully code needed to generate the output. However, nothing is actually printed, so the output is not revealed. It is critical to always reveal the output!

answer <- length(c(4,6,10))

YES! This “fixes” the previous “NO!” example by revealing the output.

answer <- length(c(4,6,10))
answer

## [1] 3

YES! Only code is used to produce the output.

# YES!
length(c(4,6,10))

## [1] 3

YES! Only code is used to produce the output. This strategy uses improved coding style by defining a variable for the array, and then using length() to calculate the length of the variable.

# YES!
my_array <- c(4,6,10)
length(my_array)

## [1] 3

YES! Only code is used to produce the output. This strategy uses improved coding style by defining a variable for the array, and then defining another variable to contain the output from the length() calculation.

# YES!
my_array <- c(4,6,10)
the_length <- length(my_array)
the_length

## [1] 3

Unless the instructions explicitly ask you to define certain variables, all “YES!” code versions here are equivalent and entirely accceptable!

Instructions: Homework #2

Data Science for Biologists, Fall 2021

Complete the template Rmd and submit to Canvas on (POSTPONED) FRIDAY 9/24/21 by 11:59 PM

Instructions

Homework Questions

What NOT to do vs what TO do