Homework Dataset Background

This dataset, called coffee_ratings, provides ratings of different coffee beans produced around the world. Each row is a single coffee bean, and each column provides different information about that coffee bean, including its total rating (variable total_cup_points, on a 100-point scale), what species and variety of coffee it is, where the coffee comes from, the number of defects detected per coffee, and graded measurements, on a scale of 10, for various coffee attributes (like aroma or sweetness). It has been subset (to only 20 rows and 14 columns) from the full dataset here.

You will see some entries in the dataset that are NA, which implies a missing value. Sometimes, data is missing or unknown, and it doesn’t necessarily imply anything is wrong with the dataset. Instead, it implies that data collection can be challenging, and you can’t always get all the information you want.
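As a minimal sketch of what a missing value looks like in practice, the Python snippet below counts missing entries per variable. The three rows are made up for illustration and are not taken from coffee_ratings; None stands in for NA, and the helper name missing_counts is hypothetical.

```python
# Made-up rows for illustration only (NOT the real coffee_ratings data).
# None plays the role of NA, i.e. a value that was never recorded.
rows = [
    {"species": "Arabica", "variety": "Caturra", "sweetness": 10.0},
    {"species": "Arabica", "variety": None, "sweetness": 9.33},
    {"species": "Robusta", "variety": None, "sweetness": None},
]


def missing_counts(rows):
    """Return a dict mapping each variable name to its number of missing values."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            # Adding a bool to an int adds 1 when the value is missing, else 0.
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


counts = missing_counts(rows)
print(counts)
```

A variable with a count above zero has at least one unknown value; a count of zero means every row has a recorded value for that variable.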

Explore the dataset

Use the “Show [] entries” dropdown menu below to see all 20 rows.

Scroll to the right to see all columns.

You can use the “Search” field to find values or variables in the data. CAUTION: Searching for NA values does not work.

Click on a column header to sort the data by that column.

Specific Instructions:

Answer these questions about the dataset variables and how one might visualize the data. When referring to variable names in the data, you must be absolutely precise with spelling and capitalization.

For example, there is no variable called Total Cup Points but there is a variable called total_cup_points.

You do not need to write in complete sentences except for one of your answers in Set 2 (see specific instructions below).

Set 1

Instructions: For these questions, simply list the variables.

  1. Which variable(s), if any, in this dataset is/are categorical?

  2. Which variable(s), if any, in this dataset is/are numeric continuous?

  3. Which variable(s), if any, in this dataset is/are numeric discrete?

  4. For which variable(s), if any, do we NOT know all of the values (in other words, there is missing data)?



Set 2

Instructions: For these questions, indicate if the presented information is a variable or a value in the dataset. If it is a variable, provide one of the values associated with the variable (don’t make it up - find a value in the data!). If it is a value, tell me what variable the value belongs to.

For all items below, you CAN answer the first part (variable or value?). But, for ONE AND ONLY ONE of the items below, you can’t definitively answer the second part of the question (if variable give me a value, or if value give me the variable). For this one ambiguous item, explain in 1 sentence why it is ambiguous.

As an example, total_cup_points is a variable, and one of its values is 83.3. By contrast, “Arabica” is a value that is associated with the variable species.

Hint: Use the search bar above the table to help you find values!

  1. variety

  2. India

  3. 1

  4. Caturra

  5. veracruz

  6. sweetness

  7. 0.12

  8. 83.25

  9. farm_name



Set 3

Instructions: For these questions, simply answer whether the choice of data visualization is appropriate or not with “yes” or “no”. Your answers should be based ONLY on whether the type of plot is technically appropriate for visualizing those data types (don’t worry about whether it might “look pretty” or not).

  1. You make a scatterplot to visualize the distribution of sweetness coffee rating values.

  2. You make a barplot to visualize how many coffees come from each country_of_origin in the dataset.

  3. You make a histogram to visualize the distribution of flavor rating values.

  4. You make a violin plot to visualize the distribution of acidity rating values for each species of coffee.

  5. You make a scatterplot to visualize the relationship between coffee balance and sweetness rating values.

  6. You make a boxplot to visualize how many coffees of each species there are.

  7. You make a density plot to visualize the relationship between the number_of_bags produced and the number_of_defects found for each coffee.

  8. You make a strip/jitter plot to visualize the full distributions of flavor ratings for coffee grown in each region.

  9. You make a barplot to visualize the full distribution of total_cup_points coffee ratings.



Set 4

These questions consider five different datasets that are not coffee_ratings, as shown below. For each dataset, indicate whether it is tidy by answering simply yes or no. Then…

  • If the data is tidy, state whether each variable is categorical, numeric continuous, or numeric discrete.
  • If the data is not tidy, state which of the three qualities of tidy data are violated (one or more can be violated at a time!). You can simply use 1, 2, and/or 3, as described below…

Recall that tidy data has all of these qualities:

  1. All columns are variables
    • This means for each value in a column, you should be able to say “this value can be one of [variable name].” For example, considering coffee_ratings, “Arabica” can be one of species.
  2. All rows contain only a single observation
    • Each row should contain ONE measured data point. For example, each row in coffee_ratings contains information about ONE recorded coffee bean.
    • Hint: This means that, if the data records an experimental result with multiple measured results for a given individual, those results should be on separate rows.
  3. Each value is a single value; multiple values should not be stored in a single table cell.
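To make the second quality more concrete, here is a minimal Python sketch that reshapes a small "wide" table into a tidy "long" one. The plant-height table, its column names, and the helper to_long are all hypothetical and chosen only for illustration; they are not one of the five datasets in this set.

```python
# Made-up wide table: one row per plant, one column per measurement week.
# Each row holds TWO measured data points, so it is not tidy.
wide = [
    {"plant": "p1", "week1": 3.1, "week2": 4.0},
    {"plant": "p2", "week1": 2.8, "week2": 3.5},
]


def to_long(rows, id_key, names_to, values_to):
    """Put each measurement on its own row.

    id_key    -- column identifying the individual (kept on every row)
    names_to  -- new variable holding the old column names
    values_to -- new variable holding the measured values
    """
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is not a measurement
            long_rows.append({id_key: row[id_key], names_to: key, values_to: value})
    return long_rows


tidy = to_long(wide, "plant", "week", "height")
for row in tidy:
    print(row)
```

After reshaping, every column is a variable (plant, week, height) and every row contains exactly one measured data point.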


Dataset 1

This dataset shows results from an experiment assessing the effect of different pesticides (here, called A, B, …, F) on insect control in agriculture. Different plots were sprayed with each pesticide, and the number of insects in each plot was counted after a certain period of time.


Dataset 2

This dataset shows results from an experiment assessing oxygen use in seals when they dive. Researchers were testing the hypothesis that seals use more oxygen during feeding dives (i.e., dives to hunt and eat) compared to non-feeding dives (i.e., dives without hunting/eating). Oxygen usage was measured twice for each seal: once on a non-feeding dive and once on a feeding dive.


Dataset 3

This dataset shows partial results from a case-control study of esophageal cancer which enrolled individuals with esophageal cancer (“cases”) and individuals without cancer (“controls”). Researchers obtained data from each patient, including their age and how much alcohol and tobacco they consume per day (don’t worry about units). This data shows how many patients were enrolled in each combination of age, alcohol usage, and tobacco usage groups.


Dataset 4

This dataset shows the number of monthly deaths from lung disease, including bronchitis, emphysema, and asthma, in the UK from 1974 to 1979.


Dataset 5

This dataset shows the number of tuberculosis cases out of the total population in three different countries as recorded in years 1999 and 2000.