This dataset, called coffee_ratings, provides ratings of different coffee beans made around the world. Each row is a single coffee bean, and each column provides different information about that coffee bean, including its total rating (variable total_cup_points, on a scale of 100), what species and variety of coffee it is, information about where the coffee comes from, information about the number of defects detected per coffee, and different graded measurements, on a scale of 10, for various coffee attributes (like aroma or sweetness). It has been subsetted (to only 20 rows and 14 columns) from the full dataset here.
You will see some entries in the dataset that are NA, which indicates a missing value. Sometimes, data is missing or unknown, and it doesn’t necessarily imply anything is wrong with the dataset. Instead, it shows that data collection can be challenging, and you can’t always get all the information you want.
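As a minimal sketch of how missing values can be spotted programmatically, the Python snippet below counts NA entries per column in a small made-up table. The column names and values are hypothetical illustrations, not rows from coffee_ratings, and treating the literal text NA as the missing-value marker is an assumption about how such data might be exported.

```python
import csv
import io

# Hypothetical toy data (NOT taken from coffee_ratings); "NA" marks a missing entry.
TOY_CSV = """name,altitude,score
bean_a,1200,8.5
bean_b,NA,7.9
bean_c,NA,NA
"""


def count_missing(csv_text, missing_marker="NA"):
    """Return a dict mapping each column name to its number of missing entries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {column: 0 for column in reader.fieldnames}
    for row in reader:
        for column, value in row.items():
            if value == missing_marker:
                counts[column] += 1
    return counts


missing_counts = count_missing(TOY_CSV)
print(missing_counts)
```

A column with a count above zero is one for which not all values are known; the rest of the table can still be perfectly usable.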
Use the “Show [] entries” dropdown menu below to see all 20 rows.
Scroll to the right to see all columns.
You can use the “Search” field to find values or variables in the data. CAUTION: Does not work to search for NAs.
Click on a column header to sort the data by that column.
Answer these questions about the dataset variables and how one might visualize the data. When referring to variable names in the data, you must be absolutely precise with spelling and capitalization.
For example, there is no variable called Total Cup Points, but there is a variable called total_cup_points.
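The need for exact spelling mirrors how most software looks up a variable by name. Here is a minimal Python sketch, assuming a hypothetical one-row record stored as a dictionary; the helper name has_variable is invented for illustration, and the value 83.3 is the example value quoted in this assignment.

```python
# A hypothetical single record keyed by exact variable name.
record = {"total_cup_points": 83.3}


def has_variable(row, name):
    """Return True only if `name` matches a key exactly (spelling and capitalization)."""
    return name in row


print(has_variable(record, "total_cup_points"))
print(has_variable(record, "Total Cup Points"))
```

The first lookup succeeds and the second fails, because the comparison is exact: spaces and capital letters make it a different name.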
You do not need to write in complete sentences except for one of your answers in Set 2 (see specific instructions below).
Instructions: For these questions, simply list the variables.
Which variable(s), if any, in this dataset is/are categorical?
Which variable(s), if any, in this dataset is/are numeric continuous?
Which variable(s), if any, in this dataset is/are numeric discrete?
For which variable(s), if any, do we NOT know all of the values (in other words, there is missing data)?
Instructions: For these questions, indicate if the presented information is a variable or a value in the dataset. If it is a variable, provide one of the values associated with the variable (don’t make it up - find a value in the data!). If it is a value, tell me what variable the value belongs to.
For all items below, you CAN answer the first part (variable or value?). But, for ONE AND ONLY ONE of the items below, you can’t definitively answer the second part of the question (if variable give me a value, or if value give me the variable). For this one ambiguous item, explain in 1 sentence why it is ambiguous.
As an example, total_cup_points is a variable, and one of its values is 83.3. By contrast, “Arabica” is a value that is associated with the variable species.
Hint: Use the search bar above the table to help you find values!
variety
India
1
Caturra
veracruz
sweetness
0.12
83.25
farm_name
Instructions: For these questions, simply answer whether the choice of data visualization is appropriate or not with “yes” or “no”. Your answers should be based ONLY on whether the type of plot is technically appropriate for visualizing those data types (don’t worry about whether it might “look pretty” or not).
You make a scatterplot to visualize the distribution of sweetness coffee rating values.
You make a barplot to visualize how many coffees come from each country_of_origin in the dataset.
You make a histogram to visualize the distribution of flavor rating values.
You make a violin plot to visualize the distribution of acidity rating values for each species of coffee.
You make a scatterplot to visualize the relationship between coffee balance and sweetness rating values.
You make a boxplot to visualize how many coffees of each species there are.
You make a density plot to visualize the relationship between the number_of_bags produced and the number_of_defects found for each coffee.
You make a strip/jitter plot to visualize the full distributions of flavor ratings for coffee grown in each region.
You make a barplot to visualize the full distribution of total_cup_points coffee ratings.
These questions consider five different datasets that are not coffee_ratings, as shown below. For each dataset, indicate if it is tidy or not by answering simply yes or no. Then…
Recall that tidy data has all of these qualities:
Each variable is in its own column.
Each value is in its own cell. For example, in coffee_ratings, “Arabica” can be one of the values of species.
Each observation is in its own row. For example, each row in coffee_ratings contains information about ONE recorded coffee bean.
This dataset shows results from an experiment assessing the effect of different pesticides (here, called A, B, …F) on insect control in agriculture. Different plots were sprayed with each pesticide, and the number of insects in each plot was counted after a certain period of time.
This dataset shows results from an experiment assessing oxygen use in seals when they dive. Researchers were testing the hypothesis that seals use more oxygen during feeding dives (i.e., a dive to hunt and eat) compared to non-feeding dives (i.e., a dive without hunting/eating). Oxygen usage was measured twice for each seal: once on a non-feeding dive and once on a feeding dive.
This dataset shows partial results from a case-control study of esophageal cancer which enrolled individuals with esophageal cancer (“cases”) and individuals without cancer (“controls”). Researchers obtained data from each patient, including their age and how much alcohol and tobacco they consume per day (don’t worry about units). This data shows how many patients in each combination of age, alcohol usage, and tobacco usage groups were enrolled.
This dataset shows the number of monthly deaths from 1974-1979 from lung disease, including bronchitis, emphysema and asthma, in the UK.
This dataset shows the number of tuberculosis cases out of the total population in three different countries as recorded in years 1999 and 2000.
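To make the idea of tidy data concrete, here is a hedged Python sketch that reshapes a small made-up "wide" table, in which one variable's values are hidden in the column headers, into a "long" form with one observation per row. The table, its column names (site, count_2020, count_2021), and the helper name wide_to_long are all hypothetical and unrelated to the five datasets in this set.

```python
# Hypothetical wide table: the year is hidden inside the column names.
wide_rows = [
    {"site": "north", "count_2020": 4, "count_2021": 7},
    {"site": "south", "count_2020": 2, "count_2021": 5},
]


def wide_to_long(rows, id_column, prefix):
    """Turn each column named `<prefix><year>` into its own (id, year, count) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                year = int(column[len(prefix):])
                long_rows.append(
                    {id_column: row[id_column], "year": year, "count": value}
                )
    return long_rows


tidy_rows = wide_to_long(wide_rows, id_column="site", prefix="count_")
for tidy_row in tidy_rows:
    print(tidy_row)
```

In the reshaped result, site, year, and count are each a column, every row is a single site-year observation, and every cell holds exactly one value.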