Obtain the homework template from your RStudio Cloud class project by running the following code in the R Console:
library(ds4b.materials) # Load the class library
launch_homework(5) # Launch Homework 5
You must set an RMarkdown theme and code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
Make sure your Rmd knits without errors before submitting. If it does not produce an HTML output, this means it does not knit. DO NOT SKIP THIS STEP! Ensuring code runs without errors is MORE IMPORTANT than writing code in the first place.
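If you would rather check knitting from the Console than with the Knit button, here is one possible sketch. The helper name `knits_ok` and the file name `homework5.Rmd` are placeholders made up for this example, not part of the class library, so swap in the actual name of your Rmd file.

```r
# Hypothetical helper (not from ds4b.materials): TRUE if the file knits, FALSE if not.
knits_ok <- function(rmd_path) {
  tryCatch({
    rmarkdown::render(rmd_path, quiet = TRUE)  # knit the document
    TRUE
  }, error = function(e) {
    # Any knitting error lands here, including a missing file
    message("Knitting failed: ", conditionMessage(e))
    FALSE
  })
}

# "homework5.Rmd" is a placeholder; use the real name of your Rmd file.
knits_ok("homework5.Rmd")
```

If this reports a failure, the message should point you toward the chunk or line that needs fixing before you submit.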
As always, you are encouraged to work together and use the class Slack to help each other out, but you must submit YOUR OWN CODE.
You will be conducting exploratory data analysis on three different datasets for this assignment. For each dataset, you must ask two exploratory/scientific questions, make a figure that addresses each question, and finally answer each question in 1-2 brief sentences. There are five options for datasets to explore, and you must choose THREE AND ONLY THREE OF THEM!!
HUGE HINT!! BIGGEST HINT!! One way to think about this assignment is that you could first make some figures of the dataset (play around, move fast, break things, and plot stuff!). Once you have a figure you like, figure out: What does this figure tell me about the data, and how can I phrase that in the form of a question? “Reverse engineering” exploratory data analysis like this when you are first starting out is a great way to get comfortable exploring datasets. Spielman highly recommends making some plots in a separate R script or directly in the Console, and when you make a plot you like for the HW, pop it into the Rmarkdown template and add a question/answer accordingly.
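As a rough sketch of the "plot first, ask later" approach described above, here are two quick plots of the built-in `iris` data (used only because it is the template's example dataset; the variable choices are arbitrary). After each plot, try phrasing what you see as a question.

```r
library(ggplot2)

# Quick look 1: is sepal width related to sepal length?
p_scatter <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# Quick look 2: how is petal length distributed within each species?
p_box <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

# Print the plots in the Console or a scratch script
p_scatter
p_box
```

For instance, the boxplot could be reverse engineered into a question like "Which species tends to have the longest petals?", and the plot itself supplies the answer.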
Name of dataset. Make sure to use backticks around the dataset name.
Use the name nameofdataset_plotX, where X is either 1 or 2 for the two plots. Please see the templated example!
Place aes() wherever you want, as long as the plot works! You can include aesthetics on their own, within the ggplot() call, or within the relevant geom function. There is lots of flexibility for how you code aesthetic mappings, so use this opportunity to explore your coding style preference.
Make sure each figure addresses its question. For example, with the iris dataset, a scatterplot showing the relationship between sepal width and sepal length (the example plot in the template!) does not address the question, “How many of each species are in the dataset?” (even if the points are colored by species!!!).
Use a non-default theme (i.e., not theme_gray()) for each plot you make. There are several ways to accomplish non-default themes:
Use theme_set() for your entire script
Use theme_gray(), but customize certain theme elements
Check your palettes with the colorblindr function cvd_grid(), as we have seen in class. Furthermore, each figure with a palette should have a DIFFERENT PALETTE! This is important so you can practice working with different types of palettes.
You must choose THREE and ONLY THREE datasets, and ask two questions about each. Datasets are available in the ds4b.materials library.
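To illustrate the flexibility in where aesthetic mappings can go, here is a minimal sketch of the same scatterplot written three ways. It uses the built-in `iris` data, and the plot names `p1`, `p2`, and `p3` are just for this example. All three should draw the same figure, so pick whichever style reads best to you.

```r
library(ggplot2)

# Style 1: mappings inside the ggplot() call, inherited by every geom
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Style 2: mappings inside the relevant geom function
p2 <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species))

# Style 3: mappings added on their own
p3 <- ggplot(iris) +
  aes(x = Sepal.Length, y = Sepal.Width, color = Species) +
  geom_point()
```

One practical difference: mappings given in `ggplot()` or on their own apply to every geom you add, while mappings inside a geom apply only to that geom.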
pima
This dataset contains physical measurements from Pima Indian women from the American southwest. This population has been heavily studied by epidemiologists since they tend to have high levels of Type II Diabetes. Variables include:
npreg: number of times the woman has been pregnant
glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test (units: mg/dL)
dbp: diastolic blood pressure (units: mm Hg)
skin: triceps skin-fold thickness (units: mm)
insulin: 2-hour serum insulin level (units: μU/mL)
bmi: Body Mass Index
age: age in years
diabetic: whether or not the individual has diabetes
urine
This dataset contains urinalysis measurements (don’t worry about units) from 78 men, indicating whether traces of kidney stones (aka “crystal”) were detected in their urine samples. Variables include:
crystal: Whether calcium oxalate crystals (kidney stones) were detected. “No” means there are no kidney stones, and “Yes” means there are.
gravity: The specific gravity of the urine
pH: The pH of the urine
osmo: The osmolarity of the urine
conduct: The conductivity of the urine
urea: The urea concentration
damselfly
This dataset details physical measurements from collected samples of the damselfly Ischnura ramburii from the Hawaiian Islands Oahu, Kauai, and Hawaii. Researchers originally collected this data to study these damselflies’ unique color patterns: all males are blue-green, while some females are orange and some are blue-green like the males. The orange females are referred to as “gynomorphs” (female-like morphs) and the blue-green females are referred to as “andromorphs” (male-like morphs). The dataset contains the following variables:
Island: the Hawaiian island from which the individual damselfly was collected
Sex: the damselfly’s sex
Morphology: the damselfly’s morphology (color pattern, as described above: “gyno” for gynomorph, and “andro” for andromorph)
Wing.size: wing size, in pixels
Mating.status: whether or not the damselfly was a member of a mated pair
Abdomen.length: length of abdomen, in mm
biopsy
This dataset contains measurements obtained from breast cancer biopsy samples at the University of Wisconsin Hospitals, Madison. Each of nine tumor attributes was scored on a scale of 1-10, and the outcome column indicates whether the tumor was benign or malignant.
wine
This (familiar!) dataset contains information from a chemical analysis of three different cultivars (A, B, and C) of wine, including alcohol percentage and amounts of different chemical components. Variables include:
Cultivar: The wine cultivar (A, B, or C)
Alcohol: The alcohol percentage of the wine
MalicAcid: The percentage of the wine that is malic acid
Ash: The percentage of the wine that is ash
Magnesium: The percentage of the wine that is magnesium
TotalPhenol: The percentage of the wine that is phenols
Flavonoids: The percentage of the wine that is flavonoids
NonflavPhenols: The percentage of the wine that is non-flavonoid phenols
Color: The color intensity of the wine, measured numerically
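Whichever three datasets you pick, the theme and palette requirements apply to every plot. Here is one possible sketch using the built-in `iris` data rather than a homework dataset. The particular themes and palettes (`theme_bw()`, Brewer "Dark2", viridis) are arbitrary choices for illustration, not required ones, and you should still check your own palettes with `cvd_grid()`.

```r
library(ggplot2)

# Way 1: set a non-default theme once for the entire script
theme_set(theme_bw())

# A bar plot counts rows per species, so it can address
# "How many of each species are in the dataset?"
iris_plot1 <- ggplot(iris, aes(x = Species, fill = Species)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2")   # palette for figure 1

# Way 2: keep theme_gray() for one plot, but customize certain theme elements
iris_plot2 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_viridis_d() +              # a DIFFERENT palette for figure 2
  theme_gray() +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank())

# To check a palette as in class (requires the colorblindr package):
# colorblindr::cvd_grid(iris_plot2)
```

Note that the two figures use different palettes, and that the plot names follow the nameofdataset_plotX pattern.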