Obtaining and setting up the homework

Instructions

You will be conducting exploratory data analysis on three different datasets for this assignment. For each dataset, you must ask two exploratory/scientific questions, make a figure that addresses your question, and finally answer the question briefly, in 1-2 sentences. There are five options for datasets to explore, and you must choose THREE AND ONLY THREE OF THEM!!

HUGE HINT!! BIGGEST HINT!! One way to think about this assignment is that you could first make some figures of the dataset (play around, move fast, break things, and plot stuff!). Once you have a figure you like, figure out: What does this figure tell me about the data, and how can I phrase that in the form of a question? “Reverse engineering” exploratory data analysis like this when you are first starting out is a great way to get comfortable exploring datasets. Spielman highly recommends making some plots in a separate R script or directly in the Console, and when you make a plot you like for the HW, pop it into the Rmarkdown template and add a question/answer accordingly.

Rules for plotting

  • There is no requirement to make certain types of plots. You are not specifically required to make a different plot for each question. There will be no deductions for repeated plots/geoms.
  • You can place aes() wherever you want, as long as the plot works! You can include aesthetics on their own, within the ggplot() call, or within the relevant geom function. There is lots of flexibility for how you code aesthetic mappings, so use this opportunity to explore your coding style preference.
  • You should make the type of plot that, in your opinion, is able to address your exploratory/scientific question. There will be deductions if your plot is not at all related to your question.
    • For example, considering the iris dataset, a scatterplot showing the relationship between sepal width and sepal length (the example plot in the template!) does not address the question, “How many of each species are in the dataset?” (even if the points are colored by species!!!).
  • As in HW4, you must save plots to variables, and only reveal plots by printing the variable.
  • Although you can make any kind of plot you want, your plots MUST have the following features:
    • You must use a non-default theme (as in, do not use the default theme_gray() as-is) for each plot you make. There are several ways to accomplish non-default themes:
      • Set a different theme for each plot
      • Set a different theme with theme_set() for your entire script
      • Use theme_gray(), but customize certain theme elements
    • All figures that contain mapped colors and/or fills must use a non-default palette, and all figures with manual colors must be colorblind friendly. You can check they are colorblind friendly with the colorblindr function cvd_grid(), as we have seen in class. Furthermore, each figure with a palette should have a DIFFERENT PALETTE! This is important so you can practice working with different types of palettes.
    • All plots must be visible and fully legible at a reasonable size, and the figure should not appear “stretched” or “squished.” By default, plots will be sized 4 inches tall and 6 inches wide. If you want to change it, change it for the individual chunk, which may require some trial and error to get it sized properly.
    • Ensure customized and professional axis labels. There should never be underscores, for example, in an axis label. You do not need to include a title/subtitle/caption for any figure, but you may if you want to.
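
As a rough illustration of the rules above, here is a minimal sketch using the built-in iris dataset (not one of the HW datasets). The variable names iris_scatter and iris_boxplot, and the specific theme and palette choices, are just assumptions for the example; any non-default theme and any colorblind-friendly palettes would do.

```r
library(ggplot2)

# One option: set a non-default theme once for the entire script
theme_set(theme_classic())

# Aesthetics inside ggplot() are inherited by every geom in the plot
iris_scatter <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2") +    # non-default palette for the mapped color
  labs(x = "Sepal width (cm)", y = "Sepal length (cm)", color = "Species")

# Aesthetics inside the geom apply to that geom only
iris_boxplot <- ggplot(iris) +
  geom_boxplot(aes(x = Species, y = Petal.Length, fill = Species)) +
  scale_fill_viridis_d(option = "plasma") +  # a different palette from the first plot
  labs(x = "Species", y = "Petal length (cm)", fill = "Species")

# Reveal each plot by printing its variable
iris_scatter
iris_boxplot

# Optional check, assuming the colorblindr package is installed:
# colorblindr::cvd_grid(iris_scatter)
```

If a plot looks stretched or squished in the knitted document, the fig.width and fig.height chunk options (in inches) can be adjusted for that individual chunk.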

Datasets to explore

You must choose THREE and ONLY THREE datasets, and ask two questions about each. Datasets are available in the ds4b.materials library.

Dataset 1: pima

This dataset contains physical measurements from Pima Indian women from the American southwest. This population has been heavily studied by epidemiologists since they tend to have high levels of Type II Diabetes. Variables include:

  • npreg : number of times the woman has been pregnant
  • glucose : plasma glucose concentration at 2 hours in an oral glucose tolerance test (units: mg/dL)
  • dbp : diastolic blood pressure (units: mm Hg)
  • skin : triceps skin-fold thickness (units: mm)
  • insulin : 2-hour serum insulin level (units: μU/mL)
  • bmi : Body Mass Index
  • age : age in years
  • diabetic : whether or not the individual has diabetes


Dataset 2: urine

This dataset contains urinalysis measurements (don’t worry about units) from 78 men, indicating whether traces of kidney stones (aka “crystal”) were detected in their urine samples. Variables include:

  • crystal : Whether calcium oxalate crystals (kidney stones) were detected. “No” means there are no kidney stones, and “Yes” means there are.
  • gravity : The specific gravity of the urine
  • pH : The pH of the urine
  • osmo : The osmolarity of the urine
  • conduct : The conductivity of the urine
  • urea : The urea concentration


Dataset 3: damselfly

This dataset details physical measurements from collected samples of the damselfly, Ischnura ramburii, from the Hawaiian Islands Oahu, Kauai, and Hawaii. Researchers originally collected this data to study these damselflies’ unique color patterns: all males are blue-green, while some females are orange and some are blue-green like the males. The orange females are referred to as “gynomorphs” (female-like morphs) and the blue-green females are referred to as “andromorphs” (male-like morphs). The dataset contains the following variables:

  • Island, the Hawaiian island from which the individual damselfly was collected
  • Sex, the damselfly’s sex
  • Morphology, the damselfly’s morphology (color pattern, as described above - “gyno” for gynomorph, and “andro” for andromorph)
  • Wing.size, wing size in pixels
  • Mating.status, whether or not the damselfly was a member of a mated pair
  • Abdomen.length, abdomen length in mm


Dataset 4: biopsy

This dataset contains measurements obtained from breast cancer biopsy samples at the University of Wisconsin Hospitals, Madison. Each of nine tumor attributes was scored on a scale of 1-10, and the outcome column indicates whether the tumor was benign or malignant.


Dataset 5: wine

This (familiar!) dataset contains information from a chemical analysis of three different cultivars (A, B, and C) of wine, including alcohol percentage and amounts of different chemical components. Variables include:

  • Cultivar: The wine cultivar (A, B, or C)
  • Alcohol: The alcohol percentage of the wine
  • MalicAcid: The percentage of the wine that is malic acid
  • Ash: The percentage of the wine that is ash
  • Magnesium: The percentage of the wine that is magnesium
  • TotalPhenol: The percentage of the wine that is phenols
  • Flavonoids: The percentage of the wine that is flavonoids
  • NonflavPhenols: The percentage of the wine that is non-flavonoid phenols
  • Color: The color intensity of the wine, measured numerically