Obtain the homework template from your RStudio Cloud class project by running the following code in the R Console:
library(ds4b.materials) # Load the class library
launch_homework(11) # Launch Homework 11
You must set an RMarkdown theme and code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
Make sure your Rmd knits without errors before submitting. If it does not produce an HTML output, this means it does not knit. DO NOT SKIP THIS STEP! Ensuring code runs without errors is MORE IMPORTANT than writing code in the first place.
As always, you are encouraged to work together and use the class Slack to help each other out, but you must submit YOUR OWN CODE.
Caution: This homework is VERY short, which means each question is worth a lot of points. This is both a blessing and a curse. Plan accordingly.
For your assignment, you will build and evaluate a logistic regression model using a dataset of physical measurements from 752 adult Pima Native American women, some of whom have Type II diabetes and some of whom do not. Your goal is to build and evaluate a model from this data to predict whether an individual has Type II diabetes (column diabetic). Other columns in this dataset include:
npreg: number of times pregnant
glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test (units: mg/dL)
dbp: diastolic blood pressure (units: mm Hg)
skin: triceps skin-fold thickness (units: mm)
insulin: 2-hour serum insulin level (units: μU/mL)
bmi: Body Mass Index
age: age in years
diabetic: whether or not the individual has diabetes (“Yes” or “No”). (This is your “outcome” column!)
The glm() function that we use to perform a logistic regression in R really really wants the response variable to contain values 0 and 1, where 0 corresponds to so-called “failure” and 1 corresponds to so-called “success.” In addition, glm() gets very upset when there are NAs anywhere in the data. Therefore, for this question, create a new version of the pima tibble that contains no NAs and whose diabetic column contains 0 instead of “No” and 1 instead of “Yes” (we consider not diabetic as “failure” and diabetic as “success”). Save the tibble as pima_01, and then print it out. This is the tibble you will use for the rest of the homework!
Build two candidate models: Rather than using model selection for this homework, you will practice logistic regression by comparing two potential candidate models. (There will not be any training/testing on this HW, but please remember that in the “real world” one should always build a model with a validation procedure!) For this question, build these two models and calculate their AUCs. In one markdown sentence, explain which model is preferred and why.
glucose
dbp
skin
insulin
skin
age
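The cleaning and model-comparison steps above can be sketched on toy data. This is only an illustration under stated assumptions: the tibble toy, its columns x1, x2 and outcome, and the helper auc_rank() are hypothetical names, not part of the class materials, and the dplyr and tidyr packages are assumed to be installed. The AUC here is computed with the rank-based (Mann-Whitney) formula in base R; the course may well expect a dedicated AUC function instead.

```r
library(dplyr)
library(tidyr)

# Toy data (NOT the pima data): two hypothetical predictors and a Yes/No outcome
set.seed(1)
toy <- tibble(
  x1 = rnorm(200),
  x2 = rnorm(200),
  outcome = if_else(runif(200) < plogis(1.5 * x1), "Yes", "No")
)
toy$x2[c(3, 17)] <- NA  # inject two missing values so there is something to drop

# Remove every row containing an NA, then recode the outcome to 0/1
toy_01 <- toy %>%
  drop_na() %>%
  mutate(outcome = if_else(outcome == "Yes", 1, 0))

# Two candidate logistic regressions with different predictors
model_a <- glm(outcome ~ x1, data = toy_01, family = binomial)
model_b <- glm(outcome ~ x2, data = toy_01, family = binomial)

# Hypothetical helper: AUC from the ranks of the predicted probabilities.
# It equals the chance that a random 1 gets a higher probability than a random 0.
auc_rank <- function(actual, prob) {
  r <- rank(prob)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# fitted() returns in-sample predicted probabilities for a glm
auc_a <- auc_rank(toy_01$outcome, fitted(model_a))
auc_b <- auc_rank(toy_01$outcome, fitted(model_b))
c(model_a = auc_a, model_b = auc_b)
```

An AUC near 0.5 means the model ranks cases no better than chance, while an AUC near 1 means it separates the two classes almost perfectly, so the candidate with the higher AUC is the one usually preferred.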
pima_confusion and print it out.
pima_confusion tibble to calculate and print the five measures below.
tidyr pivot function into your code and perform all calculations using dplyr strategies. For regular full-credit without bonus, you are welcome to “use R as a calculator” using values seen in pima_confusion. But remember: No code, no credit. No arithmetic in your head.
Prediction: A new Pima woman has arrived! Use the preferred model to predict whether she is diabetic. Write code and then answer in a brief markdown sentence: What is the probability according to our model that she has diabetes, and at a 75% success threshold, would this model classify her as Diabetic or Not Diabetic?
Hint: Do you need all this information? In fact you do not! For full credit, make sure your code is only considering predictors that are relevant for answering the question. In other words, your code should not use any data the model doesn’t need.
npreg: 4
glucose: 127
dbp: 92
skin: 28
insulin: 160
bmi: 31
age: 44
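The confusion-table and prediction tasks both come down to thresholding predicted probabilities, which can be sketched on toy data as follows. Everything named here is an assumption for illustration: toy, x1, x2, outcome, toy_confusion, cells, measures and new_obs are hypothetical, the 0.5 cutoff for the confusion table is an arbitrary choice, and accuracy, sensitivity and specificity are just common examples, not necessarily the measures this homework asks for. The dplyr and tidyr packages are assumed to be installed.

```r
library(dplyr)
library(tidyr)

# Toy data (NOT the pima data); x2 is deliberately left out of the model
set.seed(2)
toy <- tibble(
  x1 = rnorm(300),
  x2 = rnorm(300),
  outcome = rbinom(300, 1, plogis(1.2 * x1))
)
model <- glm(outcome ~ x1, data = toy, family = binomial)

# Predicted probability per row, classified at an assumed 0.5 threshold,
# then counted by actual vs. predicted class
toy_confusion <- toy %>%
  mutate(
    prob = predict(model, type = "response"),
    predicted = if_else(prob >= 0.5, 1, 0)
  ) %>%
  count(outcome, predicted)

# Label the four cells and pivot them into a single wide row.
# Note: a cell with zero observations would be absent after count().
cells <- toy_confusion %>%
  mutate(cell = case_when(
    outcome == 1 & predicted == 1 ~ "tp",
    outcome == 0 & predicted == 0 ~ "tn",
    outcome == 0 & predicted == 1 ~ "fp",
    outcome == 1 & predicted == 0 ~ "fn"
  )) %>%
  select(cell, n) %>%
  pivot_wider(names_from = cell, values_from = n)

# Example measures computed in code rather than by hand
measures <- cells %>%
  mutate(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp)
  )
measures

# New observation: supply only the predictor the model actually uses
new_obs <- tibble(x1 = 0.8)
new_prob <- unname(predict(model, newdata = new_obs, type = "response"))
new_class <- if_else(new_prob >= 0.75, "Diabetic", "Not Diabetic")
new_prob
new_class
```

The type = "response" argument matters: without it, predict() returns values on the log-odds scale, which cannot be compared directly with a probability threshold such as 0.75.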