Obtaining and setting up the homework

Questions

Caution: This homework is VERY short, which means each question is worth a lot of points. This is both a blessing and a curse. Plan accordingly.

For your assignment, you will be building and evaluating a logistic regression model from a dataset of physical measurements taken from 752 adult Pima Native American women, some of whom have Type II Diabetes and some of whom do not. Your goal for this assignment is to build and evaluate a model from this data to predict whether an individual has Type II Diabetes (column diabetic). Other columns in this dataset include:

    • npreg
    • glucose
    • dbp
    • skin
    • insulin
    • bmi
    • age

  1. Prepare the data for modeling: The glm() function that we use to perform a logistic regression in R really wants the response variable to contain the values 0 and 1, where 0 corresponds to so-called “failure” and 1 corresponds to so-called “success.” In addition, glm() gets very upset when there are NAs anywhere in the data. Therefore, for this question, create a new version of the pima tibble that contains no NAs and in which the column diabetic contains 0 instead of “No” and 1 instead of “Yes” (we consider not diabetic as “failure” and diabetic as “success”). Save the tibble as pima_01, and then print it out. This is the tibble you will use for the rest of the homework!
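A minimal sketch of this step, assuming the raw data lives in a tibble called `pima` with a character column `diabetic` containing “Yes”/“No”:

```r
library(dplyr)
library(tidyr)

pima_01 <- pima %>%
  drop_na() %>%                                        # remove any rows containing NAs
  mutate(diabetic = if_else(diabetic == "Yes", 1, 0))  # 1 = "success", 0 = "failure"

pima_01
```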


  1. Build two candidate models: Rather than using model selection for this homework, you will practice logistic regression by comparing two potential candidate models. (There will not be any training/testing on this HW, but please remember that in the “real world” one should always build a model with a validation procedure!) For this question, build these two models and calculate their AUCs. In one markdown sentence, explain which model is preferred and why.

    • Candidate model 1 should have the following predictors:
      • glucose
      • dbp
      • skin
    • Candidate model 2 should have the following predictors:
      • insulin
      • skin
      • age


  1. Plot the ROC curves for both candidate models. Create a tibble of the data needed to plot both ROC curves (one for each candidate model), and then use that data to make a figure with two ROC curves. Curves should be colored by candidate model with a non-default palette, and the plot should include a guiding line representing “random chance.” Do not facet - curves should be in a single panel, distinguished only by color, as was shown in class.
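A sketch of one approach: pROC roc objects store sensitivities and specificities, which can be bound into a single tibble for plotting. This assumes the fitted models are named `fit1` and `fit2`.

```r
library(pROC)
library(dplyr)
library(ggplot2)

roc1 <- roc(pima_01$diabetic, fitted(fit1))
roc2 <- roc(pima_01$diabetic, fitted(fit2))

# One tibble containing both curves, labeled by candidate model
roc_data <- bind_rows(
  tibble(fpr = 1 - roc1$specificities, tpr = roc1$sensitivities, model = "Candidate 1"),
  tibble(fpr = 1 - roc2$specificities, tpr = roc2$sensitivities, model = "Candidate 2"))

ggplot(roc_data, aes(x = fpr, y = tpr, color = model)) +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # "random chance" line
  scale_color_brewer(palette = "Dark2")                         # a non-default palette
```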



  1. Plot the logistic curve for the preferred model. Now that you know which model is preferred (either Candidate 1 or Candidate 2!), visualize that model’s logistic curve. Create a tibble of the data needed to plot the logistic curve (as done in class!), and then create your plot, which should have…
    • A black logistic curve.
    • Indications of the true diabetic status, EITHER as…
      • Transparent colored points based on diabetic status along the curve. Must be transparent enough to see the curve!
      • A colored point rug.
      • Either way, use non-default colors.
    • Ensure that your legend says “Diabetic”/“Not diabetic” instead of “1” and “0”. This will be part of your initial tibble creation with a little wrangling.
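A sketch of the rug-based option, assuming candidate model 1 (`fit1`) turned out to be preferred; swap in your own preferred model, and note that the class may have built the curve tibble differently. Here the fitted probabilities are plotted in rank order, with a colored rug showing true status and recoded legend labels.

```r
library(dplyr)
library(ggplot2)

curve_data <- tibble(
  probability = fitted(fit1),
  diabetic    = if_else(pima_01$diabetic == 1, "Diabetic", "Not diabetic")) %>%
  arrange(probability) %>%
  mutate(rank = row_number())

ggplot(curve_data, aes(x = rank, y = probability)) +
  geom_line(color = "black") +                    # the black logistic curve
  geom_rug(aes(color = diabetic), sides = "b") +  # colored point rug along the bottom
  scale_color_manual(values = c("Diabetic" = "firebrick",       # non-default colors
                                "Not diabetic" = "steelblue"))
```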



  1. Model metrics: Again considering only the preferred candidate model, determine the following five performance measures for this model using a success threshold of >= 0.75. Specifically…
    • First, you will need to create the tibble that contains counts for TP, FP, TN, and FN at the 75% threshold. This tibble should have two columns and four rows (or if pivoted for some bonus points, four columns and 1 row - either is ok!). Save this tibble to pima_confusion and print it out.
    • Then, use information in your new pima_confusion tibble to calculate and print the five measures below.
      • For some bonus points, incorporate a tidyr pivot function into your code and perform all calculations using dplyr strategies. For regular full-credit without bonus, you are welcome to “use R as a calculator” using values seen in pima_confusion. But remember: No code, no credit. No arithmetic in your head.
    • You don’t have to write anything else in markdown, but ensure these calculated values are printed out, and your code contains COMMENTS that clearly state which calculation is which.
      • Accuracy
      • False discovery rate
      • True positive rate (aka sensitivity aka recall)
      • False positive rate
      • Positive predictive value (aka precision)
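A sketch of the bonus-style calculation, assuming `pima_confusion` has already been pivoted to one row with columns TP, FP, TN, and FN (the column names are illustrative):

```r
library(dplyr)

pima_confusion %>%
  summarize(
    accuracy = (TP + TN) / (TP + FP + TN + FN),  # accuracy
    fdr      = FP / (TP + FP),                   # false discovery rate
    tpr      = TP / (TP + FN),                   # true positive rate (sensitivity/recall)
    fpr      = FP / (FP + TN),                   # false positive rate
    ppv      = TP / (TP + FP))                   # positive predictive value (precision)
```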



  1. Prediction: A new Pima woman has arrived! Use the preferred model to predict whether she is diabetic. Write code and then answer in a brief markdown sentence: What is the probability according to our model that she has diabetes, and at a 75% success threshold, would this model classify her as Diabetic or Not Diabetic?

    Hint: Do you need all this information? In fact you do not! For full credit, make sure your code is only considering predictors that are relevant for answering the question. In other words, your code should not use any data the model doesn’t need.

    • npreg: 4
    • glucose: 127
    • dbp: 92
    • skin: 28
    • insulin: 160
    • bmi: 31
    • age: 44
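A sketch of the prediction step, assuming candidate model 1 (glucose, dbp, skin) is the preferred model; if candidate 2 is preferred, use its predictors instead. Either way, the new tibble should include only the predictors the preferred model actually uses.

```r
# Only the predictors the (assumed) preferred model needs
new_woman <- tibble::tibble(glucose = 127, dbp = 92, skin = 28)

prob <- predict(fit1, newdata = new_woman, type = "response")
prob          # predicted probability that she has diabetes
prob >= 0.75  # TRUE -> classify as Diabetic at the 75% success threshold
```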