Enter your name here
A critical component of constructing robust and reliable models is model selection: selecting the “optimal” model among candidate models so that your model achieves the most accurate predictions. Often, model selection is used to determine which predictors “should”, or “should not”, be included in a given model.
For this assignment, you will perform model selection by “backwards elimination,” a common step-wise approach to determining which predictor(s) “should” be used in a model. This approach works as follows:

1. Fit the full model containing all possible predictors: `lm(Y ~ ., data = data)` (note the “.” on the right-hand side of `~`).
2. Examine the coefficient P-values. The `broom::tidy()` function is particularly helpful for quickly determining which P-values are significant, e.g. `model.fit.variable %>% tidy() %>% filter(p.value > alpha) %>% arrange()`.
3. Remove the predictor with the highest non-significant P-value, refit the model, and repeat until all remaining predictors are significant.
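One round of this procedure can be sketched as follows (a minimal sketch, assuming a data frame `data` with response `Y`; the predictor name `x3` is purely hypothetical):

```r
library(broom)
library(dplyr)

alpha <- 0.05

# Fit the full additive model with all predictors
full.fit <- lm(Y ~ ., data = data)

# List the non-significant predictors, largest P-value first
tidy(full.fit) %>%
  filter(p.value > alpha) %>%
  arrange(desc(p.value))

# Suppose `x3` had the largest non-significant P-value:
# remove it and refit, then repeat until all P-values < alpha
reduced.fit <- lm(Y ~ . - x3, data = data)
```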
Importantly, throughout this process, you should consider only additive effects (not interaction effects!!) in your models. Also note that there are many R packages that perform this task in an automated fashion. Do not use them. You must build and evaluate these models yourself.
Throughout the assignment, the term “full model” will refer to a model with all possible predictors, and the term “final model” will refer to the model produced after step-wise backwards elimination. Most questions will prompt you to compare the full model with the final model, so be sure to save both to variables so you can re-use them. Use \(\alpha=0.05\) as the significance threshold throughout the assignment.
All questions in this assignment concern the dataset `pima.csv`, which contains data from surveys of Pima Native American women’s health. Studies have shown that Pima women have increased incidence of Type II Diabetes relative to the general population. To identify possible underlying factors of diabetes in this population, researchers took various measurements of women with and without diabetes. Variables in the dataset include the following:
- `npregnancy`, the number of pregnancies the individual has had
- `glucose`, the plasma glucose concentration after 2 hours in an oral glucose tolerance test, in mg/dL
- `bp`, diastolic blood pressure, in mmHg
- `skin.thickness`, triceps skin fold thickness, in mm
- `insulin`, 2-hour serum insulin, in mu U/ml
- `BMI`, body mass index, as weight in kg/(height in m)^2
- `age`, age in years
- `diabetic`, Yes or No

Use backwards elimination to construct a linear model that predicts BMI in Pima women, and answer the subsequent questions. You must show all steps along the way, specifically these:

- The `broom::tidy()` output from each model, and the `broom::glance()` output for the full and final models (this output will be useful for question 3 below).
- No `summary(lm(..))` output. Stick to `broom` functions!

### Code to perform step-wise model selection goes here
1. (10 points) Provide a full interpretation of the full model, including interpretations for all coefficients and for \(R^2\). For any non-significant coefficients, explain what they would mean if they were significant.
Answer goes here.
2. (10 points) Provide a full interpretation of the final model, including interpretations for all coefficients and for the final model’s \(R^2\).
Answer goes here.
3. (5 points) Compare the adjusted \(R^2\) from the full model to the adjusted \(R^2\) from the final model. Specifically, based on these values, which model (if either) do you think has the most predictive power? What does this result tell you about the effect of non-significant predictors on \(R^2\)?
Answer goes here.
4. (10 points) Predict (with a 95% confidence interval) the BMI using both the full model and the final model (i.e., make two separate predictions) for an individual with the following characteristics. For each prediction, be sure to include only the relevant predictors in the data frame you make to run the prediction.
### Code to predict BMI for each model goes here
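A sketch of one such prediction (assuming the full model is saved as `full.fit`; the predictor values below are placeholders, not the individual’s actual characteristics):

```r
# Build a one-row data frame containing only the predictors the model
# uses (placeholder values shown; substitute the given characteristics)
new.obs <- data.frame(npregnancy = 3, glucose = 120, bp = 70,
                      skin.thickness = 30, insulin = 100,
                      age = 35, diabetic = "Yes")

# 95% confidence interval for the predicted BMI
predict(full.fit, newdata = new.obs, interval = "confidence", level = 0.95)
```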
5. (5 points) And now the reveal: The true BMI for this individual is 43.1. Which model gave the best prediction, if either? Based on your results, do you think that stepwise backwards elimination produced a “better” model, from the full to the final?
Answer goes here.
Use backwards elimination to construct a logistic regression model that predicts Diabetic status in Pima women, and answer the subsequent questions. You must show all steps along the way, specifically these:

- The `broom::tidy()` output from each model (there is nothing relevant out of `broom::glance()` with regards to this homework). Again, no `summary()`!

Further note: the `glm()` function will require that the response variable `diabetic` is a factor. Therefore, if you receive an error when running `glm()`, you may have to write this variable in your `glm()` call like this: `glm(as.factor(diabetic) ~ .......)`.
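As a minimal sketch of the full-model fit (assuming the data frame is named `pima`; `family = binomial` is the standard argument for logistic regression):

```r
library(broom)

# Full logistic model: diabetic status predicted from all other variables
full.glm <- glm(as.factor(diabetic) ~ ., family = binomial, data = pima)

# Coefficient table for the backwards-elimination steps
tidy(full.glm)
```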
### Code to perform step-wise model selection goes here
1. (5 points) Plot the logistic curve from the full model. Include points on the curve colored by diabetic status. In order to fully see all the points, you may wish to specify an alpha
(transparency level) to the points.
### Code to plot full model logistic curve goes here
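One possible sketch of such a plot (assuming the fitted full model is saved as `full.glm` and the data as `pima`):

```r
library(ggplot2)

curve.df <- data.frame(
  lp   = predict(full.glm, type = "link"),      # linear predictor (x-axis)
  prob = predict(full.glm, type = "response"),  # fitted probability (y-axis)
  diabetic = pima$diabetic
)

ggplot(curve.df, aes(x = lp, y = prob)) +
  geom_line() +
  geom_point(aes(color = diabetic), alpha = 0.5) +  # alpha = transparency
  labs(x = "Linear predictor", y = "P(diabetic)")
```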
2. (5 points) Plot the logistic curve from the final model. Include points on the curve colored by diabetic status. In order to fully see all the points, you may wish to specify an alpha
(transparency level) to the points.
### Code to plot final model logistic curve goes here
3. (10 points) Now, you will visualize the same data a bit differently: Plot overlaid density plots for the linear predictors (these are the values on the X-axis of the logistic curve) of each model, where densities are colored by diabetic status. Make this a faceted plot, where one facet shows densities for the full model and the other shows densities for the final model. Hint: To create a faceted plot, all data must be in the same data frame.
### Code to plot faceted densities goes here
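Assembling the combined data frame might be sketched like this (assuming models saved as `full.glm` and `final.glm`, and data as `pima`):

```r
library(ggplot2)

# Stack linear predictors from both models so they can be faceted
dens.df <- rbind(
  data.frame(lp = predict(full.glm,  type = "link"),
             diabetic = pima$diabetic, model = "Full model"),
  data.frame(lp = predict(final.glm, type = "link"),
             diabetic = pima$diabetic, model = "Final model")
)

ggplot(dens.df, aes(x = lp, fill = diabetic)) +
  geom_density(alpha = 0.5) +   # overlaid, semi-transparent densities
  facet_wrap(~ model) +
  labs(x = "Linear predictor")
```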
4. (5 points) Based on the figures above (the two logistic curves and the density plots), do you think that either the full or final model did a “better job” separating the Diabetics? Why or why not?
Answer goes here.
6. (5 points) And now the reveal: In truth, this individual does not have Diabetes. Which model gave the best prediction, if either? Based on your results, do you think that stepwise backwards elimination produced a “better” model, from the full to the final?
Answer goes here.