Homework 11: Model Selection and Validation (100 points)

BIO5312

Due 11/14/17 by 5:30 pm

Enter your name here


Consider \(\alpha=0.05\) as significant throughout this assignment.

Part One


This section uses the dataset bodyfat.csv, which contains various physical measurements from ~250 adult men. Variables in the dataset include the following:


Question 1 (15 points). Use the step() function to construct a linear model that predicts body fat percentage in men (we will call this the “step-wise model”).

Be sure to use the argument trace=0 when running step, i.e.: step(lm(Y~X, data=data), trace=0). This will reduce the amount of unnecessary output. You will lose points if you don’t include this argument.
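As a minimal sketch of the call pattern described above, the following runs step() on R's built-in mtcars data, which is not the homework dataset. The response (mpg) and the choice to start from a model with every predictor are assumptions made only for illustration.

```r
# Illustration on the built-in mtcars data (NOT bodyfat.csv).
# Assumption: start from the full model with every available predictor.
full_model <- lm(mpg ~ ., data = mtcars)

# trace = 0 suppresses the per-step log that step() prints by default.
step_model <- step(full_model, trace = 0)

# Inspect which predictors were retained and their coefficient table.
formula(step_model)
summary(step_model)
```

With no scope argument, step() performs backward elimination by AIC, so the returned model is an ordinary lm object that can be passed to summary(), predict(), and so on.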

Then, provide an answer with the following components:

### Code goes here


Answer goes here.



Question 2 (15 points). Perform a likelihood ratio test (LRT) between two body fat percentage models: the step-wise model determined in Question 1, and a second model with all the same predictors plus an interaction between the two most significant predictors in the step-wise model.

For example, consider the iris model lm(Petal.Length ~ Sepal.Width + Sepal.Length + Species, data = iris). To add an interaction between Sepal.Width and Sepal.Length, we would write: lm(Petal.Length ~ Sepal.Width + Sepal.Length + Species + Sepal.Width:Sepal.Length, data = iris).
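The iris comparison above can be sketched as a likelihood ratio test computed by hand from logLik(). This is one possible base R approach, not necessarily the one expected in the course; packages such as lmtest provide an lrtest() function that wraps the same calculation. The variable names (reduced, with_interaction, and so on) are illustrative.

```r
# Two nested models on the built-in iris data (NOT the homework dataset).
reduced <- lm(Petal.Length ~ Sepal.Width + Sepal.Length + Species, data = iris)
with_interaction <- lm(
  Petal.Length ~ Sepal.Width + Sepal.Length + Species + Sepal.Width:Sepal.Length,
  data = iris
)

# Log-likelihoods; the "df" attribute counts estimated parameters.
ll_reduced <- logLik(reduced)
ll_full <- logLik(with_interaction)

# LRT statistic: twice the gain in log-likelihood from the extra term.
lrt_stat <- 2 * (as.numeric(ll_full) - as.numeric(ll_reduced))
lrt_df <- attr(ll_full, "df") - attr(ll_reduced, "df")

# Under the null hypothesis the statistic is approximately chi-squared.
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

c(statistic = lrt_stat, df = lrt_df, p.value = p_value)
```

A p-value below the chosen significance level suggests that the model with the interaction term fits significantly better than the model without it.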

Then, provide an answer with the following components:


### Code goes here


Question 3 (20 points). Using only the predictors in the step-wise model (not the model from Question 2), perform k-fold cross-validation with K=10 to predict body fat percentage. Make sure to set your seed first! For each trained model, calculate RMSE for both the respective training and testing data. Visualize the final RMSE distributions (there will be 20 values, 10 from test data and 10 from training data) as boxplots, in a single call to ggplot(). Based on your results, explain how robust the model is.
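The procedure described above can be sketched in base R as follows, again on the built-in mtcars data rather than the homework dataset. The seed value, the model formula (mpg ~ wt + hp), and the helper name rmse are arbitrary choices for illustration, and the course supplement may use a different (for example, tidyverse-based) strategy.

```r
set.seed(42)  # arbitrary seed so the fold assignment is reproducible

k <- 10
# Randomly assign each row to one of k folds of near-equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Hypothetical helper: root mean squared error of a model on a data frame.
rmse <- function(model, data) {
  sqrt(mean((data$mpg - predict(model, newdata = data))^2))
}

# For each fold: train on the other k - 1 folds, evaluate on both sets.
cv_results <- do.call(rbind, lapply(seq_len(k), function(i) {
  train <- mtcars[fold_id != i, ]
  test <- mtcars[fold_id == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  data.frame(
    fold = i,
    dataset = c("training", "testing"),
    rmse = c(rmse(fit, train), rmse(fit, test))
  )
}))

# One ggplot() call showing both RMSE distributions as boxplots.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(cv_results, aes(x = dataset, y = rmse, fill = dataset)) +
    geom_boxplot())
}
```

If the testing RMSE values are much larger or far more variable than the training values, that is a sign the model may be overfitting.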


### Code goes here


Answer goes here.



Part Two

This section uses the dataset mammogram.csv, which contains mammogram results and final diagnoses for 831 women. Variables in the dataset include the following:


Question 1 (25 points). Construct a logistic regression, using the full mammogram dataset, to predict breast cancer malignancy. Once the model is made, make two figures to accompany it: an ROC curve and a plot of the fitted logistic curve (for the latter plot, show colored points). Then, provide an explanation that includes the following components:
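As a rough sketch of the workflow described above, the following fits a logistic regression on the built-in mtcars data (not the mammogram dataset) and computes the ROC points and AUC by hand in base R. The outcome (vs), the single predictor (mpg), and all variable names are assumptions for illustration; dedicated packages such as pROC are a common alternative for ROC analysis.

```r
# Illustration on mtcars: predict engine shape (vs, coded 0/1) from mpg.
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")  # fitted probabilities
truth <- mtcars$vs

# ROC points: sweep the classification threshold from high to low.
thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))
roc <- data.frame(
  fpr = sapply(thresholds, function(t) mean(prob[truth == 0] >= t)),
  tpr = sapply(thresholds, function(t) mean(prob[truth == 1] >= t))
)

# AUC via the rank-sum (Mann-Whitney) identity.
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(rank(prob)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # ROC curve with the chance diagonal for reference.
  print(ggplot(roc, aes(x = fpr, y = tpr)) +
    geom_path() +
    geom_abline(linetype = "dashed"))
  # Fitted logistic curve, with points colored by the true outcome.
  curve_data <- data.frame(mpg = mtcars$mpg, prob = prob, outcome = factor(truth))
  print(ggplot(curve_data, aes(x = mpg, y = prob)) +
    geom_line() +
    geom_point(aes(color = outcome)))
}
```

With a single predictor, the fitted probabilities trace the logistic curve directly; with several predictors, one option is to plot the probabilities against the linear predictor, predict(fit, type = "link"), instead.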


### Code goes here


Question 2 (25 points). Perform k-fold cross-validation with K=10 to predict breast cancer malignancy, using all predictors. Make sure to set your seed first! For each trained model, calculate AUC for both the respective training and testing data. Visualize the final AUC distributions (there will be 20 values, 10 from test data and 10 from training data) as violin plots, in a single call to ggplot(). Based on your results, explain how robust the model is.
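One way to sketch this procedure in base R is shown below, using a binary outcome derived from the built-in iris data (not the mammogram dataset). The seed, the outcome (virginica vs. not), the two predictors, and the helper name auc are illustrative assumptions. Note that this sketch assigns folds within each outcome class (a stratified variant of k-fold) so that every test fold contains both classes and AUC is always defined.

```r
set.seed(42)  # arbitrary seed so the fold assignment is reproducible

k <- 10
d <- transform(iris, virginica = as.integer(Species == "virginica"))

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
auc <- function(prob, truth) {
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(rank(prob)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Stratified fold assignment: shuffle fold labels within each class.
fold_id <- ave(seq_len(nrow(d)), d$virginica, FUN = function(idx) {
  sample(rep(seq_len(k), length.out = length(idx)))
})

# For each fold: train on the other k - 1 folds, score both sets.
cv_auc <- do.call(rbind, lapply(seq_len(k), function(i) {
  train <- d[fold_id != i, ]
  test <- d[fold_id == i, ]
  fit <- glm(virginica ~ Sepal.Length + Sepal.Width,
             data = train, family = binomial)
  data.frame(
    fold = i,
    dataset = c("training", "testing"),
    auc = c(
      auc(predict(fit, newdata = train, type = "response"), train$virginica),
      auc(predict(fit, newdata = test, type = "response"), test$virginica)
    )
  )
}))

# One ggplot() call showing both AUC distributions as violin plots.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(cv_auc, aes(x = dataset, y = auc, fill = dataset)) +
    geom_violin())
}
```

Training and testing AUC distributions that overlap closely suggest the model generalizes well; a testing distribution that sits well below the training one points toward overfitting.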


### Code goes here


Answer goes here.



OPTIONAL BONUS QUESTION (20 points). Make violin plots of precision and accuracy across the 10 folds, using a tidyverse-oriented strategy to obtain these values. An example of how to perform something similar can be found in the K-folds supplement on the course website. Additionally, report the mean and standard deviation for these quantities.
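For reference, the two quantities named above can be computed from predicted probabilities as in the base R sketch below. It is not the tidyverse-oriented strategy from the course supplement; the helper name classification_metrics, the 0.5 threshold, and the toy inputs are all assumptions made for illustration.

```r
# Hypothetical helper: accuracy and precision at a given threshold.
# Assumption: an observation is called positive when prob >= threshold.
classification_metrics <- function(prob, truth, threshold = 0.5) {
  predicted <- as.integer(prob >= threshold)
  tp <- sum(predicted == 1 & truth == 1)  # true positives
  fp <- sum(predicted == 1 & truth == 0)  # false positives
  tn <- sum(predicted == 0 & truth == 0)  # true negatives
  fn <- sum(predicted == 0 & truth == 1)  # false negatives
  c(
    accuracy = (tp + tn) / (tp + fp + tn + fn),
    precision = tp / (tp + fp)
  )
}

# Toy inputs (made up): five predicted probabilities and true labels.
toy_prob <- c(0.9, 0.8, 0.3, 0.6, 0.2)
toy_truth <- c(1, 0, 0, 1, 1)
metrics <- classification_metrics(toy_prob, toy_truth)
metrics
```

Applying such a helper to the test data of each fold yields 10 values per metric, which can then be summarized with mean() and sd().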


### Code goes here