Obtain the homework template from your RStudio Cloud class project by running the following code in the R Console:
library(ds4b.materials) # Load the class library
launch_homework(10) # Launch Homework 10
You must set an RMarkdown theme and code syntax highlighting scheme of your choosing in the YAML front matter. These links will help you:
Make sure your Rmd knits without errors before submitting. If it does not produce an HTML output, this means it does not knit. DO NOT SKIP THIS STEP! Ensuring code runs without errors is MORE IMPORTANT than writing code in the first place.
As always, you are encouraged to work together and use the class Slack to help each other out, but you must submit YOUR OWN CODE.
Caution: This homework is VERY short, which means each question is worth a lot of points. This is both a blessing and a curse. Plan accordingly.
This assignment will use an dataset of various physical measurements from 250 adult males. Your goal for this assignment is to build and evaluate a model from this data to predict body fat percentage (column Percent
) in adult males, and then use this model to predict future outcomes. Age is measured in years, weight in pounds, height in inches, and all other measurements are circumference measured in cm.
id
in the dataset, which is a single identifying number for each male in the study. This is not “measured data”, and we definitely don’t want to end up modeling it. Create a new version of the dataset called bodyfat_noid
that does not contain the id
column. This is a one-line pipeline - don’t overthink! Use the bodyfat_noid
dataset for questions 2-5, but not 6.Perform model selection: Using the step()
function, determine the most appropriate model to explain variation in bodyfat percentage in this data, and then do ONE of the following (pick your favorite!): reveal the full model output with the function summary()
, reveal the full model output with the broom function tidy()
(you’ll need to load broom in the setup chunk!), OR reveal the model formula.
After the code, write a properly-formatted (make sure to knit and check!) bullet point list stating the final predictors. You do not need to include information about the specific coefficient values or P-values - just bullet out what the predictors are. Make sure you have properly formatted bullet points in your knitted output!! The purpose of this part of the question is for you to practice writing markdown bullet points and ensure they knit properly.
Evaluate your model with testing and training splits: Create a testing-training split where the training data comprises a random 65% subset of the total bodyfat
dataset. Build the model using the training data, and then evaluate how the model performs on both the training and testing splits. After the code, provide a brief explanation (1-2 sentences) about whether the model is likely overfit to the training data or not, and whether these results suggest the model will do a good or poor job predicting future outcomes.
Your code must have the following components:
setup
chunk. YOU MUST USE YOUR BANNER ID AS THE NUMBER, which you should have defined in the setup
code chunk!!! If you do not do this properly, there will be unpredictable mismatches between your written answer and the results your code produces when run. You must set your seed, and use it consistently.rsquare()
and rmse()
are part of the modelr
package - you will need to load that package in your setup chunk if it isn’t already loaded!Some data fun: Make either a faceted barplot or a lollipop plot to simply show your \(R^2\) and RMSE results. To do this, you need to create from scratch a properly structured tibble that contains columns you want to plot! I cannot recommend enough the need to draw this out first before proceeding. If you do not draw this out to plan, you will be stuck here forever. Please, draw. Draw the data, draw the plot, just draw!!
To achieve a plot, you will need to create a tibble with columns as follows (get_help("tibble")
). The tibble should end up having four rows and three columns called:
dataset
, which contains values “testing” and “training”metric
, which contains values “RMSE” and “R2” (don’t worry about superscript! “R2” is fine.)value
, which contains the relevant values for each metricThen, you should make a plot showing the value
for each dataset
, where the plot is faceted by metric
and uses separate axis scales for each facet (see the faceting notes from 10/6/21 on the class website!). You should also add a color or fill (depends on what type of plot you make) mapped to dataset
. The figure should be cleanly and professionally visible in the knitted Rmd, but you do not need to export it to a file with ggsave()
.
Prediction: A new man has arrived, and we want to use our model (as built with the training data split!) to predict his body fat percentage. Predict what the body fat percentage will be for a man with the following physical attributes. Your code should reveal the predicted body fat percentage.
Hint: Do you need all this information? In fact you do not! For full credit, make sure your code is only considering body measurements that are relevant for answering the question. In other words, your code should not use any data the model doesn’t need.
bodyfat
(not bodyfat_noid
) dataset to instead have four columns (id
, Percent
, measurement
, value
) as a longer version of the dataset. The measurement
column should be a character column of the different bodily measurements, and value
should contain the corresponding numeric values. Then, use this dataset to make a plot showing the relationship between Percent
and each of the other measurements, specifically only those that were predictors in the model (this means at some point in your wrangling, some data has to be removed). This figure should be faceted scatterplot where you are plotting Percent
against value
and faceting by measurement
. Each panel should contain its own line of best fit, and you may wish to use free panel scales. Display the figure legibly in the knitted Rmd, but you do NOT need to export to a file with ggsave()
. Style the figure as you like as long as labeling is clean! (Hint: This is wildly similar to the last question on HW9!!).