Inclass exercises solutions, Day 9

Follow this link for day 9 materials and notes

ANOVA

Using the built-in dataset PlantGrowth, run an ANOVA to determine if plant weight differs across treatments.

Solution


> head(PlantGrowth)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl

> fit <- aov(weight ~ group, data = PlantGrowth)
> summary(fit)
            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

### For demonstrative purposes, run a Tukey-Kramer test. Not necessary for every ANOVA!
> TukeyHSD(fit)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ group, data = PlantGrowth)

$group
            diff        lwr       upr     p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1  0.865  0.1737839 1.5562161 0.0120064

Our null for the ANOVA is that all plant treatments have the same mean weight. Our alternative hypothesis is that at least one group has a different weight. ANOVA gives F = 4.806 with P=0.0159 which is significant at alpha = 0.05. We reject the null hypothesis and find evidence that at least one plant group has a different mean weight.

A post-doc Tukey test shows that trt1 and trt2 have significantly different weights (P=0.012) but that there is no evidence for weight differences for other group comparisons.

Pearson’s Correlation

Using the dataset wine.csv, run a correlation test to determine the strength and magnitude of the relationship between flavanoid and total phenol content across all wine cultivars.

Solution


> wine <- read.csv("wine.csv")
> head(wine)
  Cultivar Alcohol MalicAcid  Ash Magnesium TotalPhenol Flavanoids
1        1   14.23      1.71 2.43       127        2.80       3.06
2        1   13.20      1.78 2.14       100        2.65       2.76
3        1   13.16      2.36 2.67       101        2.80       3.24
4        1   14.37      1.95 2.50       113        3.85       3.49
5        1   13.24      2.59 2.87       118        2.80       2.69
6        1   14.20      1.76 2.45       112        3.27       3.39
  NonflavPhenols Color
1           0.28  5.64
2           0.26  4.38
3           0.30  5.68
4           0.24  7.80
5           0.39  4.32
6           0.34  6.75

## Check linearity
> ggplot(wine,aes(x = Flavanoids, y = TotalPhenol))+geom_point()

## Run correlation
> cor.test(wine$Flavanoids, wine$TotalPhenol)

	Pearson's product-moment correlation

data:  wine$Flavanoids and wine$TotalPhenol
t = 22.824, df = 176, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8220088 0.8975164
sample estimates:
      cor 
0.8645635

The Pearson correlation between Flavanoids and TotalPhenol is R=0.86, and is significantly different from 0 (P<2.2e-16). Therefore these two quantities have strong and positive linear relationship.

Spearman’s Correlation

Using the ggplot2 dataset diamonds (you can access it like any built-in dataset if ggplot2 is loaded), run a non-parametric correlation test between diamond carat and price, specifically for diamonds with quality “Good”.

Solutions


> head(diamonds)
# A tibble: 6 x 10
  carat       cut color clarity depth table price     x     y     z
1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

#### Subset for correlation
diamonds %>% filter(cut == "Good") -> good

## Check linearity
> ggplot(good,aes(x = price, y = carat))+geom_point()

## Run correlation
> cor.test(good$price, good$carat, method = "spearman")
	Spearman's rank correlation rho

data:  good$price and good$carat
S = 787830000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.9599684

Diamond carat and price are not linearly related, as seen through plotting, so we must use a nonparametric correlation. The Spearman correlation between Flavanoids and TotalPhenol is R=0.96, and is significantly different from 0 (P<2.2e-16). Therefore these two quantities have strong and positive (but not linear!) relationship.

Regression

Using the dataset wine.csv, construct a linear model that can predict the amount of Flavanoids from the amount of non-flavanoid phenols (NonflavPhenols) across all wine cultivars. Then, predict the Flavanoid content for a wine with non-flavanoid phenols of 0.12.



## Check linearity with response on y axis
> ggplot(wine,aes(x = Flavanoids, y = NonflavPhenols))+geom_point()

## Fit model
> fit <- lm(NonflavPhenols ~ Flavanoids, data = wine)
> summary(fit)
Call:
lm(formula = NonflavPhenols ~ Flavanoids, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.29151 -0.06631 -0.01409  0.06859  0.31261 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.497855   0.017897  27.817  < 2e-16 ***
Flavanoids  -0.067020   0.007917  -8.465 9.74e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1052 on 176 degrees of freedom
Multiple R-squared:  0.2893,	Adjusted R-squared:  0.2853 
F-statistic: 71.66 on 1 and 176 DF,  p-value: 9.739e-150.9599684 


### Note the broom package!
> tidy(fit)
         term    estimate   std.error statistic      p.value
1 (Intercept)  0.49785537 0.017897438  27.81713 2.490523e-66
2  Flavanoids -0.06701989 0.007917322  -8.46497 9.739231e-15

> glance(fit)
  r.squared adj.r.squared     sigma statistic      p.value df   logLik
1  0.289336     0.2852981 0.1052129  71.65571 9.739231e-15  2 149.2495
        AIC       BIC deviance df.residual
1 -292.4991 -282.9537 1.948277         176

Check linearity:

Add regression line:

Our model shows a moderate but significant relationship between flavanoids and non-flavanoid phenols. Our model’s $R^2 = 0.289$ (P=9.74e-15) is highly significant and indicates that flavanoid content can explain ~29% of variation in non-flavanoid phenols. Therefore, the model does not explain the majority of variation in non-flavanoid phenols.

The flavanoid coefficient is -0.067 (P=9.74e-15), meaning that for every unit increase in flavanoid content, we expect non-flavanoid phenols to decrease by 0.067 units on average. The intercept is 0.498 (P=2.49e-66), so wine with 0 flavanoid content is expected to contain 0.497 units of non-flavanoid phenols, although this quantity is probably not meaningful in the context of actual wine.