Resampling statistics and multiple testing

Lecture

R content

Broom vignette

Random sampling


### ALWAYS set your random seed
set.seed(1283)

## The base R function sample()
## Usage: sample(x, size, replace = FALSE, prob = NULL)

### Create a random array of 25 random numbers to demonstrate sampling
> x <- runif(25, 1, 10)
> x
 [1] 6.310782 6.495482 1.577173 4.507129 7.754845 4.143462 2.278142 8.749319
 [9] 8.680794 5.366593 5.210611 4.157654 6.626720 4.495312 9.374563 8.202714
[17] 9.211275 1.069301 6.555776 6.028585 4.490465 8.684786 5.733197 4.769792
[25] 5.339571

## Draw 10 random samples from a vector without replacement
> sample(x, 10)
 [1] 6.555776 5.339571 4.490465 4.143462 4.507129 5.733197 8.202714 9.374563
 [9] 6.626720 7.754845
	
	
## Draw 10 random samples from a vector with replacement
> sample(x, 10, replace=T)
 [1] 4.495312 2.278142 6.626720 4.490465 4.157654 4.769792 9.211275 1.069301
 [9] 1.069301 6.310782
### Note that 1.069301 appears twice, by random chance 

## Sampling from dataframes, using built-in dataset BOD
> BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8

## Sample three rows without replacement
> sample_n(BOD, 3)
  Time demand
1    1    8.3
4    4   16.0
3    3   19.0
> BOD %>% sample_n(3)
  Time demand
1    1    8.3
3    3   19.0
6    7   19.8

## Sample three rows with replacement
> BOD %>% sample_n(3, replace=T)
    Time demand
4      4   16.0
5      5   15.6
5.1    5   15.6

## Sample 75% of rows without replacement
> BOD %>% sample_frac(0.75)
  Time demand
3    3   19.0
1    1    8.3
5    5   15.6
4    4   16.0

## Sample 75% of rows without replacement
> BOD %>% sample_frac(0.75, replace=T)
    Time demand
3      3   19.0
1      1    8.3
1.1    1    8.3
2      2   10.3
> BOD %>% sample_frac(0.75, replace=T)
  Time demand
2    2   10.3
3    3   19.0
5    5   15.6
6    7   19.8

Permutation test

This example performs a permutation test between setosa and virginica iris sepal lengths. For clarity, functions are shown as their full names so you know for sure which package they come from.

Ho: Setosa and virginica have the same, on average, sepal lengths. Ha: Setosa and virginica have different, on average, sepal lengths.


### ALWAYS set your random seed
> set.seed(18472)

> library(tidyverse)
> library(broom)
> library(modelr) ## MUST LOAD LAST TO AVOID CONFLICT WITH BROOM FUNCTIONS

### Limit dataset to only have two categories in Species
> iris2 <- iris %>% filter(Species != "versicolor")

### Find the data t statistic
t.test(Sepal.Length~Species, data=iris2)$statistic -> data.t

### Choose a large number of permutations
> N <- 1e4

> iris.perms <- modelr::permute(iris2, N, Sepal.Length)
> iris.ttests <- purrr::map(iris.perms$perm, ~t.test(Sepal.Length~Species, data=.))
> iris.tidied <- purrr::map_df(iris.ttests, broom::tidy, .id = "id")
> iris.t.permuted <- iris.tidied$statistic

## Compute P
> (sum(iris.t.permuted >= abs(data.t)) + sum(iris.t.permuted <= -1*abs(data.t)))/length(iris.t.permuted)
[1] 0

## Examine data for directional conclusion
> iris2 %>% group_by(Species) %>% summarize(mean(Sepal.Length))
    Species `mean(Sepal.Length)`
1    setosa                5.006
2 virginica                6.588

We have a permutated P-value of 0, which means our true P-value is going to be extremely small, and definitively smaller than $\alpha = 0.05$ . We reject the null hypothesis and we conclude that setosa and virginica have different sepal lengths. We further find that virginica have longer sepal lengths than setosa, on average.

Bootstrap

This example compute the bootstrap expectation (i.e., mean) and 95% CI for setosa petal lengths.


### ALWAYS set your random seed
> set.seed(3342)

> library(slipper)

> N=1e4
> iris %>% 
	filter(Species == "setosa") %>%
	slipper(mean(Petal.Length),B=N) %>% 
	filter(type=="bootstrap") %>% 
	summarize(mean = mean(value),
		       ci_low = quantile(value,0.025), 
             ci_high = quantile(value,0.975))
     mean  ci_low ci_high
1 1.462159  1.414    1.51

With 1e4 bootstrap replicates, we estimate that setosa petal lengths having a mean of 1.46 with a 95% CI of [1.414, 1.51].

Multiple testing and corrections


### Use function p.adjust() to correct an array of P-values for multiple testing
#### p.adjust(p-value, method = correction method)

## Default uses the Holm correction
> my.pvalues <-c(0.05, 0.023, 0.001, 0.04, 0.035, 0.0007)
> p.adjust(my.pvalues)
[1] 0.1050 0.0920 0.0050 0.1050 0.1050 0.0042


## Use method argument to specify another method, such as fdr or bonferroni
> p.adjust(my.pvalues, method = "fdr")
[1] 0.050 0.046 0.003 0.048 0.048 0.003
> p.adjust(my.pvalues, method = "bonferroni")
[1] 0.3000 0.1380 0.0060 0.2400 0.2100 0.0042

Functions that perform multiple testing

pairwise.t.test(<response variable>, <grouping variable>) performs all possible t-tests for specified columns
pairwise.wilcox.test(<response variable>, <grouping variable>) performs all possible Mann Whitney U tests for specified columns


## Default
> pairwise.t.test(iris$Petal.Length, iris$Species)
	Pairwise comparisons using t tests with pooled SD 

data:  iris$Petal.Length and iris$Species 

           setosa versicolor
versicolor <2e-16 -         
virginica  <2e-16 <2e-16    

P value adjustment method: holm

## Use FDR
> pairwise.t.test(iris$Petal.Length, iris$Species, p.adj = "fdr")
	Pairwise comparisons using t tests with pooled SD 

data:  iris$Petal.Length and iris$Species 

           setosa versicolor
versicolor <2e-16 -         
virginica  <2e-16 <2e-16    

P value adjustment method: fdr 

## Tidy it up with broom::tidy()
> pairwise.t.test(iris$Petal.Length, iris$Species, p.adj = "fdr") %>% tidy()
      group1     group2      p.value
1 versicolor     setosa 7.881881e-69
2  virginica     setosa 1.231842e-90
4  virginica versicolor 1.810597e-31