Probability Distributions and Introduction to Statistical Inference

Lecture

Slides

R content

There is no separate R lab for today’s lecture. Instead, R code is (mostly) integrated into the slides. See below for an overview of all code presented in the slides.

R distribution functions

Function Template	Purpose	Binomial	Normal
`dxxx()`	Returns the density (height of PMF/PDF)	`dbinom(k, n, p)`	`dnorm(x, mean, sd)`
`pxxx()`	Returns cumulative probability, P(X<=x)	`pbinom(k, n, p)`	`pnorm(q, mean, sd)`
`qxxx()`	Returns the quantile (x-coordinate) for a given cumulative (CDF) probability	`qbinom(cum_prob, n, p)`	`qnorm(p, mean, sd)`
`rxxx()`	Generates N random numbers from the distribution	`rbinom(N, n, p)`	`rnorm(N, mean, sd)`
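
As a quick illustration of the four function families, the sketch below evaluates each one for the binomial X~B(6, 0.7) used later on this page and for a standard normal. The specific inputs (4 successes, a cumulative probability of 0.5, the seed) are chosen here only for illustration.

```r
n <- 6
p <- 0.7

# dxxx(): height of the PMF, here P(X = 4)
dbinom(4, n, p)

# pxxx(): cumulative probability, here P(X <= 4)
pbinom(4, n, p)

# qxxx(): smallest k whose cumulative probability reaches 0.5
qbinom(0.5, n, p)

# rxxx(): 10 random draws (set a seed so the draws are reproducible)
set.seed(1)
rbinom(10, n, p)

# The same pattern for the normal distribution
dnorm(0, mean = 0, sd = 1)      # height of the PDF at x = 0
pnorm(1.96, mean = 0, sd = 1)   # P(X <= 1.96), roughly 0.975
qnorm(0.975, mean = 0, sd = 1)  # roughly 1.96, undoing pnorm()
```

Note that the first argument of a `q` function is a probability, while the first argument of a `d` or `p` function is a value of the random variable.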

Note that all functions for the Normal distribution assume the standard normal (mean 0, standard deviation 1) if the mean and standard deviation arguments are not provided. For example, R will interpret `pnorm(5)` as `pnorm(5, 0, 1)`.
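
The default behavior can be checked directly. The sketch below is only a sanity check; the N(mean = 2, sd = 3) example in the last lines is made up here to show how a non-standard normal relates to the standard one.

```r
# Omitting mean and sd gives the standard normal, N(0, 1)
pnorm(5) == pnorm(5, 0, 1)                      # TRUE
qnorm(0.975) == qnorm(0.975, mean = 0, sd = 1)  # TRUE

# A non-standard normal gives the same answer after standardizing:
# P(X <= 5) for X with mean 2 and sd 3 equals P(Z <= (5 - 2) / 3)
all.equal(pnorm(5, mean = 2, sd = 3), pnorm((5 - 2) / 3))  # TRUE
```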

New ggplot functions

Function	Purpose	Example
`geom_text()`	Add text to a plot	`geom_text(x=xcoordinate, y=ycoordinate, label="text")` (note: each of these can also be mapped as an aesthetic)
`geom_vline()`	Add a vertical line to a plot	`geom_vline(xintercept=3)`
`geom_hline()`	Add a horizontal line to a plot	`geom_hline(yintercept=3)`
`geom_abline()`	Add a line with formula y = ax + b (slope a, intercept b) to a plot	`geom_abline(intercept=3, slope=-2)`
`scale_x_continuous()`	Customize a continuous x-axis	`scale_x_continuous(name = "axis label", breaks=c(...), labels=c(...))`
`scale_y_continuous()`	Customize a continuous y-axis	`scale_y_continuous(name = "axis label", breaks=c(...), labels=c(...))`
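
A minimal sketch combining these functions in one plot is shown below. The data frames `toy` and `note`, the axis labels, and the line styles are all hypothetical and exist only to make the example runnable.

```r
library(ggplot2)

# Hypothetical toy data lying exactly on the line y = 3 - 2x
toy <- data.frame(x = 0:5, y = c(3, 1, -1, -3, -5, -7))

# One-row data frame holding the position and text of a label
note <- data.frame(x = 4, y = 2.5, label = "y = 3 - 2x")

toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # geom_abline() takes `intercept` and `slope`
  geom_abline(intercept = 3, slope = -2) +
  geom_vline(xintercept = 3, linetype = "dashed") +
  geom_hline(yintercept = 0, linetype = "dotted") +
  # x and y are inherited from the plot's aes(); label is mapped here
  geom_text(data = note, aes(label = label)) +
  scale_x_continuous(
    name = "x value",
    breaks = 0:5,
    labels = c("zero", "one", "two", "three", "four", "five")
  ) +
  scale_y_continuous(name = "y value")

toy_plot
```

Supplying the label through its own one-row data frame draws the text once; passing fixed `x`, `y`, and `label` values to `geom_text()` on a plot with many rows would draw the same text once per row.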

Examples

Reading and writing files with readr


library(readr)
library(tibble) ## Necessary for the tibble() function used in the write examples

### Read a csv (comma-separated values) ###
data <- read_csv("file.csv")

### Read a tsv (tab-separated values) ###
data <- read_tsv("file.tsv")

### Write a csv ###
example <- tibble(x = 1:10, y = 2:11)
write_csv(example, "file.csv")

### Write a tsv ###
example <- tibble(x = 1:10, y = 2:11)
write_tsv(example, "file.tsv")
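
To check that writing and reading are consistent, a file can be written and immediately read back. The sketch below uses a temporary file path so nothing in the working directory is overwritten; the names `path` and `roundtrip` are introduced here for illustration.

```r
library(readr)
library(tibble)

example <- tibble(x = 1:10, y = 2:11)

# tempfile() gives a throwaway path for this session
path <- tempfile(fileext = ".csv")
write_csv(example, path)

# col_types = "ii" asks readr to parse both columns as integers
roundtrip <- read_csv(path, col_types = "ii")

# Both comparisons should be TRUE if the round trip preserved the data
identical(roundtrip$x, example$x)
identical(roundtrip$y, example$y)
```

Without `col_types`, `read_csv()` guesses the column types from the file, and whole numbers are typically read back as doubles rather than integers.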

Drawing random sample of rows with dplyr


library(dplyr) ## Necessary for sample_n(), sample_frac(), and the %>% pipe

## Sample 50 random rows from iris without replacement
iris %>% sample_n(50) -> iris50

## Sample 50 random rows from iris with replacement
iris %>% sample_n(50, replace=TRUE) -> iris50

## Sample 50% of the rows of iris (75 of 150) without replacement
iris %>% sample_frac(0.5) -> iris75

## Sample 50% of the rows of iris (75 of 150) with replacement
iris %>% sample_frac(0.5, replace=TRUE) -> iris75
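
Because the rows are drawn at random, each run gives a different sample unless the random number generator is seeded first. The sketch below assumes an arbitrary seed (123) and uses illustrative object names to confirm the sample sizes.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, so the same rows are drawn on every run

iris50 <- iris %>% sample_n(50)
nrow(iris50)     # 50

iris_half <- iris %>% sample_frac(0.5)
nrow(iris_half)  # 75, since iris has 150 rows

# With replacement, the same row of iris can be picked more than once,
# so the sample may be as large as (or larger than) the original data
iris_boot <- iris %>% sample_n(150, replace = TRUE)
nrow(iris_boot)  # 150
```

In dplyr 1.0.0 and later, `sample_n()` and `sample_frac()` still work but are superseded by `slice_sample(n = ...)` and `slice_sample(prop = ...)`.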

Visualize PMF using example X~B(6, 0.7) (i.e., n=6, p=0.7)


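library(ggplot2) ## Necessary for plotting
library(tibble) ## Necessary for the tibble() function in step one
library(purrr) ## Necessary for the map_dbl() function in step one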

## Step One: Generate the data frame of probabilities for all k in 0:n
n <- 6
p <- 0.7
data.pmf <- tibble(k = 0:n, prob = map_dbl(0:n, dbinom, n, p))

## Step Two: make a bar plot with stat="identity" to make bar heights equal to their value in the data frame
ggplot(data.pmf, aes(x = k, y = prob)) + 
	geom_bar(stat = "identity") + 
	scale_x_continuous(name = "Successes", breaks=0:n) + 
	ylab("Probability") + 
	ggtitle("PMF for X~B(6, 0.7)")
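
Before plotting, it can help to confirm that the probabilities behave like a PMF. The sketch below recomputes them from the binomial formula P(X = k) = choose(n, k) p^k (1 - p)^(n - k) and checks that they sum to 1; the names `prob` and `by_hand` are introduced here for illustration.

```r
n <- 6
p <- 0.7
k <- 0:n

# dbinom() is vectorized over k, so this matches map_dbl(0:n, dbinom, n, p)
prob <- dbinom(k, n, p)

# The same probabilities computed directly from the binomial formula
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)

all.equal(prob, by_hand)  # TRUE
sum(prob)                 # 1, up to floating-point rounding
k[which.max(prob)]        # 4, the most likely number of successes
```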

Visualize CDF using example X~B(6, 0.7)


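### Note that this plotting framework is generally applicable to both discrete and continuous CDFs. ###
### stat_ecdf() draws the empirical CDF of the sample, which approximates the true CDF when the sample is large. ###

## Step One: Generate random numbers from the distribution of interest. Here, we generate 5000 draws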
data.cdf <- tibble(x = rbinom(5000, 6, 0.7))

## Step Two: plot with stat_ecdf()
ggplot(data.cdf, aes(x = x)) + 
	stat_ecdf() + 
	xlab("Successes") + 
	ylab("Cumulative Probability") + 
	ggtitle("CDF for X~B(6, 0.7)")
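
Since the plot is built from simulated draws, it is an approximation of the true CDF. The sketch below compares the empirical CDF of 5000 draws with the exact values from `pbinom()`; the seed and the object names are arbitrary choices made here for illustration.

```r
set.seed(42)  # arbitrary seed, only so the result is reproducible
draws <- rbinom(5000, 6, 0.7)

# ecdf() returns a function giving the proportion of draws <= k
empirical <- ecdf(draws)(0:6)

# pbinom() gives the exact P(X <= k)
exact <- pbinom(0:6, 6, 0.7)

# Side-by-side comparison; the two columns should be close
round(cbind(k = 0:6, empirical, exact), 3)

# Largest gap between the simulated and exact CDF
max(abs(empirical - exact))
```

With more draws the gap shrinks, so a larger sample gives a smoother and more accurate plot at the cost of more computation.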

Visualize PDF using example X~N(2, 1.5) (i.e. mean=2, variance=1.5)


## Step One: Create a data frame giving the x-axis range. mean +/- 3 sd is usually ok
## Here sd = sqrt(1.5), about 1.22, so 2 +/- 4 covers roughly 3.3 sd on each side
range.pdf <- tibble(x = c(2 - 4, 2 + 4))

## Step Two: plot with stat_function
ggplot(range.pdf, aes(x = x)) + 
	stat_function(fun = dnorm, args = list(mean = 2, sd=sqrt(1.5) )) + 
	xlab("X") + 
	ylab("Density") + 
	ggtitle("PDF for X~N(2, 1.5)")
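
For a continuous distribution, probabilities correspond to areas under the PDF rather than to its height. The sketch below, using names introduced here for illustration, checks how much probability the plotted range 2 - 4 to 2 + 4 captures, once by numerical integration of `dnorm()` and once from the CDF.

```r
mu <- 2
sigma <- sqrt(1.5)  # dnorm() takes the standard deviation, not the variance

# Area under the PDF across the plotted range, from 2 - 4 to 2 + 4
area <- integrate(dnorm, lower = mu - 4, upper = mu + 4, mean = mu, sd = sigma)
area$value

# The same probability from the CDF
from_cdf <- pnorm(mu + 4, mu, sigma) - pnorm(mu - 4, mu, sigma)
from_cdf
```

Both values should be roughly 0.999, which suggests the chosen x-axis range shows essentially the whole distribution.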