Nonparametric tests and tidyr
Lecture
R Lesson
Guided exercises
The following is based on Chapter 12, “R for Data Science”.
We have several tables (built-in data in tidyr
) containing the same information:
table1
table2
table3
table4a
andtable4b
table5
Of these, only table1
is tidy: Each column is a variable and each row is an observation.
relative_rates
> table 1
# A tibble: 6 x 4
country year cases population
* <chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
We will now proceed to convert the remaining tables into tidy tables.
Tidying table2
The tibble table2
is not tidy because observations are spread across multiple rows. This calls for the verb spread()
.
> table2
# A tibble: 12 x 4
country year type count
* <chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
> table2 %>% spread(type, count)
# A tibble: 6 x 4
country year cases population
* <chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Tidying table3
The tibble table3
is not because there are two variables contained in the column rate
. This calls for the verb separate()
.
> table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
> table3 %>% separate(rate, into=c("cases", "population"), sep = "/")
# A tibble: 6 x 4
country year cases population
* <chr> <int> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
## Notice how cases and population are **characters**? We should force the tidyverse to re-evaluate
> table3 %>% separate(rate, into=c("cases", "population"), sep = "/", convert = TRUE)
# A tibble: 6 x 4
country year cases population
* <chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Tidying table4a, table4b
The tibbles table4a
and table4b
together contain all information in table1
. Each of these is not tidy because it misuses values as variables, which calls for the verb gather()
. Further, to recreate table1
, we must use the dplyr
function left_join()
.
> table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
### Note the back-ticks to call columns that are **numbers**
> table4a %>% gather(year, cases, `1999`:`2000`) -> tidy4a
> tidy4a
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
> table4b
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
### Note the back-ticks to call columns that are **numbers**
> table4b %>% gather(year, population, `1999`:`2000`) -> tidy4b
# A tibble: 6 x 3
country year population
<chr> <chr> <int>
1 Afghanistan 1999 19987071
2 Brazil 1999 172006362
3 China 1999 1272915272
4 Afghanistan 2000 20595360
5 Brazil 2000 174504898
6 China 2000 1280428583
### Finally, join together
> left_join(tidy4a, tidy4b)
# A tibble: 6 x 4
country year cases population
<chr> <chr> <int> <int>
1 Afghanistan 1999 745 19987071
2 Brazil 1999 37737 172006362
3 China 1999 212258 1272915272
4 Afghanistan 2000 2666 20595360
5 Brazil 2000 80488 174504898
6 China 2000 213766 1280428583
Tidying table5
The tibble table5
is not tidy because two variables are split into two columns (century and year). This calls for the verb unite()
. We also have a similar situation as table3
with the rate
column.
>table5
># A tibble: 6 x 4
country century year rate
* <chr> <chr> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
> table5 %>% unite(Year, century, year, sep="")
country Year rate
* <chr> <chr> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
### Finish it up as in table3
> table5 %>%
unite(Year, century, year, sep="") %>%
separate(rate, into=c("cases", "population"), sep = "/", convert=TRUE)
# A tibble: 6 x 4
country Year cases population
* <chr> <chr> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583