forcats::fct_lump_*()
get_help()
docs
The fct_lump_*() functions are part of the {forcats} package, which is part of the {tidyverse}. {forcats} contains several related functions for lumping together categories (levels) in a factor variable to condense them into fewer overall categories. Some of the most useful lumping functions include the following:
| {forcats} function | How it lumps |
|---|---|
| fct_lump_min() | Lumps together all levels which occur fewer than a specified minimum number of times into a single level |
| fct_lump_prop() | Lumps together all levels which make up less than a specified proportion of all values into a single level |
| fct_lump_n() | Lumps together all levels except for the specified n most frequent levels |
These functions will combine all lumped levels into a single new level, “Other,” whose name can be customized. These functions are very useful for working with a factor variable that has a very high number of categories, which may be “overly-busy” to view all at once, especially in a plot.
Lumping factor levels is commonly done to reduce the number of categories shown along an axis when plotting a factor variable with the {ggplot2} library.
To use these functions, you need to either first load the {forcats} library, or always call them with forcats:: notation, for example forcats::fct_lump_n().
# Load the library
library(forcats)
# Or, load the full tidyverse:
library(tidyverse)
# Or, use :: notation
forcats::fct_lump_min()
forcats::fct_lump_prop()
forcats::fct_lump_n()
fct_lump_min(factor_variable,
             the minimum number of times,
             other_level = "Name you want to use if not 'Other'")
fct_lump_prop(factor_variable,
              the minimum proportion of times,
              other_level = "Name you want to use if not 'Other'")
fct_lump_n(factor_variable,
           the number of high-frequency levels NOT to lump,
           other_level = "Name you want to use if not 'Other'")
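As a minimal sketch of these templates in use, the example below applies each function to a small made-up factor. The `animals` vector and the "Rare" label are hypothetical and not part of the msleep data; the sketch assumes {forcats} is installed.

```r
library(forcats)

# Hypothetical toy factor: cat x4, dog x3, bird x2, fish x1
animals <- factor(c("cat", "cat", "cat", "cat",
                    "dog", "dog", "dog",
                    "bird", "bird",
                    "fish"))

# Keep levels that occur at least 3 times; lump the rest into "Other"
by_min <- fct_lump_min(animals, 3)

# Keep levels that make up more than 25% of values; lump the rest into "Rare"
by_prop <- fct_lump_prop(animals, 0.25, other_level = "Rare")

# Keep the 2 most frequent levels; lump the rest into "Other"
by_n <- fct_lump_n(animals, 2)

# Inspect the resulting levels
levels(by_min)
levels(by_prop)
levels(by_n)
```

In each case the kept levels should be cat and dog, with bird and fish combined into the single lumped level.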
The examples below use a modified version of the msleep dataset called msleep_fctorder. Learn more about this dataset with get_help("msleep").
In this modified dataset, the order column has been coerced into a factor type (instead of character), and all NA values have been removed from that column. (Notice below, the order column is annotated <fct> since it’s a factor).
# Show the modified msleep dataset, msleep_fctorder, with head()
head(msleep_fctorder)
## # A tibble: 6 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <chr> <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Owl … Aotus omni Prim… <NA> 17 1.8 NA 7 0.0155 0.48
## 2 Moun… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6 NA 1.35
## 3 Grea… Blar… omni Sori… lc 14.9 2.3 0.133 9.1 0.00029 0.019
## 4 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20 0.423 600
## 5 Thre… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6 NA 3.85
## 6 Nort… Call… carni Carn… vu 8.7 1.4 0.383 15.3 NA 20.5
# To help guide you through examples, use the base R function table() to count the number of each order
# This shows us which categories are more/less frequent
table(msleep_fctorder$order)
##
## Afrosoricida Artiodactyla Carnivora Cetacea Chiroptera
## 1 5 7 1 2
## Cingulata Didelphimorphia Diprotodontia Erinaceomorpha Hyracoidea
## 2 2 2 2 3
## Lagomorpha Perissodactyla Pilosa Primates Rodentia
## 1 3 1 10 13
## Scandentia Soricomorpha
## 1 5
# Lump together all levels occurring fewer than 5 times
msleep_fctorder %>%
  mutate(order = fct_lump_min(order, 5)) -> msleep_fctorder_ex1
# Show new levels to confirm they are updated: Notice the new category "Other" contains all the lumped levels
levels(msleep_fctorder_ex1$order)
## [1] "Artiodactyla" "Carnivora" "Primates" "Rodentia" "Soricomorpha" "Other"
# Lump together all levels occurring fewer than 5 times, and name the new category "Fewer than 5x"
msleep_fctorder %>%
  mutate(order = fct_lump_min(order, 5, other_level = "Fewer than 5x")) -> msleep_fctorder_ex2
# Show new levels to confirm they are updated: Notice the new category "Fewer than 5x" contains all the lumped levels
levels(msleep_fctorder_ex2$order)
## [1] "Artiodactyla" "Carnivora" "Primates" "Rodentia" "Soricomorpha"
## [6] "Fewer than 5x"
# Lump together all levels occurring in fewer than 10% of rows
msleep_fctorder %>%
  mutate(order = fct_lump_prop(order, 0.1)) -> msleep_fctorder_ex3
# Show new levels to confirm they are updated
levels(msleep_fctorder_ex3$order)
## [1] "Carnivora" "Primates" "Rodentia" "Other"
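To see why only three orders survive the 10% cutoff, the proportions can be worked out by hand from the table() counts shown earlier. The block below is a base R sketch of that arithmetic, not how fct_lump_prop() is implemented; `order_counts` is a hypothetical name for the counts copied from that output.

```r
# Counts copied from the table(msleep_fctorder$order) output shown earlier
order_counts <- c(
  Afrosoricida = 1, Artiodactyla = 5, Carnivora = 7, Cetacea = 1,
  Chiroptera = 2, Cingulata = 2, Didelphimorphia = 2, Diprotodontia = 2,
  Erinaceomorpha = 2, Hyracoidea = 3, Lagomorpha = 1, Perissodactyla = 3,
  Pilosa = 1, Primates = 10, Rodentia = 13, Scandentia = 1, Soricomorpha = 5
)

# Total number of rows, and each order's share of those rows
total <- sum(order_counts)
props <- order_counts / total

# Orders whose share exceeds the 0.1 cutoff are the ones left unlumped
kept <- names(props)[props > 0.1]
kept
```

With 61 rows in total, an order needs more than 6.1 rows to clear the cutoff, so Artiodactyla and Soricomorpha (5 rows each) are lumped along with the rarer orders.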
# Lump together all levels EXCEPT the 3 most common levels
msleep_fctorder %>%
  mutate(order = fct_lump_n(order, 3)) -> msleep_fctorder_ex4
# Show new levels to confirm they are updated
levels(msleep_fctorder_ex4$order)
## [1] "Carnivora" "Primates" "Rodentia" "Other"
# Without re-writing the column, lump the levels for _plotting purposes only_
# Provide, for example, fct_lump_n(VARIABLE, the n value) to ggplot2::aes() to lump levels in your plot
# This affects the x-axis labeling, so it is best practice to clean up with `labs()`
ggplot(msleep_fctorder) +
  # Lump everything EXCEPT the top 4 levels, and use a custom name for the lumped level
  aes(x = fct_lump_n(order, 4,
                     other_level = "Additional taxonomic orders"),
      y = awake) +
  geom_boxplot() +
  labs(x = "order")
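One detail worth knowing for fct_lump_n(): when levels tie at the cutoff, more than n levels can be kept. In the counts shown earlier, Artiodactyla and Soricomorpha both occur 5 times, so they tie for 4th place. The sketch below uses a made-up factor (`letters_f` is hypothetical) to illustrate the default tie handling and the `ties.method` argument; treat it as an illustration under these assumptions rather than a full account of the tie rules.

```r
library(forcats)

# Hypothetical toy factor: a x3, b x2, c x2, d x1, e x1 (b and c tie for 2nd place)
letters_f <- factor(c("a", "a", "a", "b", "b", "c", "c", "d", "e"))

# Default tie handling (ties.method = "min"): both tied levels are kept
default_ties <- fct_lump_n(letters_f, 2)

# ties.method = "first": ties are broken by position, so exactly 2 levels are kept
first_ties <- fct_lump_n(letters_f, 2, ties.method = "first")

levels(default_ties)
levels(first_ties)
```

If an exact number of categories matters for a plot, checking levels() on the lumped factor before plotting is a cheap way to confirm what was kept.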