forcats::fct_lump_*()
   get_help() docs


Description

The fct_lump_*() functions are part of the {forcats} package, which is part of the {tidyverse}.

{forcats} contains several related functions for lumping together categories (levels) in a factor variable to condense into fewer overall categories. Some of the most useful lumping functions include the following:

{forcats} function    How it lumps
fct_lump_min()        Lumps together all levels that occur fewer than a specified minimum number of times into a single level
fct_lump_prop()       Lumps together all levels that account for less than a specified proportion of all observations into a single level
fct_lump_n()          Lumps together all levels except for the specified n most frequent levels

These functions will combine all lumped levels into a single new level, “Other,” whose name can be customized. These functions are very useful for working with a factor variable that has a very high number of categories, which may be “overly-busy” to view all at once, especially in a plot.

Lumping factor levels is commonly done to reduce the number of categories shown on an axis or legend when plotting a factor variable with the {ggplot2} library.

To use these functions, you need to either load the {forcats} library first, or always call them with :: notation, for example forcats::fct_lump_n().

# Load the library
library(forcats)
# Or, load the full tidyverse:
library(tidyverse)

# Or, use :: notation
forcats::fct_lump_min()
forcats::fct_lump_prop()
forcats::fct_lump_n()

Conceptual Usage

fct_lump_min(factor_variable, 
             the minimum number of times, 
             other_level = "Name you want to use if not 'Other'")

fct_lump_prop(factor_variable, 
              the minimum proportion of times,
              other_level = "Name you want to use if not 'Other'")


fct_lump_n(factor_variable, 
           the number of high-frequency levels NOT to lump,
           other_level = "Name you want to use if not 'Other'")
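
As a rough illustration of the templates above, the sketch below fills them in with a small made-up factor. The object x and its values are invented for demonstration only and are not part of any dataset used in these docs.

```r
library(forcats)

# Hypothetical toy factor: "a" x6, "b" x4, "c" x2, "d" x1, "e" x1 (14 values total)
x <- factor(c(rep("a", 6), rep("b", 4), rep("c", 2), "d", "e"))

# Keep levels that occur at least 2 times; lump the rest into "Other"
by_min <- fct_lump_min(x, 2)
levels(by_min)

# Keep levels that make up more than 20% of the values; lump the rest
by_prop <- fct_lump_prop(x, 0.2)
levels(by_prop)

# Keep only the single most frequent level, and rename the lumped level
by_n <- fct_lump_n(x, 1, other_level = "Everything else")
levels(by_n)
```

In each call the first argument is the factor, the second is the threshold, and other_level is optional.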

Examples

The examples below use a modified version of the msleep dataset called msleep_fctorder. Learn more about this dataset with get_help("msleep").

In this modified dataset, the order column has been coerced into a factor type (instead of character), and all NA values have been removed from that column. (Notice below, the order column is annotated <fct> since it’s a factor).

# Show the modified msleep dataset, msleep_fctorder, with head()
head(msleep_fctorder)
## # A tibble: 6 × 11
##   name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake  brainwt  bodywt
##   <chr> <chr> <chr> <fct> <chr>              <dbl>     <dbl>       <dbl> <dbl>    <dbl>   <dbl>
## 1 Owl … Aotus omni  Prim… <NA>                17         1.8      NA       7    0.0155    0.48 
## 2 Moun… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6 NA         1.35 
## 3 Grea… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1  0.00029   0.019
## 4 Cow   Bos   herbi Arti… domesticated         4         0.7       0.667  20    0.423   600    
## 5 Thre… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6 NA         3.85 
## 6 Nort… Call… carni Carn… vu                   8.7       1.4       0.383  15.3 NA        20.5

# To help guide you through the examples, use the base R function table() to count the number of rows in each order
# This shows us which categories are more/less frequent
table(msleep_fctorder$order)
## 
##    Afrosoricida    Artiodactyla       Carnivora         Cetacea      Chiroptera 
##               1               5               7               1               2 
##       Cingulata Didelphimorphia   Diprotodontia  Erinaceomorpha      Hyracoidea 
##               2               2               2               2               3 
##      Lagomorpha  Perissodactyla          Pilosa        Primates        Rodentia 
##               1               3               1              10              13 
##      Scandentia    Soricomorpha 
##               1               5


# Lump together all levels occurring fewer than 5 times
msleep_fctorder %>%
  mutate(order = fct_lump_min(order, 5)) -> msleep_fctorder_ex1

# Show new levels to confirm they are updated: Notice the new category "Other" contains all the lumped levels
levels(msleep_fctorder_ex1$order)
## [1] "Artiodactyla" "Carnivora"    "Primates"     "Rodentia"     "Soricomorpha" "Other"


# Lump together all levels occurring fewer than 5 times, and name the new category "Fewer than 5x"
msleep_fctorder %>%
  mutate(order = fct_lump_min(order, 5, other_level = "Fewer than 5x")) -> msleep_fctorder_ex2

# Show new levels to confirm they are updated: Notice the new category "Fewer than 5x" contains all the lumped levels
levels(msleep_fctorder_ex2$order)
## [1] "Artiodactyla"  "Carnivora"     "Primates"      "Rodentia"      "Soricomorpha" 
## [6] "Fewer than 5x"


# Lump together all levels occurring in fewer than 10% of rows
msleep_fctorder %>%
  mutate(order = fct_lump_prop(order, 0.1)) -> msleep_fctorder_ex3

# Show new levels to confirm they are updated
levels(msleep_fctorder_ex3$order)
## [1] "Carnivora" "Primates"  "Rodentia"  "Other"
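
To see why only three orders survive the 10% cutoff, the arithmetic can be redone by hand. This is a sketch using base R only: the order_counts vector is copied from the table() output shown earlier, and the variable names are made up for illustration. It assumes a level is kept when its share of rows exceeds the cutoff; no order here sits exactly at 10%, so the choice between a strict and a non-strict comparison does not change the result.

```r
# Counts copied by hand from the table(msleep_fctorder$order) output
order_counts <- c(
  Afrosoricida = 1, Artiodactyla = 5, Carnivora = 7, Cetacea = 1,
  Chiroptera = 2, Cingulata = 2, Didelphimorphia = 2, Diprotodontia = 2,
  Erinaceomorpha = 2, Hyracoidea = 3, Lagomorpha = 1, Perissodactyla = 3,
  Pilosa = 1, Primates = 10, Rodentia = 13, Scandentia = 1, Soricomorpha = 5
)

# Total number of rows, and the row count that corresponds to 10%
total_rows <- sum(order_counts)
cutoff_rows <- 0.1 * total_rows

# Share of rows for each order
order_share <- order_counts / total_rows

# Orders whose share is above the 10% cutoff are the ones that are kept
kept_orders <- names(order_counts)[order_share > 0.1]
kept_orders
```

With 61 rows in total, the cutoff works out to 6.1 rows, so Artiodactyla and Soricomorpha (5 rows each) fall below it while Carnivora (7 rows) stays.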


# Lump together all levels EXCEPT the 3 most common levels
msleep_fctorder %>%
  mutate(order = fct_lump_n(order, 3)) -> msleep_fctorder_ex4

# Show new levels to confirm they are updated
levels(msleep_fctorder_ex4$order)
## [1] "Carnivora" "Primates"  "Rodentia"  "Other"


# Without re-writing the column, lump the levels for _plotting purposes only_
# Provide, for example, fct_lump_n(VARIABLE, the n value) to ggplot2::aes() to lump levels in your plot
# This affects the x-axis labeling, so it is best practice to clean up with `labs()`
ggplot(msleep_fctorder) +
  # Lump everything EXCEPT the top 4 levels, and use a custom name for the lumped level
  aes(x = fct_lump_n(order, 4, 
                     other_level = "Additional taxonomic orders"), 
      y = awake) + 
  geom_boxplot() + 
  labs(x = "order")
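
One detail worth checking in the plot above: in the table() output, Artiodactyla and Soricomorpha are tied at 5 rows each, so there is no unique fourth most common order. The sketch below rebuilds a stand-in factor from those counts (order_counts and orders are made-up names; the real data lives in msleep_fctorder). It relies on the ties.method argument of fct_lump_n(), which is passed on to base R's rank() and, as far as the forcats documentation indicates, defaults to "min", so both tied levels are kept.

```r
library(forcats)

# Counts copied by hand from the table(msleep_fctorder$order) output
order_counts <- c(
  Afrosoricida = 1, Artiodactyla = 5, Carnivora = 7, Cetacea = 1,
  Chiroptera = 2, Cingulata = 2, Didelphimorphia = 2, Diprotodontia = 2,
  Erinaceomorpha = 2, Hyracoidea = 3, Lagomorpha = 1, Perissodactyla = 3,
  Pilosa = 1, Primates = 10, Rodentia = 13, Scandentia = 1, Soricomorpha = 5
)

# Stand-in factor with the same level frequencies as the order column
orders <- factor(rep(names(order_counts), times = order_counts))

# Default tie handling: levels tied for 4th place are both kept
lumped_default <- fct_lump_n(orders, 4,
                             other_level = "Additional taxonomic orders")
levels(lumped_default)

# ties.method = "first" breaks the tie by level order, keeping exactly 4 levels
lumped_first <- fct_lump_n(orders, 4,
                           other_level = "Additional taxonomic orders",
                           ties.method = "first")
levels(lumped_first)
```

If the plot needs exactly n boxes plus the lumped one, setting ties.method explicitly is one way to make that predictable.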