20 Aggregate

R includes a number of commands to apply functions on splits of your data. aggregate() is a powerful tool to perform such “group-by” operations.

The function accepts either:

a formula as the first argument and a data.frame passed to the data argument
an R object (vector, data.frame, list) as the first argument and one or more factors passed to the by argument

We shall see how to perform each operation below with each approach.

The formula interface might be easier to work with interactively on the console. Note that while you can programmatically create a formula, it is easier to use vector inputs when calling aggregate() programmatically.
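As a minimal sketch of the programmatic case, the snippet below runs the same aggregation from column names held in character variables, first by building a formula with reformulate() and then by indexing the data.frame directly. The toy data frame `df` and the variables `value_col` and `group_col` are made up for illustration.

```r
# Toy data, made up for illustration
df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Column names held in variables, as they would be inside a function
value_col <- "value"
group_col <- "group"

# Formula route: reformulate() builds `value ~ group` from the strings
f <- reformulate(group_col, response = value_col)
by_formula <- aggregate(f, data = df, FUN = mean)

# Object route: index the data.frame with the same strings
by_objects <- aggregate(df[value_col], by = df[group_col], FUN = mean)
```

Both calls return one row per group with columns named group and value; the object route simply skips the formula-building step.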

For this example, we shall use the penguins data, which is included in base R as of version 4.5.0.

str(penguins)

'data.frame':   344 obs. of  8 variables:
 $ species    : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island     : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_len   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_dep   : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_len: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass  : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex        : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year       : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

20.1 Single variable by single grouping

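Note that the formula method defaults to na.action = na.omit, so rows with a missing value in any variable used in the formula are dropped before the function is applied.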

Using the formula interface:

aggregate(bill_len ~ species,
          data = penguins,
          mean, na.rm = TRUE)

    species bill_len
1    Adelie 38.79139
2 Chinstrap 48.83382
3    Gentoo 47.50488

Using R objects directly:

aggregate(penguins$bill_len,
          by = list(penguins$species),
          mean, na.rm = TRUE)

    Group.1        x
1    Adelie 38.79139
2 Chinstrap 48.83382
3    Gentoo 47.50488

Note that, unlike the formula notation, if your input is an unnamed vector, the output columns receive the default names Group.1 and x.

If instead of passing a vector, you pass a data.frame or list with one or more named elements, the output includes the names:

aggregate(penguins["bill_len"],
          by = penguins["species"],
          mean, na.rm = TRUE)

    species bill_len
1    Adelie 38.79139
2 Chinstrap 48.83382
3    Gentoo 47.50488

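Creating a list instead of indexing the given data.frame also allows you to set custom names. Note that non-syntactic names are converted to syntactic ones, so `Bill length` appears as Bill.length in the output: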

aggregate(list(`Bill length` = penguins$bill_len),
          by = list(Species = penguins$species),
          mean, na.rm = TRUE)

    Species Bill.length
1    Adelie    38.79139
2 Chinstrap    48.83382
3    Gentoo    47.50488
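If the exact label matters, for instance in a printed table, one option is to rename the columns of the result after aggregating. A minimal sketch on a made-up data frame (`df` and its values are illustrative only):

```r
# Toy data, made up for illustration
df <- data.frame(
  species = c("A", "A", "B"),
  bill_len = c(10, 20, 30)
)

res <- aggregate(list(`Bill length` = df$bill_len),
                 by = list(Species = df$species),
                 mean)

# Names as returned by aggregate(): the space has become a dot
original_names <- names(res)

# Assign the non-syntactic name after aggregating
names(res)[2] <- "Bill length"
```

A column with a non-syntactic name must then be accessed with backticks or with res[["Bill length"]].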

20.2 Multiple variables by single grouping

Formula notation:

aggregate(cbind(bill_len, flipper_len) ~ species,
          data = penguins,
          mean)

    species bill_len flipper_len
1    Adelie 38.79139    189.9536
2 Chinstrap 48.83382    195.8235
3    Gentoo 47.50488    217.1870

Objects:

aggregate(penguins[, c("bill_len", "flipper_len")],
          by = list(Species = penguins$species),
          mean, na.rm = TRUE)

    Species bill_len flipper_len
1    Adelie 38.79139    189.9536
2 Chinstrap 48.83382    195.8235
3    Gentoo 47.50488    217.1870
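The two calls above return the same numbers, as happens when the missing values of both columns fall on the same rows. That need not hold in general: the formula method's na.omit drops a row when any variable in the formula is missing, whereas na.rm = TRUE in the object interface removes missing values separately for each column. The sketch below uses a made-up data frame `df` to show the difference, and passes na.action = na.pass to make the formula call behave like the object call.

```r
# Made-up data: x is missing in one row where y is not
df <- data.frame(
  g = c("a", "a", "b", "b"),
  x = c(1, NA, 3, 5),
  y = c(10, 20, 30, 40)
)

# Formula interface: row 2 is dropped entirely, so y = 20 is lost too
res_formula <- aggregate(cbind(x, y) ~ g, data = df, FUN = mean)

# Object interface: NAs are removed separately for each column
res_objects <- aggregate(df[c("x", "y")], by = df["g"],
                         FUN = mean, na.rm = TRUE)

# Formula interface again, keeping all rows and letting mean() drop NAs
res_pass <- aggregate(cbind(x, y) ~ g, data = df,
                      FUN = mean, na.rm = TRUE, na.action = na.pass)

res_formula$y  # 10 35
res_objects$y  # 15 35
res_pass$y     # 15 35
```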

20.3 Single variable by multiple groups

Formula notation:

aggregate(bill_len ~ species + island, data = penguins, mean)

    species    island bill_len
1    Adelie    Biscoe 38.97500
2    Gentoo    Biscoe 47.50488
3    Adelie     Dream 38.50179
4 Chinstrap     Dream 48.83382
5    Adelie Torgersen 38.95098

Objects:

aggregate(penguins["bill_len"],
          by = list(Species = penguins$species, 
                    Island = penguins$island),
          mean, na.rm = TRUE)

    Species    Island bill_len
1    Adelie    Biscoe 38.97500
2    Gentoo    Biscoe 47.50488
3    Adelie     Dream 38.50179
4 Chinstrap     Dream 48.83382
5    Adelie Torgersen 38.95098

20.4 Multiple variables by multiple groupings

Formula notation:

aggregate(cbind(bill_len, flipper_len) ~ species + island,
          data = penguins, mean)

    species    island bill_len flipper_len
1    Adelie    Biscoe 38.97500    188.7955
2    Gentoo    Biscoe 47.50488    217.1870
3    Adelie     Dream 38.50179    189.7321
4 Chinstrap     Dream 48.83382    195.8235
5    Adelie Torgersen 38.95098    191.1961

Objects:

aggregate(penguins[, c("bill_len", "flipper_len")],
          by = list(Species = penguins$species, 
                    Island = penguins$island),
          mean, na.rm = TRUE)

    Species    Island bill_len flipper_len
1    Adelie    Biscoe 38.97500    188.7955
2    Gentoo    Biscoe 47.50488    217.1870
3    Adelie     Dream 38.50179    189.7321
4 Chinstrap     Dream 48.83382    195.8235
5    Adelie Torgersen 38.95098    191.1961

20.5 Using `with()`

R’s with() allows you to use expressions of the form with(data, expression). data can be a data.frame, list, or environment, and within the expression you can refer to any element of data directly by its name.

For example, with(df, expression) means you can use the data.frame’s column names directly within the expression without the need to use df[["column_name"]] or df$column_name:

with(penguins,
     aggregate(list(`Bill length` = bill_len),
               by = list(Species = species),
               mean, na.rm = TRUE))

    Species Bill.length
1    Adelie    38.79139
2 Chinstrap    48.83382
3    Gentoo    47.50488

20.6 See also

tapply() for an alternative method of applying a function on subsets of a single variable (probably faster).
For large datasets, it is recommended to use data.table for fast group-by data summarization.
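
For comparison, here is a minimal sketch of the tapply() equivalent of a single-variable, single-grouping aggregate() call, on a made-up data frame `df`. The main practical difference is the return type: aggregate() gives a data.frame, while tapply() gives an array named by group.

```r
# Toy data, made up for illustration
df <- data.frame(
  g = c("a", "a", "b", "b"),
  x = c(1, 3, 10, 20)
)

# aggregate() returns a data.frame with one row per group
agg <- aggregate(df["x"], by = df["g"], FUN = mean)

# tapply() returns a 1-d array whose names are the group labels
tap <- tapply(df$x, df$g, mean)

tap[["a"]]  # 2
tap[["b"]]  # 15
```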
