34 data.table basics

Figure 34.1: data.table significantly enhances the base R data.frame

34.1 `data.table` extends the functionality of the `data.frame`

Some of the ways in which a data.table differs from a data.frame:

Many operations can be performed within a data.table’s “frame” (dt[i, j, by]): filter cases, select columns & operate on columns, group-by operations
Access column names directly without quoting
Many operations can be performed “in-place” (i.e. with no assignment)
Working on large datasets (e.g. millions of rows) can be orders of magnitude faster with a data.table than a data.frame.

data.table operations remain as close as possible to data.frame operations, trying to extend rather than replace data.frame functionality.

data.table includes thorough and helpful error messages that often point to a solution. This includes common mistakes new users may make when trying commands that would work on a data.frame but not on a data.table.

34.1.1 Load the `data.table` package

library(data.table)


Attaching package: 'data.table'

The following object is masked from 'package:base':

    %notin%

34.2 Create a `data.table`

34.2.1 By assignment: `data.table()`

Let’s create a data.frame and a data.table to explore side by side.

df <- data.frame(A = 1:5,
                 B = c(1.2, 4.3, 9.7, 5.6, 8.1),
                 C = c("a", "b", "b", "a", "a"))
class(df)

[1] "data.frame"

df

  A   B C
1 1 1.2 a
2 2 4.3 b
3 3 9.7 b
4 4 5.6 a
5 5 8.1 a

data.table() syntax is similar to data.frame() (differs in some arguments)

dt <- data.table(A = 1:5,
                 B = c(1.2, 4.3, 9.7, 5.6, 8.1),
                 C = c("a", "b", "b", "a", "a"))
class(dt)

[1] "data.table" "data.frame"

dt

       A     B      C
   <int> <num> <char>
1:     1   1.2      a
2:     2   4.3      b
3:     3   9.7      b
4:     4   5.6      a
5:     5   8.1      a

Notice how a data.table object also inherits from data.frame. This means that if a method does not exist for data.table, the method for data.frame will be used (See classes and generic functions).

As part of improving efficiency, data.tables do away with row names. Instead of using row names, you should use a dedicated column or column with a row identifier/s (e.g. “ID”). this is advisable when working with data.frames as well.

A rather convenient option is to have data.tables print each column’s class below the column name. You can pass the argument class = TRUE to print() or set the global option datatable.print.class using options()

options(datatable.print.class = TRUE)
dt

       A     B      C
   <int> <num> <char>
1:     1   1.2      a
2:     2   4.3      b
3:     3   9.7      b
4:     4   5.6      a
5:     5   8.1      a

Same as with a data.frame, to automatically convert strings to factors, you can use the stringsAsFactors argument:

dt2 <- data.table(A = 1:5,
                  B = c(1.2, 4.3, 9.7, 5.6, 8.1),
                  C = c("a", "b", "b", "a", "a"),
                  stringsAsFactors = TRUE)
dt2

       A     B      C
   <int> <num> <fctr>
1:     1   1.2      a
2:     2   4.3      b
3:     3   9.7      b
4:     4   5.6      a
5:     5   8.1      a

34.2.2 By coercion: `as.data.table()`

dat <- data.frame(A = 1:5,
                  B = c(1.2, 4.3, 9.7, 5.6, 8.1),
                  C = c("a", "b", "b", "a", "a"),
                  stringsAsFactors = TRUE)
dat

  A   B C
1 1 1.2 a
2 2 4.3 b
3 3 9.7 b
4 4 5.6 a
5 5 8.1 a

dat2 <- as.data.table(dat)
dat2

       A     B      C
   <int> <num> <fctr>
1:     1   1.2      a
2:     2   4.3      b
3:     3   9.7      b
4:     4   5.6      a
5:     5   8.1      a

34.2.3 By coercion in-place: `setDT()`

setDT() converts a list or data.frame into a data.table in-place. This means the object passed to setDT() is changed and you do not need to assign the output to a new object.

dat <- data.frame(A = 1:5,
                  B = c(1.2, 4.3, 9.7, 5.6, 8.1),
                  C = c("a", "b", "b", "a", "a"))
class(dat)

[1] "data.frame"

setDT(dat)
class(dat)

[1] "data.table" "data.frame"

You can similarly convert a data.table to a data.frame, in-place:

setDF(dat)
class(dat)

[1] "data.frame"

34.3 Display `data.table` structure with `str()`

str() works the same (and you should keep using it!)

str(df)

'data.frame':   5 obs. of  3 variables:
 $ A: int  1 2 3 4 5
 $ B: num  1.2 4.3 9.7 5.6 8.1
 $ C: chr  "a" "b" "b" "a" ...

str(dt)

Classes 'data.table' and 'data.frame':  5 obs. of  3 variables:
 $ A: int  1 2 3 4 5
 $ B: num  1.2 4.3 9.7 5.6 8.1
 $ C: chr  "a" "b" "b" "a" ...
 - attr(*, ".internal.selfref")=<pointer: 0x10421bd00>

34.4 Combine `data.tables`

cbind() and rbind() work on data.tables the same as on data.frames:

dt1 <- data.table(a = 1:5)
dt2 <- data.table(b = 11:15)
cbind(dt1, dt2)

       a     b
   <int> <int>
1:     1    11
2:     2    12
3:     3    13
4:     4    14
5:     5    15

rbind(dt1, dt1)

34.5 Set column names in-place

dta <- data.table(
  ID = sample(8000:9000, size = 10),
  A = rnorm(10, mean = 47, sd = 8),
  W = rnorm(10, mean = 87, sd = 7)
)
dta

       ID        A        W
    <int>    <num>    <num>
 1:  8165 49.88224 97.25916
 2:  8658 54.50266 87.81571
 3:  8126 48.19081 92.55490
 4:  8108 39.21147 82.72322
 5:  8325 55.42676 90.96360
 6:  8109 48.04564 87.54903
 7:  8410 45.22314 97.25193
 8:  8834 55.43712 74.74517
 9:  8713 41.88000 92.55476
10:  8207 50.48279 87.33159

Use the syntax:

setnames(dt, old, new)

to change the column names of a data.table in-place.

Changes all column names:

setnames(dta, names(dta), c("Patient_ID", "Age", "Weight"))
dta

    Patient_ID      Age   Weight
         <int>    <num>    <num>
 1:       8165 49.88224 97.25916
 2:       8658 54.50266 87.81571
 3:       8126 48.19081 92.55490
 4:       8108 39.21147 82.72322
 5:       8325 55.42676 90.96360
 6:       8109 48.04564 87.54903
 7:       8410 45.22314 97.25193
 8:       8834 55.43712 74.74517
 9:       8713 41.88000 92.55476
10:       8207 50.48279 87.33159

Change subset of names:

old_names <- c("Age", "Weight")
setnames(dta, old_names, paste0(old_names, "_at_Admission"))
dta

    Patient_ID Age_at_Admission Weight_at_Admission
         <int>            <num>               <num>
 1:       8165         49.88224            97.25916
 2:       8658         54.50266            87.81571
 3:       8126         48.19081            92.55490
 4:       8108         39.21147            82.72322
 5:       8325         55.42676            90.96360
 6:       8109         48.04564            87.54903
 7:       8410         45.22314            97.25193
 8:       8834         55.43712            74.74517
 9:       8713         41.88000            92.55476
10:       8207         50.48279            87.33159

old argument can also be integer index of column(s).

For example, change the name of the first column:

setnames(dta, 1, "Hospital_ID")
dta

    Hospital_ID Age_at_Admission Weight_at_Admission
          <int>            <num>               <num>
 1:        8165         49.88224            97.25916
 2:        8658         54.50266            87.81571
 3:        8126         48.19081            92.55490
 4:        8108         39.21147            82.72322
 5:        8325         55.42676            90.96360
 6:        8109         48.04564            87.54903
 7:        8410         45.22314            97.25193
 8:        8834         55.43712            74.74517
 9:        8713         41.88000            92.55476
10:        8207         50.48279            87.33159

34.6 Filter rows

There are many similarities and some notable differences in how indexing works in a data.table vs. a data.frame.

Filtering rows with an integer or logical index is largely the same in a data.frame and a data.table, but in a data.table you can omit the comma to select all columns:

df[c(1, 3, 5), ]

  A   B C
1 1 1.2 a
3 3 9.7 b
5 5 8.1 a

dt[c(1, 3, 5), ]

       A     B      C
   <int> <num> <char>
1:     1   1.2      a
2:     3   9.7      b
3:     5   8.1      a

dt[c(1, 3, 5)]

       A     B      C
   <int> <num> <char>
1:     1   1.2      a
2:     3   9.7      b
3:     5   8.1      a

Using a variable that holds a row index, whether integer or logical:

rowid <- c(1, 3, 5)
df[rowid, ]

  A   B C
1 1 1.2 a
3 3 9.7 b
5 5 8.1 a

dt[rowid, ]

       A     B      C
   <int> <num> <char>
1:     1   1.2      a
2:     3   9.7      b
3:     5   8.1      a

dt[rowid]

       A     B      C
   <int> <num> <char>
1:     1   1.2      a
2:     3   9.7      b
3:     5   8.1      a

rowbn <- c(T, F, T, F, T)
df[rowbn, ]

  A   B C
1 1 1.2 a
3 3 9.7 b
5 5 8.1 a

dt[rowbn, ]

       A     B      C
   <int> <num> <char>
1:     1   1.2      a
2:     3   9.7      b
3:     5   8.1      a

dt[rowbn]

       A     B      C
   <int> <num> <char>
1:     1   1.2      a
2:     3   9.7      b
3:     5   8.1      a

34.6.1 Conditional filtering

As a reminder, there are a few ways to conditionally filter cases in a data.frame:

df[df$A > mean(df$A) & df$B > mean(df$B), ]

  A   B C
5 5 8.1 a

subset(df, A > mean(A) & B > mean(B))

  A   B C
5 5 8.1 a

with(df, df[A > mean(A) & B > mean(B), ])

  A   B C
5 5 8.1 a

data.table allows you to refer to column names directly and unquoted, which makes writing filter conditions easier/more compact:

dt[A > mean(A) & B > mean(B)]

       A     B      C
   <int> <num> <char>
1:     5   8.1      a

The data.table package also includes an S3 method for subset() that works the same way as with a data.frame:

subset(dt, A > mean(A) & B > mean(B))

       A     B      C
   <int> <num> <char>
1:     5   8.1      a

As another example, exclude cases based on missingness in a specific column:

adf <- as.data.frame(sapply(1:5, function(i) rnorm(10)))
adf |> head()

          V1          V2          V3          V4         V5
1  1.0611426 -0.07647069 -1.63253323  0.07868359  0.1939206
2  0.9268508 -1.40891896  0.54455183 -0.03876186 -0.3395723
3 -0.4462552 -1.28665638  0.20117403 -1.22941802  1.4916288
4 -0.9659414 -1.46358592 -2.53397075  0.37150097 -0.5726645
5 -0.9260891 -0.60676517 -0.93670596  0.13302694  1.4052265
6 -0.6103749 -0.92725026  0.05347929 -0.11491545  2.0434445

adf[1, 3] <- adf[3, 4] <- adf[5, 3] <- adf[7, 3] <- NA
adt <- as.data.table(adf)

adf[!is.na(adf$V3), ]

           V1         V2          V3          V4         V5
2   0.9268508 -1.4089190  0.54455183 -0.03876186 -0.3395723
3  -0.4462552 -1.2866564  0.20117403          NA  1.4916288
4  -0.9659414 -1.4635859 -2.53397075  0.37150097 -0.5726645
6  -0.6103749 -0.9272503  0.05347929 -0.11491545  2.0434445
8   1.1112820 -0.4844644 -0.32540925 -0.80632367 -1.4514447
9   0.1170070 -0.2714165 -0.36403259 -0.68801448 -1.2255089
10  0.6518039  1.3967466 -0.89891880  1.52829048  0.8003759

adt[!is.na(V3)]

           V1         V2          V3          V4         V5
        <num>      <num>       <num>       <num>      <num>
1:  0.9268508 -1.4089190  0.54455183 -0.03876186 -0.3395723
2: -0.4462552 -1.2866564  0.20117403          NA  1.4916288
3: -0.9659414 -1.4635859 -2.53397075  0.37150097 -0.5726645
4: -0.6103749 -0.9272503  0.05347929 -0.11491545  2.0434445
5:  1.1112820 -0.4844644 -0.32540925 -0.80632367 -1.4514447
6:  0.1170070 -0.2714165 -0.36403259 -0.68801448 -1.2255089
7:  0.6518039  1.3967466 -0.89891880  1.52829048  0.8003759

34.7 Select columns

34.7.1 By position(s)

Selecting a single column in data.table does not drop to a vector, similar to using drop = FALSE in a data.frame:

df[, 1]

[1] 1 2 3 4 5

df[, 1, drop = FALSE]

dt[, 1]

Double bracket indexing of a single column works the same on a data.frame and a data.table, returning a vector:

df[[2]]

[1] 1.2 4.3 9.7 5.6 8.1

dt[[2]]

[1] 1.2 4.3 9.7 5.6 8.1

A vector of column positions returns a smaller data.table, similar to how it returns a smaller data.frame :

df[, c(1, 2)]

dt[, c(1, 2)]

       A     B
   <int> <num>
1:     1   1.2
2:     2   4.3
3:     3   9.7
4:     4   5.6
5:     5   8.1

34.7.2 By name(s)

In data.table, you access column names directly without quoting or using the $ notation:

df[, "B"]

[1] 1.2 4.3 9.7 5.6 8.1

df$B

[1] 1.2 4.3 9.7 5.6 8.1

dt[, B]

[1] 1.2 4.3 9.7 5.6 8.1

Because of this, data.table requires a slightly different syntax to use a variable as a column index which can contain integer positions, logical index, or column names. While on a data.frame you can do pass an index vector directly:

colid <- c(1, 2)
colbn <- c(FALSE, TRUE, TRUE)
colnm <- c("A", "C")
df[, colid]

df[, colbn]

    B C
1 1.2 a
2 4.3 b
3 9.7 b
4 5.6 a
5 8.1 a

df[, colnm]

To do the same in a data.table, you must prefix the index vector with two dots:

dt[, ..colid]

       A     B
   <int> <num>
1:     1   1.2
2:     2   4.3
3:     3   9.7
4:     4   5.6
5:     5   8.1

dt[, ..colbn]

       B      C
   <num> <char>
1:   1.2      a
2:   4.3      b
3:   9.7      b
4:   5.6      a
5:   8.1      a

dt[, ..colnm]

       A      C
   <int> <char>
1:     1      a
2:     2      b
3:     3      b
4:     4      a
5:     5      a

Think of working inside the data.table frame (i.e. within the “[…]”) like an environment: you have direct access to the variables, i.e. columns within it. If you want to refer to variables outside the data.table, you must prefix their names with .. (similar to how you access the directory above your current working directory in the system shell).

Always read error messages carefully, no matter what function or package you are using. In the case of data.table, the error messages are very informative and often point to the solution.

See what happens if you try to use the data.frame syntax by accident:

dt[, colid]

Error in `[.data.table`:
! j (the 2nd argument inside [...]) is a single symbol but column name 'colid' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..colid]. The .. prefix conveys one-level-up similar to a file system path.

Selecting a single column by name returns a vector:

dt[, A]

[1] 1 2 3 4 5

Selecting one or more columns by name enclosed in list() or .() (which, in this case, is short for list()), always returns a data.table:

dt[, .(A)]

dt[, .(A, B)]

       A     B
   <int> <num>
1:     1   1.2
2:     2   4.3
3:     3   9.7
4:     4   5.6
5:     5   8.1

34.7.3 `.SD` & `.SDcols`

.SDcols is a special symbol that can be used to select columns of a data.table as an alternative to j. It can accept a vector of integer positions or column names. .SD is another special symbol that can be used in j and refers to either the entire data.table, or the subset defined by .SDcols, if present. The following can be used to select columns:

dt[, .SD, .SDcols = colid]

       A     B
   <int> <num>
1:     1   1.2
2:     2   4.3
3:     3   9.7
4:     4   5.6
5:     5   8.1

One of the main uses of .SD is shown below in combination with lapply().

34.8 Add new column in-place

Use := assignment to add a new column in the existing data.table.

dt[, AplusB := A + B]
dt

       A     B      C AplusB
   <int> <num> <char>  <num>
1:     1   1.2      a    2.2
2:     2   4.3      b    6.3
3:     3   9.7      b   12.7
4:     4   5.6      a    9.6
5:     5   8.1      a   13.1

Note how dt was modified even though we did not run dt <- dt[, AplusB := A + B]

let() can also be use for in-place assignment (it uses := internally)

dt[, let(AplusB = A + B)]

34.9 Add multiple columns in-place

You can add multiple new columns in-place and at the same time using the := operator with:

a character vector of new column names on the left hand side
a list of values on the right hand side

Here, we add two new columns, AtimesB and AoverB:

dt[, c("AtimesB", "AoverB") := list(A*B, A/B)]

It is often convenient to use lapply() on the right hand side, since it always returns a list:

vnames <- c("A", "B")
dt[, paste0("log", vnames) := lapply(.SD, log), .SDcols = vnames]
dt

       A     B      C AplusB AtimesB    AoverB      logA      logB
   <int> <num> <char>  <num>   <num>     <num>     <num>     <num>
1:     1   1.2      a    2.2     1.2 0.8333333 0.0000000 0.1823216
2:     2   4.3      b    6.3     8.6 0.4651163 0.6931472 1.4586150
3:     3   9.7      b   12.7    29.1 0.3092784 1.0986123 2.2721259
4:     4   5.6      a    9.6    22.4 0.7142857 1.3862944 1.7227666
5:     5   8.1      a   13.1    40.5 0.6172840 1.6094379 2.0918641

You can also use the “functional form” of :=, resulting in a little more awkward syntax:

dt[, `:=`(AminusB = A - B, AoverC = A / B)]
dt

       A     B      C AplusB AtimesB    AoverB      logA      logB AminusB
   <int> <num> <char>  <num>   <num>     <num>     <num>     <num>   <num>
1:     1   1.2      a    2.2     1.2 0.8333333 0.0000000 0.1823216    -0.2
2:     2   4.3      b    6.3     8.6 0.4651163 0.6931472 1.4586150    -2.3
3:     3   9.7      b   12.7    29.1 0.3092784 1.0986123 2.2721259    -6.7
4:     4   5.6      a    9.6    22.4 0.7142857 1.3862944 1.7227666    -1.6
5:     5   8.1      a   13.1    40.5 0.6172840 1.6094379 2.0918641    -3.1
      AoverC
       <num>
1: 0.8333333
2: 0.4651163
3: 0.3092784
4: 0.7142857
5: 0.6172840

let() is equivalent and offers a cleaner syntax:

dt[, let(
  AminusB = A - B,
  AoverC = A / B)]
dt

       A     B      C AplusB AtimesB    AoverB      logA      logB AminusB
   <int> <num> <char>  <num>   <num>     <num>     <num>     <num>   <num>
1:     1   1.2      a    2.2     1.2 0.8333333 0.0000000 0.1823216    -0.2
2:     2   4.3      b    6.3     8.6 0.4651163 0.6931472 1.4586150    -2.3
3:     3   9.7      b   12.7    29.1 0.3092784 1.0986123 2.2721259    -6.7
4:     4   5.6      a    9.6    22.4 0.7142857 1.3862944 1.7227666    -1.6
5:     5   8.1      a   13.1    40.5 0.6172840 1.6094379 2.0918641    -3.1
      AoverC
       <num>
1: 0.8333333
2: 0.4651163
3: 0.3092784
4: 0.7142857
5: 0.6172840

34.10 Convert column type

34.10.1 Assignment by reference with `:=`

Use any base R coercion function (as.*) to convert a column in-place using the := notation

dt[, A := as.numeric(A)]
dt

       A     B      C AplusB AtimesB    AoverB      logA      logB AminusB
   <num> <num> <char>  <num>   <num>     <num>     <num>     <num>   <num>
1:     1   1.2      a    2.2     1.2 0.8333333 0.0000000 0.1823216    -0.2
2:     2   4.3      b    6.3     8.6 0.4651163 0.6931472 1.4586150    -2.3
3:     3   9.7      b   12.7    29.1 0.3092784 1.0986123 2.2721259    -6.7
4:     4   5.6      a    9.6    22.4 0.7142857 1.3862944 1.7227666    -1.6
5:     5   8.1      a   13.1    40.5 0.6172840 1.6094379 2.0918641    -3.1
      AoverC
       <num>
1: 0.8333333
2: 0.4651163
3: 0.3092784
4: 0.7142857
5: 0.6172840

34.10.2 Delete columns in-place with `:=`

To delete a column, use := to set it to NULL:

dt[, AoverB := NULL]
dt

       A     B      C AplusB AtimesB      logA      logB AminusB    AoverC
   <num> <num> <char>  <num>   <num>     <num>     <num>   <num>     <num>
1:     1   1.2      a    2.2     1.2 0.0000000 0.1823216    -0.2 0.8333333
2:     2   4.3      b    6.3     8.6 0.6931472 1.4586150    -2.3 0.4651163
3:     3   9.7      b   12.7    29.1 1.0986123 2.2721259    -6.7 0.3092784
4:     4   5.6      a    9.6    22.4 1.3862944 1.7227666    -1.6 0.7142857
5:     5   8.1      a   13.1    40.5 1.6094379 2.0918641    -3.1 0.6172840

Delete multiple columns

dt[, c("logA", "logB") := NULL]

Or:

dt[, `:=`(AplusB = NULL, AminusB = NULL)]
dt

       A     B      C AtimesB    AoverC
   <num> <num> <char>   <num>     <num>
1:     1   1.2      a     1.2 0.8333333
2:     2   4.3      b     8.6 0.4651163
3:     3   9.7      b    29.1 0.3092784
4:     4   5.6      a    22.4 0.7142857
5:     5   8.1      a    40.5 0.6172840

34.10.3 Fast loop-able assignment with `set()`

data.table’s set() is a loop-able version of the := operator. Use it in a for loop to operate on multiple columns.

Syntax: set(dt, i, j, value)

dt the data.table to operate on
i optionally define which rows to operate on. i = NULL to operate on all rows
j column names or index to be assigned value
value values to be assigned to j by reference

As a simple example, transform the first two columns in-place by squaring:

for (i in 1:2) {
  set(dt, i = NULL, j = i, value = dt[[i]]^2)
}

34.11 Summarize

You can apply one or multiple summary functions on one or multiple columns. Surround the operations in list() or .() to output a new data.table holding the outputs of the operations, i.e. the input data.table remains unchanged.

A_summary <- dt[, .(A_max = max(A), A_min = min(A), A_sd = sd(A))]
A_summary

   A_max A_min    A_sd
   <num> <num>   <num>
1:    25     1 9.66954

Example: Get the standard deviation of all numeric columns:

numid <- sapply(dt, is.numeric)
dt_mean <- dt[, lapply(.SD, sd), .SDcols = numid]
dt_mean

         A        B  AtimesB    AoverC
     <num>    <num>    <num>     <num>
1: 9.66954 37.35521 15.74462 0.2060219

If your function returns more than one value, the output will have multiple rows:

dt_range <- dt[, lapply(.SD, range), .SDcols = numid]
dt_range

       A     B AtimesB    AoverC
   <num> <num>   <num>     <num>
1:     1  1.44     1.2 0.3092784
2:    25 94.09    40.5 0.8333333

34.12 Set order

You can set the row order of a data.table in-place based on one or multiple columns’ values using setorder()

dt <- data.table(PatientID = sample(1001:9999, size = 10),
                 Height = rnorm(10, mean = 175, sd = 14),
                 Weight = rnorm(10, mean = 78, sd = 10),
                 Group = factor(sample(c("A", "B"), size = 10, replace = TRUE)))
dt

    PatientID   Height   Weight  Group
        <int>    <num>    <num> <fctr>
 1:      1259 190.6105 75.69310      A
 2:      9508 207.1002 58.57327      A
 3:      5474 189.1668 82.46007      B
 4:      2635 177.7637 83.65862      A
 5:      6081 181.1175 64.27897      A
 6:      6082 156.2349 80.47722      A
 7:      2033 197.0302 72.64844      B
 8:      4264 168.1940 82.83728      B
 9:      1188 174.7667 89.06856      A
10:      6132 184.4643 79.27472      A

Let’s set the order by PatientID:

setorder(dt, PatientID)
dt

    PatientID   Height   Weight  Group
        <int>    <num>    <num> <fctr>
 1:      1188 174.7667 89.06856      A
 2:      1259 190.6105 75.69310      A
 3:      2033 197.0302 72.64844      B
 4:      2635 177.7637 83.65862      A
 5:      4264 168.1940 82.83728      B
 6:      5474 189.1668 82.46007      B
 7:      6081 181.1175 64.27897      A
 8:      6082 156.2349 80.47722      A
 9:      6132 184.4643 79.27472      A
10:      9508 207.1002 58.57327      A

Let’s re-order, always in-place, by group and then by height:

setorder(dt, Group, Height)
dt

    PatientID   Height   Weight  Group
        <int>    <num>    <num> <fctr>
 1:      6082 156.2349 80.47722      A
 2:      1188 174.7667 89.06856      A
 3:      2635 177.7637 83.65862      A
 4:      6081 181.1175 64.27897      A
 5:      6132 184.4643 79.27472      A
 6:      1259 190.6105 75.69310      A
 7:      9508 207.1002 58.57327      A
 8:      4264 168.1940 82.83728      B
 9:      5474 189.1668 82.46007      B
10:      2033 197.0302 72.64844      B

34.13 Group-by operations

Up to now, we have learned how to use the data.table frame dat[i, j] to filter cases in i or add/remove/transform columns in-place in j. dat[i, j, by] allows to perform operations separately on groups of cases.

dt <- data.table(A = 1:5,
                 B = c(1.2, 4.3, 9.7, 5.6, 8.1),
                 C = rnorm(5),
                 Group = c("a", "b", "b", "a", "a"))
dt

       A     B          C  Group
   <int> <num>      <num> <char>
1:     1   1.2 -0.3482912      a
2:     2   4.3  0.2924772      b
3:     3   9.7 -0.2112419      b
4:     4   5.6 -0.3364494      a
5:     5   8.1  0.3825026      a

34.13.1 Group-by summary

As we’ve seen, using .() or list() in j, returns a new data.table:

dt[, .(mean_A_by_Group = mean(A)), by = Group]

    Group mean_A_by_Group
   <char>           <num>
1:      a        3.333333
2:      b        2.500000

dt[, list(median_B_by_Group = median(B)), by = Group]

    Group median_B_by_Group
   <char>             <num>
1:      a               5.6
2:      b               7.0

34.13.2 Group-by operation and assignment

Making an assignment with := in j, adds a column in-place. If you combine such an assignment with a group-by operation, the same value will be assigned to all cases of the group:

dt[, mean_A_by_Group := mean(A), by = Group]
dt

       A     B          C  Group mean_A_by_Group
   <int> <num>      <num> <char>           <num>
1:     1   1.2 -0.3482912      a        3.333333
2:     2   4.3  0.2924772      b        2.500000
3:     3   9.7 -0.2112419      b        2.500000
4:     4   5.6 -0.3364494      a        3.333333
5:     5   8.1  0.3825026      a        3.333333

34.14 Apply functions to columns

Any function that returns a list can be used in j to return a new data.table - therefore lapply is perfect for getting summary on multiple columns. This is another example where you have to use the .SD notation:

dt1 <- as.data.table(sapply(1:3, \(i) rnorm(10)))
dt1

             V1           V2         V3
          <num>        <num>      <num>
 1: -0.86353557 -0.227302365  0.9566594
 2:  0.53419963 -0.944049435  0.2303239
 3: -0.94818034  0.045335003  0.8980656
 4: -0.11008160 -0.001435367  0.4107448
 5:  0.24012701  1.626194039 -0.4623279
 6: -1.53714593 -1.083761179  0.6637888
 7:  1.35400215 -1.242644045 -1.4980473
 8: -0.90270771 -0.129723768  0.3419001
 9: -0.08707918 -0.435085844 -0.6597992
10:  0.26244452  0.765344388  1.5186647

setnames(dt1, names(dt1), c("Alpha", "Beta", "Gamma"))
dt1[, lapply(.SD, mean)]

        Alpha       Beta     Gamma
        <num>      <num>     <num>
1: -0.2057957 -0.1627129 0.2399973

You can specify which columns to operate on using the .SDcols argument:

dt2 <- data.table(A = 1:5,
                  B = c(1.2, 4.3, 9.7, 5.6, 8.1),
                  C = rnorm(5),
                  Group = c("a", "b", "b", "a", "a"))
dt2

       A     B          C  Group
   <int> <num>      <num> <char>
1:     1   1.2 -0.8056993      a
2:     2   4.3 -1.0611151      b
3:     3   9.7  0.6312908      b
4:     4   5.6 -1.5834605      a
5:     5   8.1  0.8796774      a

dt2[, lapply(.SD, mean), .SDcols = 1:2]

       A     B
   <num> <num>
1:     3  5.78

# same as
dt2[, lapply(.SD, mean), .SDcols = c("A", "B")]

       A     B
   <num> <num>
1:     3  5.78

cols <- c("A", "B")
dt2[, lapply(.SD, mean), .SDcols = cols]

       A     B
   <num> <num>
1:     3  5.78

You can combine .SDcols and by:

dt2[, lapply(.SD, median), .SDcols = c("B", "C"), by = Group]

    Group     B          C
   <char> <num>      <num>
1:      a   5.6 -0.8056993
2:      b   7.0 -0.2149122

Create multiple new columns from transformation of existing and store with custom prefix:

dt1

          Alpha         Beta      Gamma
          <num>        <num>      <num>
 1: -0.86353557 -0.227302365  0.9566594
 2:  0.53419963 -0.944049435  0.2303239
 3: -0.94818034  0.045335003  0.8980656
 4: -0.11008160 -0.001435367  0.4107448
 5:  0.24012701  1.626194039 -0.4623279
 6: -1.53714593 -1.083761179  0.6637888
 7:  1.35400215 -1.242644045 -1.4980473
 8: -0.90270771 -0.129723768  0.3419001
 9: -0.08707918 -0.435085844 -0.6597992
10:  0.26244452  0.765344388  1.5186647

dt1[, paste0(names(dt1), "_abs") := lapply(.SD, abs)]
dt1

          Alpha         Beta      Gamma  Alpha_abs    Beta_abs Gamma_abs
          <num>        <num>      <num>      <num>       <num>     <num>
 1: -0.86353557 -0.227302365  0.9566594 0.86353557 0.227302365 0.9566594
 2:  0.53419963 -0.944049435  0.2303239 0.53419963 0.944049435 0.2303239
 3: -0.94818034  0.045335003  0.8980656 0.94818034 0.045335003 0.8980656
 4: -0.11008160 -0.001435367  0.4107448 0.11008160 0.001435367 0.4107448
 5:  0.24012701  1.626194039 -0.4623279 0.24012701 1.626194039 0.4623279
 6: -1.53714593 -1.083761179  0.6637888 1.53714593 1.083761179 0.6637888
 7:  1.35400215 -1.242644045 -1.4980473 1.35400215 1.242644045 1.4980473
 8: -0.90270771 -0.129723768  0.3419001 0.90270771 0.129723768 0.3419001
 9: -0.08707918 -0.435085844 -0.6597992 0.08707918 0.435085844 0.6597992
10:  0.26244452  0.765344388  1.5186647 0.26244452 0.765344388 1.5186647

dt2

       A     B          C  Group
   <int> <num>      <num> <char>
1:     1   1.2 -0.8056993      a
2:     2   4.3 -1.0611151      b
3:     3   9.7  0.6312908      b
4:     4   5.6 -1.5834605      a
5:     5   8.1  0.8796774      a

cols <- c("A", "C")
dt2[, paste0(cols, "_groupMean") := lapply(.SD, mean), .SDcols = cols, by = Group]
dt2

       A     B          C  Group A_groupMean C_groupMean
   <int> <num>      <num> <char>       <num>       <num>
1:     1   1.2 -0.8056993      a    3.333333  -0.5031608
2:     2   4.3 -1.0611151      b    2.500000  -0.2149122
3:     3   9.7  0.6312908      b    2.500000  -0.2149122
4:     4   5.6 -1.5834605      a    3.333333  -0.5031608
5:     5   8.1  0.8796774      a    3.333333  -0.5031608

34.15 Row-wise operations

dt <- data.table(a = 1:5, b = 11:15, c = 21:25, 
                 d = 31:35, e = 41:45)
dt

       a     b     c     d     e
   <int> <int> <int> <int> <int>
1:     1    11    21    31    41
2:     2    12    22    32    42
3:     3    13    23    33    43
4:     4    14    24    34    44
5:     5    15    25    35    45

To operate row-wise, we can use by = 1:nrow(dt). For example, to add a column, in-place, with row-wise sums of variables b through d:

dt[, bcd.sum := sum(.SD[, b:d]), by = 1:nrow(dt)]
dt

       a     b     c     d     e bcd.sum
   <int> <int> <int> <int> <int>   <int>
1:     1    11    21    31    41      63
2:     2    12    22    32    42      66
3:     3    13    23    33    43      69
4:     4    14    24    34    44      72
5:     5    15    25    35    45      75

34.16 Watch out for `data.table` error messages

For example

34.1 data.table extends the functionality of the data.frame

34.1.1 Load the data.table package

34.2 Create a data.table

34.2.1 By assignment: data.table()

34.2.2 By coercion: as.data.table()

34.2.3 By coercion in-place: setDT()

34.3 Display data.table structure with str()

34.4 Combine data.tables

34.5 Set column names in-place

34.6 Filter rows

34.6.1 Conditional filtering

34.7 Select columns

34.7.1 By position(s)

34.7.2 By name(s)

34.7.3 .SD & .SDcols

34.8 Add new column in-place

34.9 Add multiple columns in-place

34.10 Convert column type

34.10.1 Assignment by reference with :=

34.10.2 Delete columns in-place with :=

34.10.3 Fast loop-able assignment with set()

34.11 Summarize

34.12 Set order

34.13 Group-by operations

34.13.1 Group-by summary

34.13.2 Group-by operation and assignment

34.14 Apply functions to columns

34.15 Row-wise operations

34.16 Watch out for data.table error messages

34.17 Resources

34.1 `data.table` extends the functionality of the `data.frame`

34.1.1 Load the `data.table` package

34.2 Create a `data.table`

34.2.1 By assignment: `data.table()`

34.2.2 By coercion: `as.data.table()`

34.2.3 By coercion in-place: `setDT()`

34.3 Display `data.table` structure with `str()`

34.4 Combine `data.tables`

34.7.3 `.SD` & `.SDcols`

34.10.1 Assignment by reference with `:=`

34.10.2 Delete columns in-place with `:=`

34.10.3 Fast loop-able assignment with `set()`

34.16 Watch out for `data.table` error messages