library(data.table)
options(datatable.print.class = TRUE)36 Reshaping
36.1 Long to wide using dcast()
36.1.1 Example 1: key-value pairs
Using the same example seen in Chapter 26:
dat_long <- data.frame(
Account_ID = c(8001, 8002, 8003, 8004, 8001, 8002, 8003, 8004,
8001, 8002, 8003, 8004, 8001, 8002, 8003, 8004),
Age = c(67.8017038366664, 42.9198507293701, 46.2301756642422,
39.665983196671, 67.8017038366664, 42.9198507293701,
46.2301756642422, 39.665983196671, 67.8017038366664,
42.9198507293701, 46.2301756642422, 39.665983196671,
67.8017038366664, 42.9198507293701, 46.2301756642422,
39.665983196671),
Admission = c("ED", "Planned", "Planned", "ED", "ED", "Planned",
"Planned", "ED", "ED", "Planned", "Planned", "ED", "ED", "Planned",
"Planned", "ED"),
Lab_key = c("RBC", "RBC", "RBC", "RBC", "WBC", "WBC", "WBC", "WBC",
"Hematocrit", "Hematocrit", "Hematocrit", "Hematocrit",
"Hemoglobin", "Hemoglobin", "Hemoglobin", "Hemoglobin"),
Lab_value = c(4.63449321082268, 3.34968550627897, 4.27037213597765,
4.93897736897793, 8374.22887757195, 7612.37380499927,
8759.27855519425, 6972.28096216548, 36.272693147236,
40.5716317809522, 39.9888624177955, 39.8786884058422,
12.6188444991545, 12.1739747363806, 15.1293426442183,
14.8885696185238)
)
dat_long_dt <- as.data.table(dat_long)
dat_long_dt Account_ID Age Admission Lab_key Lab_value
<num> <num> <char> <char> <num>
1: 8001 67.80170 ED RBC 4.634493
2: 8002 42.91985 Planned RBC 3.349686
3: 8003 46.23018 Planned RBC 4.270372
4: 8004 39.66598 ED RBC 4.938977
5: 8001 67.80170 ED WBC 8374.228878
6: 8002 42.91985 Planned WBC 7612.373805
7: 8003 46.23018 Planned WBC 8759.278555
8: 8004 39.66598 ED WBC 6972.280962
9: 8001 67.80170 ED Hematocrit 36.272693
10: 8002 42.91985 Planned Hematocrit 40.571632
11: 8003 46.23018 Planned Hematocrit 39.988862
12: 8004 39.66598 ED Hematocrit 39.878688
13: 8001 67.80170 ED Hemoglobin 12.618844
14: 8002 42.91985 Planned Hemoglobin 12.173975
15: 8003 46.23018 Planned Hemoglobin 15.129343
16: 8004 39.66598 ED Hemoglobin 14.888570
data.table’s long to wide procedure is defined with a convenient formula notation:
dat_long2wide_dt <- dcast(dat_long_dt,
Account_ID + Age + Admission ~ Lab_key,
value.var = "Lab_value")
dat_long2wide_dtKey: <Account_ID, Age, Admission>
Account_ID Age Admission Hematocrit Hemoglobin RBC WBC
<num> <num> <char> <num> <num> <num> <num>
1: 8001 67.80170 ED 36.27269 12.61884 4.634493 8374.229
2: 8002 42.91985 Planned 40.57163 12.17397 3.349686 7612.374
3: 8003 46.23018 Planned 39.98886 15.12934 4.270372 8759.279
4: 8004 39.66598 ED 39.87869 14.88857 4.938977 6972.281
Instead of listing all variables you can use ..., which corresponds to all variables not otherwise mentioned in the formula or in value.var:
dcast(dat_long_dt,
... ~ Lab_key,
value.var = "Lab_value")Key: <Account_ID, Age, Admission>
Account_ID Age Admission Hematocrit Hemoglobin RBC WBC
<num> <num> <char> <num> <num> <num> <num>
1: 8001 67.80170 ED 36.27269 12.61884 4.634493 8374.229
2: 8002 42.91985 Planned 40.57163 12.17397 3.349686 7612.374
3: 8003 46.23018 Planned 39.98886 15.12934 4.270372 8759.279
4: 8004 39.66598 ED 39.87869 14.88857 4.938977 6972.281
36.1.2 Example 2: Timepoints
Another simple synthetic dataset:
ID Timepoint Score
<int> <fctr> <int>
1: 1 Timepoint_A 11
2: 2 Timepoint_A 12
3: 3 Timepoint_A 13
4: 4 Timepoint_A 14
5: 1 Timepoint_B 21
6: 2 Timepoint_B 22
7: 3 Timepoint_B 23
8: 4 Timepoint_B 24
9: 1 Timepoint_C 51
10: 2 Timepoint_C 52
11: 3 Timepoint_C 53
12: 4 Timepoint_C 54
Using dcast() again to convert to wide format:
dt_long ID Timepoint Score
<int> <fctr> <int>
1: 1 Timepoint_A 11
2: 2 Timepoint_A 12
3: 3 Timepoint_A 13
4: 4 Timepoint_A 14
5: 1 Timepoint_B 21
6: 2 Timepoint_B 22
7: 3 Timepoint_B 23
8: 4 Timepoint_B 24
9: 1 Timepoint_C 51
10: 2 Timepoint_C 52
11: 3 Timepoint_C 53
12: 4 Timepoint_C 54
dcast(dt_long, ID ~ Timepoint,
value.var = "Score")Key: <ID>
ID Timepoint_A Timepoint_B Timepoint_C
<int> <int> <int> <int>
1: 1 11 21 51
2: 2 12 22 52
3: 3 13 23 53
4: 4 14 24 54
36.1.3 dcast() + aggregate
If your ID ~ Timepoint combination does not define a unique row in your input dataset, you need to specify an aggregate function.
For example, suppose you have four subjects with IDs “A”, “B”, “C”, “D” who had a couple variables measured 3 times in the AM and 3 times in the PM.
ID Timepoint Var1 Var2
<char> <char> <num> <num>
1: A AM 9.585122 20.08581
2: A AM 10.989040 20.93060
3: A AM 10.172208 20.24579
4: A PM 8.636507 18.71478
5: A PM 10.176692 18.79135
6: A PM 10.224978 20.40239
7: B AM NA 19.70714
8: B AM 10.974991 21.04424
9: B AM 9.721080 NA
10: B PM 9.268843 20.44342
11: B PM 10.866414 19.07485
12: B PM 10.534914 21.80724
13: C AM NA 19.68740
14: C AM 10.474090 21.86420
15: C AM 9.986519 20.67795
16: C PM 9.642834 19.75448
17: C PM 10.602526 19.63772
18: C PM 11.082441 18.20862
19: D AM 11.076375 NA
20: D AM NA NA
21: D AM NA NA
22: D PM 9.813396 18.52429
23: D PM 10.794294 20.36052
24: D PM 10.804556 19.51999
ID Timepoint Var1 Var2
<char> <char> <num> <num>
To convert the above data.table to wide format and get mean AM and PM values using the fun.aggregate argument:
Key: <ID>
ID Var1_AM Var1_PM Var2_AM Var2_PM
<char> <num> <num> <num> <num>
1: A 10.24879 9.679392 20.42074 19.30284
2: B 10.34804 10.223390 20.37569 20.44184
3: C 10.23030 10.442601 20.74318 19.20027
4: D 11.07638 10.470749 NaN 19.46827
You can apply multiple aggregating functions by passing a list to fun.aggregate:
Warning in max(Var2, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
Warning in min(Var2, na.rm = TRUE): no non-missing arguments to min; returning
Inf
Key: <ID>
ID Var1_mean_AM Var1_mean_PM Var2_mean_AM Var2_mean_PM Var1_max_AM
<char> <num> <num> <num> <num> <num>
1: A 10.24879 9.679392 20.42074 19.30284 10.98904
2: B 10.34804 10.223390 20.37569 20.44184 10.97499
3: C 10.23030 10.442601 20.74318 19.20027 10.47409
4: D 11.07638 10.470749 NaN 19.46827 11.07638
Var1_max_PM Var2_max_AM Var2_max_PM Var1_min_AM Var1_min_PM Var2_min_AM
<num> <num> <num> <num> <num> <num>
1: 10.22498 20.93060 20.40239 9.585122 8.636507 20.08581
2: 10.86641 21.04424 21.80724 9.721080 9.268843 19.70714
3: 11.08244 21.86420 19.75448 9.986519 9.642834 19.68740
4: 10.80456 -Inf 20.36052 11.076375 9.813396 Inf
Var2_min_PM
<num>
1: 18.71478
2: 19.07485
3: 18.20862
4: 18.52429
Note how na.rm = TRUE was successfully applied to all aggregating functions
36.2 Wide to long: melt()
36.2.1 Example 1: Timepoints
dt_wide <- data.table(
ID = 1:4,
Timepoint_A = 11:14,
Timepoint_B = 21:24,
Timepoint_C = 51:54)
dt_wide ID Timepoint_A Timepoint_B Timepoint_C
<int> <int> <int> <int>
1: 1 11 21 51
2: 2 12 22 52
3: 3 13 23 53
4: 4 14 24 54
Use melt() to convert from wide to long format:
ID Timepoint Score
<int> <fctr> <int>
1: 1 Timepoint_A 11
2: 2 Timepoint_A 12
3: 3 Timepoint_A 13
4: 4 Timepoint_A 14
5: 1 Timepoint_B 21
6: 2 Timepoint_B 22
7: 3 Timepoint_B 23
8: 4 Timepoint_B 24
9: 1 Timepoint_C 51
10: 2 Timepoint_C 52
11: 3 Timepoint_C 53
12: 4 Timepoint_C 54
36.2.2 Example 2: Key-value pairs
Using the same synthetic data as in Chapter 26:
set.seed(2022)
dt_wide <- data.table(
Account_ID = c(8001, 8002, 8003, 8004),
Age = rnorm(4, mean = 57, sd = 12),
RBC = rnorm(4, mean = 4.8, sd = 0.5),
WBC = rnorm(4, mean = 7250, sd = 1500),
Hematocrit = rnorm(4, mean = 40.2, sd = 4),
Hemoglobin = rnorm(4, mean = 13.6, sd = 1.5),
Admission = sample(c("ED", "Planned"), size = 4, replace = TRUE)
)
dt_wide Account_ID Age RBC WBC Hematocrit Hemoglobin Admission
<num> <num> <num> <num> <num> <num> <char>
1: 8001 67.80170 4.634493 8374.229 36.27269 12.61884 ED
2: 8002 42.91985 3.349686 7612.374 40.57163 12.17397 Planned
3: 8003 46.23018 4.270372 8759.279 39.98886 15.12934 Planned
4: 8004 39.66598 4.938977 6972.281 39.87869 14.88857 Planned
Convert from wide to long format using melt() again:
Account_ID Age Admission Lab_key Lab_value
<num> <num> <char> <fctr> <num>
1: 8001 67.80170 ED RBC 4.634493
2: 8002 42.91985 Planned RBC 3.349686
3: 8003 46.23018 Planned RBC 4.270372
4: 8004 39.66598 Planned RBC 4.938977
5: 8001 67.80170 ED WBC 8374.228878
6: 8002 42.91985 Planned WBC 7612.373805
7: 8003 46.23018 Planned WBC 8759.278555
8: 8004 39.66598 Planned WBC 6972.280962
9: 8001 67.80170 ED Hematocrit 36.272693
10: 8002 42.91985 Planned Hematocrit 40.571632
11: 8003 46.23018 Planned Hematocrit 39.988862
12: 8004 39.66598 Planned Hematocrit 39.878688
13: 8001 67.80170 ED Hemoglobin 12.618844
14: 8002 42.91985 Planned Hemoglobin 12.173975
15: 8003 46.23018 Planned Hemoglobin 15.129343
16: 8004 39.66598 Planned Hemoglobin 14.888570
or using column names:
Account_ID Age Admission Lab_key Lab_value
<num> <num> <char> <fctr> <num>
1: 8001 67.80170 ED RBC 4.634493
2: 8002 42.91985 Planned RBC 3.349686
3: 8003 46.23018 Planned RBC 4.270372
4: 8004 39.66598 Planned RBC 4.938977
5: 8001 67.80170 ED WBC 8374.228878
6: 8002 42.91985 Planned WBC 7612.373805
7: 8003 46.23018 Planned WBC 8759.278555
8: 8004 39.66598 Planned WBC 6972.280962
9: 8001 67.80170 ED Hematocrit 36.272693
10: 8002 42.91985 Planned Hematocrit 40.571632
11: 8003 46.23018 Planned Hematocrit 39.988862
12: 8004 39.66598 Planned Hematocrit 39.878688
13: 8001 67.80170 ED Hemoglobin 12.618844
14: 8002 42.91985 Planned Hemoglobin 12.173975
15: 8003 46.23018 Planned Hemoglobin 15.129343
16: 8004 39.66598 Planned Hemoglobin 14.888570
If desired, you can set the ID column as the key, which will sort the data.table by its values:
setorder(dt_long, "Account_ID")
dt_long Account_ID Age Admission Lab_key Lab_value
<num> <num> <char> <fctr> <num>
1: 8001 67.80170 ED RBC 4.634493
2: 8001 67.80170 ED WBC 8374.228878
3: 8001 67.80170 ED Hematocrit 36.272693
4: 8001 67.80170 ED Hemoglobin 12.618844
5: 8002 42.91985 Planned RBC 3.349686
6: 8002 42.91985 Planned WBC 7612.373805
7: 8002 42.91985 Planned Hematocrit 40.571632
8: 8002 42.91985 Planned Hemoglobin 12.173975
9: 8003 46.23018 Planned RBC 4.270372
10: 8003 46.23018 Planned WBC 8759.278555
11: 8003 46.23018 Planned Hematocrit 39.988862
12: 8003 46.23018 Planned Hemoglobin 15.129343
13: 8004 39.66598 Planned RBC 4.938977
14: 8004 39.66598 Planned WBC 6972.280962
15: 8004 39.66598 Planned Hematocrit 39.878688
16: 8004 39.66598 Planned Hemoglobin 14.888570