25 data.frame basics

Core data.frame operations:

Operation	Functions
Filter	`[` `subset(x)`
Find duplicate rows	Get index of duplicated: `duplicated(x)` Keep only unique rows: `unique(x)`
Select	`[`; `[[`; `$`; `subset(df, select=)`
Add column(s)	`$`, `[`, `[[` `cbind(x1, x2)` `transform(x, new=)`
Add row(s)	`rbind(...)`
Reshape	Long-to-Wide: `reshape(direction = "wide")` Wide-to-Long: `reshape(direction = "long")`
Join	Inner: `merge(x, y, all=FALSE)` Outer: `merge(x, y, all=TRUE)` Left: `merge(x, y, all.x=TRUE)` Right: `merge(x, y, all.y=TRUE)`

25.1 Synthetic Data

set.seed(2022)
n <- 10
dat <- data.frame(
  PID = 8001:8010,
  Sex = sample(c("Male", "Female"), size = n, replace = TRUE),
  Age = rnorm(n, mean = 58, sd = 12),
  Height = rnorm(n, mean = 178, sd = 10),
  Weight = rnorm(n, mean = 80, sd = 12),
  Group = sample(c("A", "B"), size = n, replace = TRUE)
)

25.2 Filter

Filter & select operations are performed by indexing rows and cases, respectively.

25.3 Column and row names

Let’s start with a simple example data.frame:

df <- data.frame(PID = c(111:119),
                 Hospital = c("UCSF", "HUP", "Stanford",
                             "Stanford", "UCSF", "HUP", 
                             "HUP", "Stanford", "UCSF"),
                Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
df

  PID Hospital Age Sex
1 111     UCSF  22   1
2 112      HUP  34   1
3 113 Stanford  41   0
4 114 Stanford  19   1
5 115     UCSF  53   0
6 116      HUP  21   0
7 117      HUP  63   1
8 118 Stanford  22   0
9 119     UCSF  19   0

The optional row.names argument (see data.frame usage in the R documentation) can be used to define row names at the time of the data frame creation. It accepts either:

a single integer or a character specifying a column of the data.frame being created whose values should be used as row names, or
a vector of values (character or integer) of the row names to be used

For example, we can use the “PID” column:

df <- data.frame(PID = c(111:119),
                 Hospital = c("UCSF", "HUP", "Stanford",
                             "Stanford", "UCSF", "HUP", 
                             "HUP", "Stanford", "UCSF"),
                Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
                row.names = "PID")

It is recommended to not use/depend on row names to identify or index data.frames, and instead include a column of case IDs.

We can get column names and row names with colnames() and rownames(), respectively:

colnames(df)

[1] "Hospital" "Age"      "Sex"

rownames(df)

[1] "111" "112" "113" "114" "115" "116" "117" "118" "119"

To set new column or row names use the form:

colnames(df) <- new.colnames

rownames(df) <- new.rownames

where new.colnames and new.rownames is a character vector.

You can rename all columns/rows or use indexing to replace specific names:

Rename all rows:

rownames(df) <- paste0("Patient_", 1:9)
df

          Hospital Age Sex
Patient_1     UCSF  22   1
Patient_2      HUP  34   1
Patient_3 Stanford  41   0
Patient_4 Stanford  19   1
Patient_5     UCSF  53   0
Patient_6      HUP  21   0
Patient_7      HUP  63   1
Patient_8 Stanford  22   0
Patient_9     UCSF  19   0

Rename first two columns:

colnames(df)[1:2] <- c("Center", "Age_at_Dx")
df

            Center Age_at_Dx Sex
Patient_1     UCSF        22   1
Patient_2      HUP        34   1
Patient_3 Stanford        41   0
Patient_4 Stanford        19   1
Patient_5     UCSF        53   0
Patient_6      HUP        21   0
Patient_7      HUP        63   1
Patient_8 Stanford        22   0
Patient_9     UCSF        19   0

25.4 Delete columns or rows

To delete a data.frame column, set it to NULL:

df$Sex <- NULL
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_3 Stanford        41
Patient_4 Stanford        19
Patient_5     UCSF        53
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19

To delete a data.frame row, you can “index it out”.

For example, to remove the third and fifths rows of the above data.frame using an integer index:

df <- df[-c(3, 5), ]
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_4 Stanford        19
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19

You can similarly exclude a row using a logical index. Logical indexing occurs usually following some filtering condition.

For example, exclude patients under 20 years old:

df <- df[!df$Age < 20, ]
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22

25.5 `subset()`

subset() allows you to

filter cases that meet certain conditions using the subset argument
select columns using the select argument

(head() returns the first few lines of a data frame. We use it to avoid printing too many lines)

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

iris_sl.gt.med <- subset(iris, Sepal.Length > median(Sepal.Length))

Note: You can use the column name Sepal.Length directly, i.e. unquoted and you don’t need to use iris$Sepal.Length.

This is called Non-Standard Evaluation (NSE).

x <- data.frame(one = 1:10,
                two = rnorm(10),
                group = c(rep("alpha", 4),  rep("beta", 6)))
subset(x, subset = two > 0, select = two)

          two
2  0.80776842
5  0.06035668
7  0.34027593
10 0.36817344

subset(x, two > 0, -one)

          two group
2  0.80776842 alpha
5  0.06035668  beta
7  0.34027593  beta
10 0.36817344  beta

subset(x, two > 0, two:one)

          two one
2  0.80776842   2
5  0.06035668   5
7  0.34027593   7
10 0.36817344  10

subset(x, two > 0, two:group)

          two group
2  0.80776842 alpha
5  0.06035668  beta
7  0.34027593  beta
10 0.36817344  beta

25.6 `split()`

Split a data frame into multiple data frames by groups defined by a factor:

x_by_group <- split(x, x$group)
x_by_group

$alpha
  one        two group
1   1 -0.2690377 alpha
2   2  0.8077684 alpha
3   3 -1.1247172 alpha
4   4 -1.4307880 alpha

$beta
   one         two group
5    5  0.06035668  beta
6    6 -0.79298250  beta
7    7  0.34027593  beta
8    8 -0.25946873  beta
9    9 -1.30484865  beta
10  10  0.36817344  beta

25.7 `with()`

Within a with() expression, you can access list elements or data.frame columns without quoting or using the $ operator:

with(x, one + two)

 [1]  0.7309623  2.8077684  1.8752828  2.5692120  5.0603567  5.2070175
 [7]  7.3402759  7.7405313  7.6951514 10.3681734

with(x, x[group == "alpha", ])

  one        two group
1   1 -0.2690377 alpha
2   2  0.8077684 alpha
3   3 -1.1247172 alpha
4   4 -1.4307880 alpha

with(x, x[two > 0, ])

   one        two group
2    2 0.80776842 alpha
5    5 0.06035668  beta
7    7 0.34027593  beta
10  10 0.36817344  beta

25.8 Feature transformation with `transform()`

Make up some data:

dat <- data.frame(Sex = c(0, 0, 1, 1, 0),
                  Height = c(1.5, 1.6, 1.55, 1.73, 1.8),
                  Weight = c(55, 70, 69, 76, 91))

dat <- transform(dat, BMI = Weight/Height^2)
dat

  Sex Height Weight      BMI
1   0   1.50     55 24.44444
2   0   1.60     70 27.34375
3   1   1.55     69 28.72008
4   1   1.73     76 25.39343
5   0   1.80     91 28.08642

transform() is probably not used too often, because it is trivial to do the same with direct assignment:

dat$BMI <- dat$Weight/dat$Height^2

but can be useful when adding multiple variables and/or used in a pipe:

dat |> 
  subset(Sex == 0) |> 
  transform(DeltaWeightFromMean = Weight - mean(Weight),
            BMI = Weight/Height^2,
            CI = Weight/Height^3)

  Sex Height Weight      BMI DeltaWeightFromMean       CI
1   0    1.5     55 24.44444                 -17 16.29630
2   0    1.6     70 27.34375                  -2 17.08984
5   0    1.8     91 28.08642                  19 15.60357

25.9 Identify and remove duplicated rows with `duplicated()` and `unique()`

When applied on a data.frame, the duplicated() function returns a logical index specifying the location of duplicated rows - specifically, a logical index of rows which are the duplicate of another row further up the data.frame. This means that if rows 20 and 23 are identical, duplicated() will return TRUE for row 23.

On the other hand, unique() will remove duplicate rows from a data.frame.

x <- data.frame(ID = c(203, 808, 909, 707, 808),
                Age = c(23, 44, 33, 42, 44))

duplicated(x)

[1] FALSE FALSE FALSE FALSE  TRUE

unique(x)

25.10 Order data.frame rows

Let’s create another synthetic data.frame:

set.seed(2022)
n <- 10
dat <- data.frame(
  PID = 8001:8010,
  Age = sample(20:50, n, replace = TRUE),
  Group = sample(LETTERS[1:2], n, replace = TRUE)
)
dat

    PID Age Group
1  8001  23     A
2  8002  38     A
3  8003  33     B
4  8004  42     B
5  8005  30     A
6  8006  23     A
7  8007  25     B
8  8008  50     A
9  8009  33     B
10 8010  47     B

To order data.frame rows by the values of one or more columns, you can use the order() function to create an index that can then be used to reorder the data.frame.

To order dat by Age in ascending order:

dat[order(dat$Age), ]

    PID Age Group
1  8001  23     A
6  8006  23     A
7  8007  25     B
5  8005  30     A
3  8003  33     B
9  8009  33     B
2  8002  38     A
4  8004  42     B
10 8010  47     B
8  8008  50     A

To order dat by Age in descending order:

dat[order(dat$Age, decreasing = TRUE), ]

    PID Age Group
8  8008  50     A
10 8010  47     B
4  8004  42     B
2  8002  38     A
3  8003  33     B
9  8009  33     B
5  8005  30     A
7  8007  25     B
1  8001  23     A
6  8006  23     A

Or equivalently, using the negative sign to indicate descending order:

dat[order(-dat$Age), ]

    PID Age Group
8  8008  50     A
10 8010  47     B
4  8004  42     B
2  8002  38     A
3  8003  33     B
9  8009  33     B
5  8005  30     A
7  8007  25     B
1  8001  23     A
6  8006  23     A

You can also order by multiple columns. For example, to order first by Group (A before B), then within each group by Age in ascending order:

dat[order(dat$Group, dat$Age), ]

    PID Age Group
1  8001  23     A
6  8006  23     A
5  8005  30     A
2  8002  38     A
8  8008  50     A
7  8007  25     B
3  8003  33     B
9  8009  33     B
4  8004  42     B
10 8010  47     B

Order by Group (A before B), then within each group by Age in descending order:

dat[order(dat$Group, -dat$Age), ]

    PID Age Group
8  8008  50     A
2  8002  38     A
5  8005  30     A
1  8001  23     A
6  8006  23     A
10 8010  47     B
4  8004  42     B
3  8003  33     B
9  8009  33     B
7  8007  25     B

In this case, we don’t want to specify decreasing = TRUE because that would reverse the order of both columns. The negative sign comes in handy here.

25.11 See also

See the data frame section for an introduction to data frames.

25.1 Synthetic Data

25.2 Filter

25.3 Column and row names

25.4 Delete columns or rows

25.5 subset()

25.6 split()

25.7 with()

25.8 Feature transformation with transform()

25.9 Identify and remove duplicated rows with duplicated() and unique()

25.10 Order data.frame rows

25.11 See also

25.5 `subset()`

25.6 `split()`

25.7 `with()`

25.8 Feature transformation with `transform()`

25.9 Identify and remove duplicated rows with `duplicated()` and `unique()`