Working with Expressions: Helper Functions for Advanced Data Manipulation

Introduction

Three blocks accept expressions: expression filter, mutate, and summarize.

Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers:

Common functions for mutate and summarize
Helper functions: across(), where(), if_any(), if_all(), pick()
Column selection helpers
Practical examples

Expression Blocks

Expression filter block: Logical expressions to keep rows

Expressions return TRUE/FALSE for each row
Use comparison operators: >, <, ==, !=, >=, <=
Combine with & (AND) or | (OR)
Helpers: if_any(), if_all()

Mutate block: Create or modify columns

Perform calculations on existing columns
Use arithmetic, logs, lags, cumulative functions
Helper: across() to transform multiple columns

Summarize block: Aggregate data

Reduce rows to summary statistics
Use aggregation functions: mean(), sum(), n()
Helpers: across() for multiple columns, pick() for custom functions

Useful Functions for Mutate

Arithmetic

+, -, *, /, ^ - basic operations
%/% - integer division
%% - remainder (modulo)

Logs and exponentials

log(), log2(), log10() - logarithms
exp() - exponential
sqrt() - square root

Offsets

lag() - previous values
lead() - next values

Cumulative aggregates

cumsum() - cumulative sum
cummean() - cumulative mean
cummin(), cummax() - cumulative min/max
cumprod() - cumulative product

Ranking

row_number() - sequential ranks (1, 2, 3, 4)
min_rank() - ranks with ties (1, 2, 2, 4)
dense_rank() - ranks without gaps (1, 2, 2, 3)
percent_rank() - percentile (0 to 1)
ntile(n) - divide into n bins

Logical comparisons

<, >, <=, >=, ==, !=
& (and), | (or), ! (not)
%in% - test membership

Examples

# Calculate percentage
across(c(hp, wt), \(x) x / sum(x) * 100)

# Lag differences
mpg - lag(mpg)

# Cumulative sums by group (use by parameter)
cumsum(sales)

# Rank values
min_rank(desc(hp))

See dplyr window functions for more.

Useful Functions for Summarize

Center

mean() - average
median() - middle value

Spread

sd() - standard deviation
IQR() - interquartile range
mad() - median absolute deviation

Range

min(), max() - minimum and maximum
quantile(x, 0.25) - percentiles

Position

first(), last(), nth(x, 2) - positional values

Count

n() - count rows
n_distinct() - count unique values
sum(!is.na(x)) - count non-missing

Sums and products

sum() - total
prod() - product

Examples

# Basic statistics
across(where(is.numeric), list(mean = mean, sd = sd))

# Count by group (use by parameter)
n()

# Multiple stats
list(
  avg = mean(hp),
  min = min(hp),
  max = max(hp),
  count = n()
)

For handling missing values, add na.rm = TRUE:

mean(hp, na.rm = TRUE)
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))

See dplyr summarise for more.

Column Selection Helpers

Select columns by name pattern or type (used inside across(), if_any(), if_all()):

everything() - all columns
starts_with("prefix") - columns starting with prefix
ends_with("suffix") - columns ending with suffix
contains("text") - columns containing text
where(is.numeric) - columns by type (is.character, is.factor)
c(col1, col2) - specific columns

Combine selections:

c(starts_with("Sepal"), ends_with("Width"))
where(is.numeric) & starts_with("x")

The across() Function

Apply the same operation to multiple columns.

Syntax: across(.cols, .fns, .names = NULL)

.cols - which columns (use selection helpers)
.fns - function(s) to apply
.names - control output names (default: {.col}_{.fn})

In Mutate Block

Transform multiple columns:

# Round all numeric columns
across(where(is.numeric), round)

# Scale to 0-1 range
across(c(mpg, hp, wt), \(x) x / max(x))

# Log transform with custom names
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")

Use \(x) to create anonymous functions where x represents the current column.

In Summarize Block

Calculate statistics for multiple columns:

# Mean of all numeric columns
across(where(is.numeric), mean)

# Multiple functions
across(c(hp, wt), list(mean = mean, sd = sd))

# With grouping (use by parameter)
across(everything(), n_distinct)

Custom names

# Default: col_fn
across(c(mpg, hp), list(mean = mean, sd = sd))
# Result: mpg_mean, mpg_sd, hp_mean, hp_sd

# Custom: fn.col
across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}.{.col}")
# Result: mean.mpg, sd.mpg, mean.hp, sd.hp

The pick() Function

Select columns as a data frame for custom functions.

Syntax: pick(.cols)

Use in summarize block with custom functions that need a data frame:

# With custom function
calc_stats(pick(everything()))

# Select specific columns
my_function(pick(c(hp, wt, mpg)))

Relationship to unpack parameter

When your expression returns a data frame, use the unpack option:

unpack = FALSE (default): Result is a single list-column
unpack = TRUE: Columns are spread into separate columns

Example:

# Custom function that returns data frame
calc_stats <- function(df) {
  data.frame(mean_x = mean(df$x), sd_x = sd(df$x))
}

# In summarize block with unpack = TRUE:
calc_stats(pick(everything()))
# Result: mean_x and sd_x as separate columns

Filter Helpers: if_any() and if_all()

Check conditions across multiple columns in the expression filter block.

if_any(): TRUE when condition is true for at least one column

# Rows with any NA
if_any(everything(), is.na)

# Any numeric column > 100
if_any(where(is.numeric), \(x) x > 100)

# Search across text columns
if_any(where(is.character), \(x) x == "setosa")

if_all(): TRUE when condition is true for all columns

# All numeric columns positive
if_all(where(is.numeric), \(x) x > 0)

# All width measurements > 2
if_all(ends_with("Width"), \(x) x > 2)

# No missing values
if_all(everything(), \(x) !is.na(x))

Common Patterns

Mutate Block

# Round numeric columns
across(where(is.numeric), round)

# Scale to max
across(c(mpg, hp, wt), \(x) x / max(x))

# Uppercase text
across(where(is.character), toupper)

# Log transform
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")

Summarize Block

# Means (add grouping with by parameter)
across(where(is.numeric), mean)

# Multiple statistics
across(c(hp, wt), list(mean = mean, sd = sd, min = min, max = max))

# Count non-missing
across(everything(), \(x) sum(!is.na(x)))

# Complete rows only
if_all(everything(), \(x) !is.na(x))

# Any negative
if_any(where(is.numeric), \(x) x < 0)

# All widths > threshold
if_all(ends_with("Width"), \(x) x > 2.5)

# Search text columns
if_any(where(is.character), \(x) grepl("pattern", x))

Tips

Start simple: Test on one column, then use across()

Check preview: Verify results in the block preview

Unpack option: In summarize, enable unpack when expressions return data frames

Combine helpers: Use & and | to combine selections

across(where(is.numeric) & starts_with("Sepal"), mean)

Function syntax:

Direct: round (no arguments needed)
Anonymous function: \(x) round(x, 2) (with arguments)
The x represents the current column

Missing values: Add na.rm = TRUE to aggregation functions

across(where(is.numeric), \(x) mean(x, na.rm = TRUE))

Learn More

For comprehensive documentation on column-wise operations, see:

Column-wise operations vignette - detailed guide to across(), where(), if_any(), and if_all()
across() reference - complete documentation with all arguments and options
R for Data Science (2e) - comprehensive tutorial on data transformation

These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.

Introduction

Expression Blocks

Useful Functions for Mutate

Arithmetic

Logs and exponentials

Offsets

Cumulative aggregates

Ranking

Logical comparisons

Examples

Useful Functions for Summarize

Center

Spread

Range

Position

Count

Sums and products

Examples

Column Selection Helpers

The across() Function

In Mutate Block

In Summarize Block

Custom names

The pick() Function

Relationship to unpack parameter

Filter Helpers: if_any() and if_all()

Common Patterns

Mutate Block

Summarize Block

Expression Filter Block

Tips

Learn More