Skip to contents

Introduction

Three blocks accept expressions: expression filter, mutate, and summarize.

Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers:

  • Common functions for mutate and summarize
  • Helper functions: across(), where(), if_any(), if_all(), pick()
  • Column selection helpers
  • Practical examples

See also: dplyr column-wise operations and dplyr window functions.

Expression Blocks

Expression filter block: Logical expressions to keep rows

  • Expressions return TRUE/FALSE for each row
  • Use comparison operators: >, <, ==, !=, >=, <=
  • Combine with & (AND) or | (OR)
  • Helpers: if_any(), if_all()

Mutate block: Create or modify columns

  • Perform calculations on existing columns
  • Use arithmetic, logs, lags, cumulative functions
  • Helper: across() to transform multiple columns

Summarize block: Aggregate data

  • Reduce rows to summary statistics
  • Use aggregation functions: mean(), sum(), n()
  • Helpers: across() for multiple columns, pick() for custom functions

Useful Functions for Mutate

Arithmetic

  • +, -, *, /, ^ - basic operations
  • %/% - integer division
  • %% - remainder (modulo)

Logs and exponentials

Offsets

  • lag() - previous values
  • lead() - next values

Cumulative aggregates

Ranking

  • row_number() - sequential ranks (1, 2, 3, 4)
  • min_rank() - ranks with ties (1, 2, 2, 4)
  • dense_rank() - ranks without gaps (1, 2, 2, 3)
  • percent_rank() - percentile (0 to 1)
  • ntile(n) - divide into n bins

Logical comparisons

  • <, >, <=, >=, ==, !=
  • & (and), | (or), ! (not)
  • %in% - test membership

Examples

# Calculate percentage
across(c(hp, wt), \(x) x / sum(x) * 100)

# Lag differences
mpg - lag(mpg)

# Cumulative sums by group (use by parameter)
cumsum(sales)

# Rank values
min_rank(desc(hp))

See dplyr window functions for more.

Useful Functions for Summarize

Center

Spread

  • sd() - standard deviation
  • IQR() - interquartile range
  • mad() - median absolute deviation

Range

  • min(), max() - minimum and maximum
  • quantile(x, 0.25) - percentiles

Position

  • first(), last(), nth(x, 2) - positional values

Count

  • n() - count rows
  • n_distinct() - count unique values
  • sum(!is.na(x)) - count non-missing

Sums and products

Examples

# Basic statistics
across(where(is.numeric), list(mean = mean, sd = sd))

# Count by group (use by parameter)
n()

# Multiple stats
list(
  avg = mean(hp),
  min = min(hp),
  max = max(hp),
  count = n()
)

For handling missing values, add na.rm = TRUE:

mean(hp, na.rm = TRUE)
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))

See dplyr summarise for more.

Column Selection Helpers

Select columns by name pattern or type (used inside across(), if_any(), if_all()):

  • everything() - all columns
  • starts_with("prefix") - columns starting with prefix
  • ends_with("suffix") - columns ending with suffix
  • contains("text") - columns containing text
  • where(is.numeric) - columns by type (is.character, is.factor)
  • c(col1, col2) - specific columns

Combine selections:

c(starts_with("Sepal"), ends_with("Width"))
where(is.numeric) & starts_with("x")

The across() Function

Apply the same operation to multiple columns.

Syntax: across(.cols, .fns, .names = NULL)

  • .cols - which columns (use selection helpers)
  • .fns - function(s) to apply
  • .names - control output names (default: {.col}_{.fn})

In Mutate Block

Transform multiple columns:

# Round all numeric columns
across(where(is.numeric), round)

# Scale to 0-1 range
across(c(mpg, hp, wt), \(x) x / max(x))

# Log transform with custom names
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")

Use \(x) to create anonymous functions where x represents the current column.

In Summarize Block

Calculate statistics for multiple columns:

# Mean of all numeric columns
across(where(is.numeric), mean)

# Multiple functions
across(c(hp, wt), list(mean = mean, sd = sd))

# With grouping (use by parameter)
across(everything(), n_distinct)

Custom names

# Default: col_fn
across(c(mpg, hp), list(mean = mean, sd = sd))
# Result: mpg_mean, mpg_sd, hp_mean, hp_sd

# Custom: fn.col
across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}.{.col}")
# Result: mean.mpg, sd.mpg, mean.hp, sd.hp

The pick() Function

Select columns as a data frame for custom functions.

Syntax: pick(.cols)

Use in summarize block with custom functions that need a data frame:

# With custom function
calc_stats(pick(everything()))

# Select specific columns
my_function(pick(c(hp, wt, mpg)))

Relationship to unpack parameter

When your expression returns a data frame, use the unpack option:

  • unpack = FALSE (default): Result is a single list-column
  • unpack = TRUE: Columns are spread into separate columns

Example:

# Custom function that returns data frame
calc_stats <- function(df) {
  data.frame(mean_x = mean(df$x), sd_x = sd(df$x))
}

# In summarize block with unpack = TRUE:
calc_stats(pick(everything()))
# Result: mean_x and sd_x as separate columns

Filter Helpers: if_any() and if_all()

Check conditions across multiple columns in the expression filter block.

if_any(): TRUE when condition is true for at least one column

# Rows with any NA
if_any(everything(), is.na)

# Any numeric column > 100
if_any(where(is.numeric), \(x) x > 100)

# Search across text columns
if_any(where(is.character), \(x) x == "setosa")

if_all(): TRUE when condition is true for all columns

# All numeric columns positive
if_all(where(is.numeric), \(x) x > 0)

# All width measurements > 2
if_all(ends_with("Width"), \(x) x > 2)

# No missing values
if_all(everything(), \(x) !is.na(x))

Common Patterns

Mutate Block

# Round numeric columns
across(where(is.numeric), round)

# Scale to max
across(c(mpg, hp, wt), \(x) x / max(x))

# Uppercase text
across(where(is.character), toupper)

# Log transform
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")

Summarize Block

# Means (add grouping with by parameter)
across(where(is.numeric), mean)

# Multiple statistics
across(c(hp, wt), list(mean = mean, sd = sd, min = min, max = max))

# Count non-missing
across(everything(), \(x) sum(!is.na(x)))

Expression Filter Block

# Complete rows only
if_all(everything(), \(x) !is.na(x))

# Any negative
if_any(where(is.numeric), \(x) x < 0)

# All widths > threshold
if_all(ends_with("Width"), \(x) x > 2.5)

# Search text columns
if_any(where(is.character), \(x) grepl("pattern", x))

Tips

Start simple: Test on one column, then use across()

Check preview: Verify results in the block preview

Unpack option: In summarize, enable unpack when expressions return data frames

Combine helpers: Use & and | to combine selections

across(where(is.numeric) & starts_with("Sepal"), mean)

Function syntax:

  • Direct: round (no arguments needed)
  • Anonymous function: \(x) round(x, 2) (with arguments)
  • The x represents the current column

Missing values: Add na.rm = TRUE to aggregation functions

across(where(is.numeric), \(x) mean(x, na.rm = TRUE))

Learn More

For comprehensive documentation on column-wise operations, see:

These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.