Working with Expressions: Helper Functions for Advanced Data Manipulation
Source:vignettes/expression-helpers.Rmd
expression-helpers.RmdIntroduction
Three blocks accept expressions: expression filter, mutate, and summarize.
Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers:
- Common functions for mutate and summarize
- Helper functions:
across(),where(),if_any(),if_all(),pick() - Column selection helpers
- Practical examples
See also: dplyr column-wise operations and dplyr window functions.
Expression Blocks
Expression filter block: Logical expressions to keep rows
- Expressions return TRUE/FALSE for each row
- Use comparison operators:
>,<,==,!=,>=,<= - Combine with
&(AND) or|(OR) - Helpers:
if_any(),if_all()
Mutate block: Create or modify columns
- Perform calculations on existing columns
- Use arithmetic, logs, lags, cumulative functions
- Helper:
across()to transform multiple columns
Summarize block: Aggregate data
Useful Functions for Mutate
Offsets
-
lag()- previous values -
lead()- next values
Ranking
-
row_number()- sequential ranks (1, 2, 3, 4) -
min_rank()- ranks with ties (1, 2, 2, 4) -
dense_rank()- ranks without gaps (1, 2, 2, 3) -
percent_rank()- percentile (0 to 1) -
ntile(n)- divide into n bins
Examples
# Calculate percentage
across(c(hp, wt), \(x) x / sum(x) * 100)
# Lag differences
mpg - lag(mpg)
# Cumulative sums by group (use by parameter)
cumsum(sales)
# Rank values
min_rank(desc(hp))
See dplyr window functions for more.
Useful Functions for Summarize
Position
-
first(),last(),nth(x, 2)- positional values
Examples
# Basic statistics
across(where(is.numeric), list(mean = mean, sd = sd))
# Count by group (use by parameter)
n()
# Multiple stats
list(
avg = mean(hp),
min = min(hp),
max = max(hp),
count = n()
)
For handling missing values, add na.rm = TRUE:
mean(hp, na.rm = TRUE)
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
See dplyr summarise for more.
Column Selection Helpers
Select columns by name pattern or type (used inside
across(), if_any(),
if_all()):
-
everything()- all columns -
starts_with("prefix")- columns starting with prefix -
ends_with("suffix")- columns ending with suffix -
contains("text")- columns containing text -
where(is.numeric)- columns by type (is.character,is.factor) -
c(col1, col2)- specific columns
Combine selections:
c(starts_with("Sepal"), ends_with("Width"))
where(is.numeric) & starts_with("x")
The across() Function
Apply the same operation to multiple columns.
Syntax:
across(.cols, .fns, .names = NULL)
-
.cols- which columns (use selection helpers) -
.fns- function(s) to apply -
.names- control output names (default:{.col}_{.fn})
In Mutate Block
Transform multiple columns:
# Round all numeric columns
across(where(is.numeric), round)
# Scale to 0-1 range
across(c(mpg, hp, wt), \(x) x / max(x))
# Log transform with custom names
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")
Use \(x) to create anonymous functions where
x represents the current column.
The pick() Function
Select columns as a data frame for custom functions.
Syntax: pick(.cols)
Use in summarize block with custom functions that need a data frame:
# With custom function
calc_stats(pick(everything()))
# Select specific columns
my_function(pick(c(hp, wt, mpg)))
Relationship to unpack parameter
When your expression returns a data frame, use the unpack option:
- unpack = FALSE (default): Result is a single list-column
- unpack = TRUE: Columns are spread into separate columns
Example:
# Custom function that returns data frame
calc_stats <- function(df) {
data.frame(mean_x = mean(df$x), sd_x = sd(df$x))
}
# In summarize block with unpack = TRUE:
calc_stats(pick(everything()))
# Result: mean_x and sd_x as separate columns
Filter Helpers: if_any() and if_all()
Check conditions across multiple columns in the expression filter block.
if_any(): TRUE when condition is true for at least one column
# Rows with any NA
if_any(everything(), is.na)
# Any numeric column > 100
if_any(where(is.numeric), \(x) x > 100)
# Search across text columns
if_any(where(is.character), \(x) x == "setosa")
if_all(): TRUE when condition is true for all columns
# All numeric columns positive
if_all(where(is.numeric), \(x) x > 0)
# All width measurements > 2
if_all(ends_with("Width"), \(x) x > 2)
# No missing values
if_all(everything(), \(x) !is.na(x))
Common Patterns
Mutate Block
# Round numeric columns
across(where(is.numeric), round)
# Scale to max
across(c(mpg, hp, wt), \(x) x / max(x))
# Uppercase text
across(where(is.character), toupper)
# Log transform
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")
Tips
Start simple: Test on one column, then use
across()
Check preview: Verify results in the block preview
Unpack option: In summarize, enable unpack when expressions return data frames
Combine helpers: Use & and
| to combine selections
across(where(is.numeric) & starts_with("Sepal"), mean)
Function syntax:
- Direct:
round(no arguments needed) - Anonymous function:
\(x) round(x, 2)(with arguments) - The
xrepresents the current column
Missing values: Add na.rm = TRUE to
aggregation functions
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
Learn More
For comprehensive documentation on column-wise operations, see:
- Column-wise operations vignette - detailed guide to across(), where(), if_any(), and if_all()
- across() reference - complete documentation with all arguments and options
- R for Data Science (2e) - comprehensive tutorial on data transformation
These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.