Data wrangling blocks

Introduction

blockr.dplyr provides interactive blocks for data wrangling. Each block offers a user interface for a specific data transformation task. Blocks can be connected together to create data transformation pipelines.

This package includes blocks for common dplyr operations (select, filter, arrange, mutate, summarize, join, bind) and tidyr operations (pivot, separate, unite).

Select Block

The select block chooses which columns to keep in your dataset.

Use the column selector to pick the columns you want. You can select multiple columns and reorder them by dragging. The order of selection determines the column order in the output.

The block includes a “distinct” option. When enabled, duplicate rows are removed from the result, keeping only unique combinations of the selected columns.

Filter Expression Block

The filter expression block keeps only rows that meet specific conditions using R expressions. Use this block for more elaborate filtering that cannot be achieved with simple value selection, such as numeric comparisons, calculations, or complex logical conditions.

Supported operators include >, <, ==, !=, >=, <= for comparisons, and %in% for checking membership in a set of values. Combine multiple conditions using & (AND) to require all conditions to be true, or | (OR) to require at least one condition to be true. The expression editor provides syntax highlighting and validates your expressions. Examples: mpg > 20, cyl == 4 | cyl == 6, hp > 100 & wt < 3.

Arrange Block

The arrange block sorts rows by column values. Select one or more columns to sort by, with each column having its own ascending or descending control.

When sorting by multiple columns, the order matters. The first column is the primary sort key. Rows with the same value in the first column are then sorted by the second column, and so on. Use the drag handles to reorder the sort columns.

Add columns using the “+” button and remove them using the “×” button. Toggle between ascending and descending order for each column independently.

Slice Block

The slice block selects specific rows based on different criteria. Choose from six slice types: head (first rows), tail (last rows), min (rows with smallest values), max (rows with largest values), sample (random selection), or custom (specific positions).

For head and tail types, specify the number of rows using n (count) or prop (proportion between 0 and 1). For min and max types, select an order_by column and enable with_ties if you want to include all rows with tied values. For sample type, optionally select a weight_by column for weighted sampling and enable replace for sampling with replacement.

The custom type accepts a rows expression like “1:5” or “c(1, 3, 5, 10)”. All slice types support grouping via the by parameter, which performs the slice operation within each group separately.

Mutate Expression Block

The mutate expression block creates new columns or modifies existing ones using R expressions. Add multiple expressions, each creating or updating a column. Each expression consists of a column name and an R expression that calculates its value.

Use mathematical operators (+, -, *, /, ^) and functions (sqrt(), log(), round(), etc.) in your expressions. Reference existing columns by name. You can also use conditional logic with ifelse() or dplyr::case_when().

Expression order matters: later expressions can reference columns created by earlier expressions in the same mutate block. The by parameter allows grouping, making column references operate within each group. Add expressions with the “+ Add Expression” button and remove them with the “×” button.

Rename Block

The rename block changes column names. Each rename operation maps a new name to an existing column. The interface shows the mapping as “new_name ← old_name” with a visual arrow indicator.

Select the existing column from a dropdown to ensure valid column names. Type the new name in the text field. Add multiple renames using the “+” button to rename several columns at once. Remove a rename operation with the “×” button.

The block validates that you don’t rename the same column twice and ensures column names don’t conflict with existing names.

Summarize Block

The summarize block calculates summary statistics using a point-and-click interface. Each summary consists of three parts: a name for the new column, an aggregation function selected from a dropdown, and the column to aggregate.

Available aggregation functions include mean, sum, minimum, maximum, count, count distinct, median, standard deviation, and more. Select the function from the dropdown and the column to apply it to.

Use the “Columns to group by” selector to group data before summarizing. When grouping is enabled, statistics are calculated separately for each group. Add multiple summaries using the “+ Add Summary” button.

For more complex aggregations using custom R expressions, use the summarize expression block instead.

Summarize Expression Block

The summarize expression block calculates summary statistics using R expressions. Use this block for more elaborate aggregations that require custom expressions, such as weighted means, ratios, or functions with specific parameters.

Enter expressions like mean(mpg), sum(hp), dplyr::n(), or more complex calculations like mean(mpg, na.rm = TRUE) or sum(hp) / dplyr::n(). The expression editor provides syntax highlighting and autocomplete (Ctrl+Space).

Use the “Columns to group by” selector to group data before summarizing. The “Show advanced options” section provides additional settings like the unpack option for handling functions that return data frames.

Join Block

The join block combines two datasets based on matching values in specified columns. Select from six join types that determine which rows are kept in the result.

Join types: left_join keeps all rows from the left dataset and matching rows from the right; right_join keeps all rows from the right dataset and matching rows from the left; inner_join keeps only rows that match in both datasets; full_join keeps all rows from both datasets; semi_join filters the left dataset to rows that have a match in the right; anti_join filters the left dataset to rows that do not have a match in the right.

The “Custom Column Mappings” interface supports both same-name joins (when columns have identical names) and different-name joins (when the matching columns have different names in each dataset). Add multiple join keys to match on multiple columns simultaneously. Enable “Use natural join” to automatically join on all common columns.

Bind Rows Block

The bind rows block stacks datasets vertically by matching column names. Rows from each input dataset are combined into a single output dataset.

Columns are matched by name. If datasets have different columns, the result includes all columns from all datasets. Missing columns are filled with NA values. The order of columns in the output follows the order they appear across all input datasets.

The “Show advanced options” section provides the id_name option which adds an identifier column that tracks which source dataset each row came from. This is useful when combining data from multiple sources and you need to maintain provenance.

Bind Columns Block

The bind columns block combines datasets side-by-side horizontally. Columns from each input dataset are placed next to each other in the output.

All input datasets must have exactly the same number of rows. The rows are combined by position: the first row from each dataset forms the first row of the output, the second rows form the second row of the output, and so on.

If datasets have columns with the same name, they are automatically renamed with numeric suffixes (e.g., “Sepal.Length…1”, “Sepal.Length…6”) to avoid conflicts.

tidyr Blocks

The following blocks provide reshaping operations from the tidyr package.

Pivot Longer Block

The pivot longer block reshapes data from wide to long format using tidyr::pivot_longer(). Use this when column names represent values of a variable rather than variables themselves.

Select which columns to pivot. These columns are transformed into two new columns: one containing the original column names (names_to parameter, default “name”) and another containing the values (values_to parameter, default “value”). Unselected columns remain as identifiers.

The “Show advanced options” section provides names_prefix (removes common prefixes from column names) and values_drop_na (removes rows where the value is NA). This is useful for reshaping time series data, survey responses, or preparing data for visualization.

Pivot Wider Block

The pivot wider block reshapes data from long to wide format using tidyr::pivot_wider(). This is the inverse of pivot longer, creating a summary table where row-column combinations become cells.

Select which column contains values for new column names (names_from) and which column contains cell values (values_from). The id_cols parameter specifies which columns identify each row. If empty, all columns not in names_from or values_from are used as identifiers.

The “Show advanced options” section provides names_prefix (adds a prefix to new column names) and values_fill (provides a value for missing combinations). This is useful for creating crosstabs, pivot tables, or comparing values across categories.

Separate Block

The separate block splits a single column into multiple columns using tidyr::separate(). Use this when a column contains combined values that should be in separate columns.

Select the column to separate and specify the names for the new columns (comma-separated). Enter the separator character or regular expression that divides the values.

The “Show advanced options” section provides remove (whether to remove the input column), convert (whether to convert new columns to appropriate types), and extra/fill options for handling rows with too many or too few pieces.

Unite Block

The unite block combines multiple columns into a single column using tidyr::unite(). This is the inverse of separate, joining values with a separator.

Select the columns to unite and specify the name for the new combined column. Enter the separator character to place between values (default is “_“).

The “Show advanced options” section provides the remove option (whether to remove the input columns after uniting) and na.rm (whether to remove NA values before uniting).

Building Data Pipelines

Blocks work together in pipelines. The output from one block becomes the input to the next. Each block shows a preview of the data at that stage.