dplyr Cheat Sheet: R Data Wrangling Functions & Examples

This dplyr cheat sheet covers the core functions for filtering, selecting, mutating, grouping, summarising, and joining data in R. Functions are grouped by task rather than alphabetically, and the examples use the built-in mtcars dataset so you can check your own results against the output shown.

Key Takeaways

dplyr is the tidyverse package for data wrangling in R — filtering, selecting, transforming, grouping, and joining data frames.
The pipe (%>% or base R's |>) chains functions left-to-right so you can read code as a sequence of steps instead of nested parentheses.
filter(), select(), mutate(), group_by(), and summarise() cover most everyday data wrangling tasks.
Joins (left_join(), inner_join(), full_join(), anti_join()) combine two data frames on a shared key — this is the section most cheat sheets skip and the one you'll need most for real projects.
Every example below runs against R's built-in mtcars dataset (the joins section uses two small hand-built tables instead), so you can paste the code into RStudio and compare your output with what is shown here.

Install and Load dplyr

Install once, load every session:

install.packages("tidyverse")   # installs dplyr + the rest of the tidyverse
# or, just dplyr on its own:
install.packages("dplyr")

library(dplyr)

If you're new to R or RStudio itself, start here: Comprehensive Guide: How to Install RStudio.

The Pipe Operator: %>% and |>

The pipe passes the result on its left into the first argument of the function on its right. It turns nested function calls into a readable top-to-bottom sequence.

mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))

mtcars |>
  filter(mpg > 25) |>
  select(mpg, hp) |>
  arrange(desc(mpg))

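|> is built into base R (4.1+, no package needed). %>% comes from the magrittr package and is re-exported by dplyr, so library(dplyr) makes it available. It has a few extra features, such as placing the piped object anywhere in the call with the . placeholder. Either is fine; pick one and stay consistent within a script.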
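To make the equivalence concrete, here is a small sketch that runs the same three steps as a nested call, with %>%, and with |>. The object names nested, piped, and native are only illustrative, and the native pipe assumes R 4.1 or later.

```r
library(dplyr)

# Nested call: has to be read from the inside out
nested <- arrange(select(filter(mtcars, mpg > 25), mpg, hp), desc(mpg))

# Same steps with the magrittr pipe, read top to bottom
piped <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))

# Same steps with the native pipe (R >= 4.1)
native <- mtcars |>
  filter(mpg > 25) |>
  select(mpg, hp) |>
  arrange(desc(mpg))

identical(nested, piped)   # TRUE
identical(piped, native)   # TRUE

# The "." placeholder lets %>% pipe into an argument other than the first
fit <- mtcars %>% lm(mpg ~ wt, data = .)
```

All three forms produce the same data frame; the pipe only changes how the code reads.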

Note! Apart from the joins section, which builds two small tables by hand, every example from here on uses R's built-in mtcars dataset (32 cars, no download needed), so you can run the code as shown and compare your output.

Row Functions: Filter, Arrange, Slice, Distinct

Row functions return a subset or reordering of rows. The columns stay the same; only which rows appear, or their order, changes.

Function	What it does
`filter()`	Keep rows matching a condition
`arrange()`	Reorder rows by column value(s)
`slice()`	Select rows by position
`distinct()`	Remove or extract duplicate rows
`slice_sample()`	Randomly sample rows (supersedes `sample_n()`)
`slice_max()` / `slice_min()`	Select the top/bottom N rows by a variable (supersede `top_n()`)

filter() — keep rows matching a condition

mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, hp)      # keep two columns so the output stays narrow

                mpg  hp
Fiat 128       32.4  66
Honda Civic    30.4  52
Toyota Corolla 33.9  65
Fiat X1-9      27.3  66
Porsche 914-2  26.0  91
Lotus Europa   30.4 113

Combine conditions with & (AND), | (OR), and %in% (value in a set):

mtcars %>% filter(mpg > 20 & cyl == 6)      # AND
mtcars %>% filter(cyl == 4 | cyl == 8)      # OR
mtcars %>% filter(cyl %in% c(4, 6))         # value in a set

arrange() — reorder rows

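mtcars %>%
  arrange(desc(mpg)) %>%
  select(mpg) %>%
  head(3)

                mpg
Toyota Corolla 33.9
Fiat 128       32.4
Honda Civic    30.4

Default is ascending; wrap the column in desc() for descending. Tied rows keep their original order, which is why Honda Civic appears ahead of Lotus Europa (both 30.4 mpg).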

distinct() — unique rows

mtcars %>% distinct(cyl)
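The table above also lists slice(), slice_max(), and slice_sample(), which have no example yet. The following is a minimal sketch of each on mtcars; the object names and the seed value are arbitrary choices made for this illustration.

```r
library(dplyr)

# Rows 1 to 3 by position
first_three <- mtcars %>% slice(1:3)

# The 3 most fuel-efficient cars; with_ties = FALSE caps the result at 3 rows
top_mpg <- mtcars %>%
  slice_max(mpg, n = 3, with_ties = FALSE) %>%
  select(mpg, cyl)

# A reproducible random sample of 5 rows
set.seed(42)  # arbitrary seed, used only so the sample can be repeated
sampled <- mtcars %>% slice_sample(n = 5)

# Unique combinations of two columns
combos <- mtcars %>% distinct(cyl, gear)
```

Without with_ties = FALSE, slice_max() returns every row tied for the last place, so the result can have more than n rows.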

Column Functions: Select, Rename, Mutate, Transmute

Column functions change which columns exist or create new ones — row count stays the same.

Function	What it does
`select()`	Keep, drop, or reorder columns
`rename()`	Rename a column, keep everything else
`relocate()`	Move a column's position
`mutate()`	Add or modify a column, keep the rest
`transmute()`	Add a column, drop everything else (superseded by `mutate(.keep = "none")` as of dplyr 1.1.0)

select() — choose columns

mtcars %>% select(mpg, hp, cyl)          # by name
mtcars %>% select(-cyl)                  # everything except cyl
mtcars %>% select(starts_with("m"))      # by prefix
mtcars %>% select(mpg:hp)                # range of columns
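rename() and relocate() from the table above follow the same pattern as select(). This is a short sketch; the new column name horsepower and the object names are made up for the example.

```r
library(dplyr)

# rename(): new_name = old_name, all other columns untouched
renamed <- mtcars %>%
  rename(horsepower = hp)

# relocate(): move hp in front of mpg, making it the first column
moved <- mtcars %>%
  relocate(hp, .before = mpg)

# select() with where(): pick columns by type instead of by name
numeric_cols <- mtcars %>%
  select(where(is.numeric))   # every column in mtcars is numeric
```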

mutate() — add a new column

mtcars %>%
  mutate(mpg_kpl = round(mpg * 0.425144, 2)) %>%   # 1 mpg (US) is about 0.425144 km/L
  select(mpg, mpg_kpl) %>%
  head(3)

               mpg mpg_kpl
Mazda RX4     21.0    8.93
Mazda RX4 Wag 21.0    8.93
Datsun 710    22.8    9.69
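mutate() can also create several columns in one call, and a later column can refer to one defined just before it. The sketch below assumes the mtcars wt column is in units of 1000 lbs (as documented in ?mtcars); the column names wt_kg and hp_per_tonne are invented for the example.

```r
library(dplyr)

result <- mtcars %>%
  mutate(
    wt_kg        = wt * 1000 * 0.4536,      # 1 lb is about 0.4536 kg
    hp_per_tonne = hp / (wt_kg / 1000),     # uses wt_kg created one line above
    .keep = "none"                          # keep only the newly created columns
  )

head(result, 3)
```

Dropping the .keep argument returns all original columns plus the two new ones.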

Grouping and Summarising: group_by, summarise, count

This is the "split-apply-combine" pattern: split the data into groups, apply a summary function to each, combine the results into one table.

group_by() + summarise()

mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = round(mean(mpg), 1),
    avg_hp  = round(mean(hp), 1),
    n       = n()
  )

  cyl avg_mpg avg_hp  n
1   4    26.7   82.6 11
2   6    19.7  122.3  7
3   8    15.1  209.2 14

Call ungroup() once you're done with grouped operations. By default summarise() drops only the last grouping level, and mutate() and filter() keep every grouping, so later steps in the pipeline can silently stay grouped.
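A small sketch of what "staying grouped" means in practice: the same mean(mpg) expression gives a per-group mean while the data is grouped and the overall mean after ungroup(). The column names mpg_vs_group and mpg_vs_all are made up for this illustration.

```r
library(dplyr)

# mutate() keeps the grouping, so mean(mpg) is computed per cylinder group
grouped <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_vs_group = mpg - mean(mpg))

is_grouped_df(grouped)      # TRUE: later verbs would still run per group

# After ungroup(), mean(mpg) is the mean over all 32 cars
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(mpg_vs_all = mpg - mean(mpg))

is_grouped_df(ungrouped)    # FALSE
```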

count() — quick group counts

mtcars %>% count(cyl)
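count() is shorthand for a group_by() followed by summarise(n = n()). The sketch below shows that equivalence along with two common variations; the object names are illustrative only.

```r
library(dplyr)

# Largest group first
by_cyl <- mtcars %>% count(cyl, sort = TRUE)

# One row per cyl/gear combination that occurs in the data
by_cyl_gear <- mtcars %>% count(cyl, gear)

# The long form that count(cyl) abbreviates
long_form <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n())
```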

Joining Two Data Frames

Joins combine two tables on a shared key column. This is the section most dplyr references skip — and the one you'll actually need once your data isn't already in one table.

cars <- tibble(
  id    = c(1, 2, 3, 4),
  model = c("Civic", "Corolla", "Model 3", "Mustang")
)

prices <- tibble(
  id    = c(1, 2, 3, 5),
  price = c(24000, 23500, 42000, 31000)
)

Join	Keeps	Code
`inner_join()`	Only rows with a match in both tables	`cars %>% inner_join(prices, by = "id")`
`left_join()`	All rows from the left table; unmatched right columns become NA	`cars %>% left_join(prices, by = "id")`
`right_join()`	All rows from the right table	`cars %>% right_join(prices, by = "id")`
`full_join()`	All rows from both tables	`cars %>% full_join(prices, by = "id")`
`anti_join()`	Rows in the left table with NO match in the right	`cars %>% anti_join(prices, by = "id")`

cars %>% left_join(prices, by = "id")

  id   model price
1  1   Civic 24000
2  2 Corolla 23500
3  3 Model 3 42000
4  4 Mustang    NA

Mustang (id 4) has no matching price, so left_join() keeps the row and fills price with NA. Swap to inner_join() and that row disappears entirely — that's the core distinction between the two.
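For comparison, here is a sketch of the other joins from the table, run on the same two tables (rebuilt here so the block runs on its own). The object names matched, unpriced, and everything are chosen only for this example.

```r
library(dplyr)

cars <- tibble(
  id    = c(1, 2, 3, 4),
  model = c("Civic", "Corolla", "Model 3", "Mustang")
)

prices <- tibble(
  id    = c(1, 2, 3, 5),
  price = c(24000, 23500, 42000, 31000)
)

# Only ids present in both tables: 1, 2, 3
matched <- cars %>% inner_join(prices, by = "id")

# Rows of cars with no price: id 4 (Mustang)
unpriced <- cars %>% anti_join(prices, by = "id")

# Every id from either table: 1 to 5, with NA where one side is missing
everything <- cars %>% full_join(prices, by = "id")
```

anti_join() is a quick way to find keys that failed to match before committing to a left or inner join.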

Full dplyr Function Reference

Grouped by task for quick lookup. Copy the function name into RStudio's help (?function_name) for full argument details.

Row functions

Function	Description
`filter()`	Subset rows matching a condition
`arrange()`	Reorder rows
`slice()`	Select rows by position
`distinct()`	Unique rows
`slice_sample()`	Random row sample
`slice_max()` / `slice_min()`	Top/bottom N rows by a variable
`row_number()`	Row index within current order

Column functions

Function	Description
`select()`	Keep/drop/reorder columns
`rename()`	Rename a column
`relocate()`	Move column position
`mutate()`	Add/modify a column
`transmute()`	Add a column, drop the rest (superseded by `mutate(.keep = "none")` as of dplyr 1.1.0)
`across()`	Apply one function across multiple columns
`if_else()` / `case_when()`	Conditional value assignment
`coalesce()`	First non-missing value across columns
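across(), case_when(), and coalesce() are listed above without an example elsewhere on this page, so here is a minimal sketch of each. The mpg thresholds (25 and 18) and the labels are arbitrary values chosen for illustration.

```r
library(dplyr)

# across(): apply one function to several columns at once
means <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), mean))

# case_when(): assign the label of the first condition that matches
labelled <- mtcars %>%
  mutate(efficiency = case_when(
    mpg > 25 ~ "high",
    mpg > 18 ~ "medium",
    TRUE     ~ "low"      # fallback for everything else
  ))

# coalesce(): first non-missing value, position by position
filled <- coalesce(c(NA, 2, NA), c(1, 5, NA), 0)
filled   # 1 2 0
```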

Grouping and summarising

Function	Description
`group_by()`	Group rows by one or more columns
`ungroup()`	Remove grouping
`summarise()`	Collapse each group to one summary row
`count()`	Count rows per group
`n()`	Count of rows in current group

Joins

Function	Description
`inner_join()`	Matching rows only
`left_join()`	All left rows
`right_join()`	All right rows
`full_join()`	All rows, both tables
`anti_join()`	Left rows with no match in right
`bind_rows()` / `bind_cols()`	Stack tables vertically/horizontally

Frequently Asked Questions

Is there a downloadable PDF version of this dplyr cheat sheet?

Not yet as a standalone PDF — this page is a living reference we update as dplyr changes, which a static PDF can't do. You can bookmark or print this page directly from your browser (Ctrl/Cmd + P) for an offline copy.

What is dplyr used for in R?

dplyr provides a consistent set of functions — filter, select, mutate, group_by, summarise, and join — for wrangling data frames in R. It's part of the tidyverse.

How do I select specific columns with dplyr?

Use select(): df %>% select(col1, col2) keeps only those columns. Use select(-col1) to drop a column instead, or helpers like starts_with("x") to select by pattern.

What's the difference between dplyr and base R for data manipulation?

Base R uses bracket notation (df[df$mpg > 25, ]) and nested function calls. dplyr uses named verbs (filter(), select()) chained with the pipe, which is generally more readable for multi-step transformations and is the standard used across the tidyverse.
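As a rough side-by-side, the sketch below performs the same filter, column selection, and sort in both styles. The object names base_result and dplyr_result are illustrative only.

```r
library(dplyr)

# Base R: brackets, with the data frame name repeated at each step
base_result <- mtcars[mtcars$mpg > 25, c("mpg", "hp")]
base_result <- base_result[order(-base_result$mpg), ]

# dplyr: one named verb per step
dplyr_result <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))

# Compare the two results
all.equal(base_result, dplyr_result)
```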

Is dplyr part of the tidyverse?

Yes — dplyr is one of the core tidyverse packages, alongside ggplot2, tidyr, and readr. Installing tidyverse installs dplyr automatically.

How do I install dplyr if install.packages() fails?

Confirm you have an active internet connection and try install.packages("dplyr") again. If it still fails, install the development version from GitHub: install.packages("pak"); pak::pak("tidyverse/dplyr").

What does the %>% pipe operator do in dplyr?

It passes the result on its left into the first argument of the function on its right, letting you chain multiple operations top-to-bottom instead of nesting function calls. Base R's built-in |> pipe works the same way without needing a package.

How do I join two data frames in dplyr?

Use left_join(), inner_join(), right_join(), or full_join() with a shared key column: df1 %>% left_join(df2, by = "id"). See the Joining Two Data Frames section above for what each one keeps.

RStudioDataLab