Key Takeaways
- The mutate function from the dplyr package allows you to create new variables or modify existing variables in a data frame or a tibble in R.
- The variants of mutate, such as mutate_all, mutate_at, and mutate_if, together with the across helper that supersedes them in dplyr 1.0.0 and later, allow you to apply functions to all variables, to selected variables, or to variables that meet a condition or match a pattern.
- The case_when function allows you to create new variables based on multiple conditions, using a logical expression and a value for each case.
- The glimpse function, the kable function from knitr, and the ggplot2 package allow you to display your data as tables and graphs, with some formatting options.
- The mutate function is useful for data analysis in R because it allows you to manipulate your data flexibly, consistently, and efficiently and works well with other dplyr functions and packages.
Hi, I’m Zubair Goraya, a data analyst with 5 years of experience. I love writing about data analysis in R. I will explain how to create new variables in R with dplyr in this article. dplyr is a package that provides functions for manipulating data frames and tibbles in R.
Mutate Syntax in R
Tibbles are a modern reimagining of data frames that are more consistent and convenient. dplyr functions are designed to be easy to use, fast, and consistent, and they follow the principle of “tidy data”, which means that each variable is a column, each observation is a row, and each value is a cell.
mutate is one of the main functions of dplyr, and it allows you to create new variables or modify existing variables in a data frame or a tibble. You can use mutate to perform various operations on your data, such as calculations, transformations, conditions, combinations, etc. mutate also works well with other dplyr functions, such as group_by, summarise, filter, arrange, etc.
Basics of mutate
How to install and load dplyr
To use dplyr, you need to install it first. You can do that by running the following code in R:
install.packages("dplyr")
Then, you need to load it into your R session. You can do that by running the following code in R:
library(dplyr)
How to create a data frame or a tibble
# Create a data frame with 10 rows and 4 columns
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Harry", "Ivy", "Jack"),
  age = c(25, 32, 28, 24, 27, 29, 31, 26, 30, 33),
  gender = c("F", "M", "M", "M", "F", "M", "F", "M", "F", "M"),
  score = c(85, 76, 92, 81, 88, 79, 94, 83, 90, 86)
)
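The heading mentions tibbles, but the example above builds a plain data frame. Here is a minimal sketch of the two usual ways to get a tibble, assuming dplyr is installed (it re-exports tibble and as_tibble). The object names df_small, tb, and tb_direct are made up for illustration, and the data is shortened to three rows.

```r
library(dplyr)

# Same columns as df, shortened to three rows for illustration
df_small <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 32, 28),
  gender = c("F", "M", "M"),
  score = c(85, 76, 92)
)

# Option 1: convert an existing data frame to a tibble
tb <- as_tibble(df_small)

# Option 2: build the tibble directly
tb_direct <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 32, 28),
  gender = c("F", "M", "M"),
  score = c(85, 76, 92)
)

class(tb)
```

Both objects work with mutate in the same way as a data frame.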
How to use mutate to create a new variable
To create a new variable with mutate, you need to use the following syntax:
mutate(data, new_variable = expression)
Where
- data is the name of the data frame or the tibble,
- new_variable is the new variable's name,
- expression is the formula or function that defines the new variable.
For example, if you want to create a new variable called grade based on the score variable, you can use the following criteria:
- If the score is greater than or equal to 90, then the grade is A
- If the score is between 80 and 89, then the grade is B
- If the score is between 70 and 79, then the grade is C
- If the score is less than 70, then the grade is D
# Create a new variable called grade with mutate
df %>%
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 & score < 90 ~ "B",
    score >= 70 & score < 80 ~ "C",
    score < 70 ~ "D"
  ))
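It will create a new variable called grade and print the result. The pipe operator (%>%) comes from the magrittr package and is re-exported by dplyr, so it is available as soon as dplyr is loaded. It allows you to chain multiple functions together without repeating the data frame name. Because the result is not assigned, df itself is unchanged; to keep the new variable, assign the result back to df with <-.
The case_when function allows you to create a new variable based on multiple conditions, using the tilde (~) to separate each condition from its value. The conditions are checked in order, and a row that matches none of them gets NA.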
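Because case_when checks its conditions in order, the upper bounds in the grade example are optional. The sketch below uses a small made-up data frame called scores (not the article's df) that includes a missing score, to show what happens to a row that matches no condition.

```r
library(dplyr)

# Hypothetical scores, including one missing value
scores <- data.frame(score = c(95, 82, 71, 65, NA))

graded <- scores %>%
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",  # only reached when score < 90
    score >= 70 ~ "C",  # only reached when score < 80
    score < 70 ~ "D"
  ))

graded$grade
# "A" "B" "C" "D" NA
```

The missing score matches none of the conditions, so its grade is NA rather than one of the four letters.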
How to use mutate to modify an existing variable
To modify an existing variable with mutate, you need to use the same syntax as creating a new variable but use the name of the existing variable instead of a new variable. For example, if you want to modify the age variable by adding 1 to each value, you can use the following code in R:
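# Modify the age variable with mutate
df %>%
  mutate(age = age + 1)
It will return a copy of the data frame in which every age is one year higher. The pipe operator does not change df itself; to keep the change, assign the result back with df <- df %>% mutate(age = age + 1).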
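A short sketch of the difference between printing a mutated copy and saving it. The data frame people is a made-up two-row stand-in for df.

```r
library(dplyr)

people <- data.frame(name = c("Alice", "Bob"), age = c(25, 32))

# Without assignment: the modified copy is printed and then discarded
people %>% mutate(age = age + 1)
people$age
# 25 32

# With assignment: the modified copy replaces the original
people <- people %>% mutate(age = age + 1)
people$age
# 26 33
```

The same applies to every mutate call in this article: the result has to be assigned if later steps should see it.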
How to use mutate to create multiple new variables
To create multiple new variables with mutate, you need to use the same syntax as creating a single new variable but separate each new variable with a comma. For example, if you want to create two new variables called height and weight, based on some random values, you can use the following code in R:
# Create two new variables called height and weight with mutate
df %>%
  mutate(
    height = runif(n = n(), min = 150, max = 200), # Generate a random number between 150 and 200 for each row
    weight = runif(n = n(), min = 50, max = 100)   # Generate a random number between 50 and 100 for each row
  )
It will create two new variables, height and weight, in the returned data frame; assign the result back to df if you want to keep them. The runif function generates random numbers drawn uniformly between a minimum and a maximum value, and the n function returns the number of rows in the current group, which is the whole data frame when the data is not grouped.
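Since runif draws new random numbers on every call, the height and weight values change each time the code runs. The sketch below shows one way to make them repeatable with set.seed. The seed value 42 and the data frame people are arbitrary choices for illustration.

```r
library(dplyr)

people <- data.frame(name = c("Alice", "Bob", "Charlie"))

# Fix the random number generator state before each draw
set.seed(42)
first <- people %>%
  mutate(height = runif(n(), min = 150, max = 200))

set.seed(42)
second <- people %>%
  mutate(height = runif(n(), min = 150, max = 200))

# Same seed, same draws
identical(first$height, second$height)
# TRUE
```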
How to use mutate to delete a variable
To delete a variable with mutate, you need to use the same syntax as creating a new variable but use NULL as the expression. For example, if you want to delete the score variable, you can use the following code in R:
# Delete the score variable with mutate
df %>%
  mutate(score = NULL)
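For comparison, dplyr's select function can drop a column as well. This sketch uses a made-up data frame called people and checks that both approaches leave the same columns.

```r
library(dplyr)

people <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 32),
  score = c(85, 76)
)

# Drop score by setting it to NULL inside mutate
via_mutate <- people %>% mutate(score = NULL)

# Drop score with a negative selection
via_select <- people %>% select(-score)

names(via_mutate)
# "name" "age"
```

Which one reads better is a matter of taste; select is the more common choice when removing columns is the only goal.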
Advanced features of mutate
How to use mutate_all to apply a function to all variables
To apply a function to all variables with mutate, you need to use the mutate_all function, a mutate variant. You need to use the following syntax:
mutate_all(data, function)
Where
- data is the name of the data frame or the tibble,
- function is the function name you want to apply to all variables.
# Round all the numeric variables with mutate_all
df %>% select_if(is.numeric) %>%
mutate_all(round)
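It will round all the numeric variables and return the result. Note that select_if(is.numeric) keeps only the numeric columns, so name and gender are dropped from the output, and df itself is unchanged unless you assign the result back to it. The round function allows you to round a number to the nearest integer or a specified number of decimal places.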
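In dplyr 1.0.0 and later, mutate_all, mutate_at, and mutate_if are superseded by the across helper used inside mutate. A minimal sketch of the rounding example in that style follows, assuming dplyr 1.0.0 or newer. The data frame measurements is made up for illustration.

```r
library(dplyr)

measurements <- data.frame(
  name = c("Alice", "Bob"),
  height = c(170.46, 182.91),
  weight = c(60.25, 81.78)
)

# Round every numeric column and leave the other columns in place
rounded <- measurements %>%
  mutate(across(where(is.numeric), round))

rounded
# name stays as it is; height becomes 170 and 183, weight becomes 60 and 82
```

Unlike the select_if version, this keeps the non-numeric name column in the result.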
How to use mutate_at to apply a function to selected variables
To apply a function to selected variables with mutate, you need to use the mutate_at function, another mutate variant. You need to use the following syntax:
mutate_at(data, vars, function)
Where
- data is the name of the data frame or the tibble,
- vars is a vector of variable names or positions you want to select,
- function is the name of the function you wish to apply to the selected variables.
# Convert the gender variable to uppercase with mutate_at
df %>%
  mutate_at(vars(gender), toupper)
It will convert the gender variable to uppercase in the returned data frame; df itself only changes if you assign the result back to it. The toupper function allows you to convert a character string to uppercase. You can also use the vars function to select variables by name or by using helper functions, such as starts_with, ends_with, contains, matches, etc.
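The same column selection can be written with across inside mutate, either by naming the column or with a selection helper. This is a sketch assuming dplyr 1.0.0 or newer. The data frame people uses lowercase values so that the effect of toupper is visible.

```r
library(dplyr)

people <- data.frame(
  name = c("alice", "bob"),
  gender = c("f", "m")
)

# Select the column by name
by_name <- people %>%
  mutate(across(gender, toupper))

# Select the column with a helper
by_helper <- people %>%
  mutate(across(starts_with("gen"), toupper))

by_name$gender
# "F" "M"
```

Only the selected column changes; name keeps its lowercase values in both results.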
How to use mutate_if to apply a function to variables that meet a condition
To apply a function to variables that meet a condition with mutate, you need to use the mutate_if function, another mutate variant. You need to use the following syntax:
mutate_if(data, predicate, function)
Where
- data is the name of the data frame or the tibble,
- predicate is a function that returns TRUE or FALSE for each variable and so defines the condition,
- function is the function name you want to apply to the variables that meet the condition.
# Convert all the character variables to lowercase with mutate_if
df %>%
  mutate_if(is.character, tolower)
You can also use other functions to define the predicate, such as is.numeric, is.factor, is.logical, etc.
How to use mutate with group_by functions
To perform more complex data manipulation tasks, you can mutate with other dplyr functions, such as group_by, summarise, filter, arrange, etc. For example, if you want to create a new variable called rank, which shows the rank of each person based on their score within each gender group, you can use the following code in R:
# Create a new variable called rank with mutate and other dplyr functions
df %>%
  group_by(gender) %>%            # Group the data by gender
  mutate(rank = rank(-score)) %>% # Rank each person by score within each gender group
  ungroup()                       # Ungroup the data
It will create a new variable called rank in the returned data frame; assign the result back to df to keep it. The group_by function allows you to group the data by one or more variables, and the ungroup function removes the grouping.
The rank function ranks the values of a variable, and the minus sign (-) reverses the order so that the highest score gets rank 1. Tied scores receive the average of their ranks.
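When scores are tied, the base rank function returns fractional ranks by default. The sketch below compares it with dplyr's min_rank and dense_rank on a made-up data frame called results that contains a tie. The column names rank_avg, rank_min, and rank_dense are chosen only for this illustration.

```r
library(dplyr)

results <- data.frame(
  name = c("Alice", "Eve", "Grace", "Bob", "Charlie"),
  gender = c("F", "F", "F", "M", "M"),
  score = c(94, 94, 88, 76, 92)
)

ranked <- results %>%
  group_by(gender) %>%
  mutate(
    rank_avg = rank(-score),               # ties share the average rank
    rank_min = min_rank(desc(score)),      # ties share the lowest rank, then a gap
    rank_dense = dense_rank(desc(score))   # ties share a rank, no gap afterwards
  ) %>%
  ungroup()

ranked$rank_avg
# 1.5 1.5 3.0 2.0 1.0
ranked$rank_min
# 1 1 3 2 1
ranked$rank_dense
# 1 1 2 2 1
```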
Examples and code snippets
How to create a new variable based on a condition
You can use the case_when function to create a new variable based on a condition, as shown in the previous example of creating the grade variable. Here is another example of creating a new variable called pass, which shows whether the person passed or failed the test based on the score variable, using the following criteria:
- If the score is greater than or equal to 80, then pass is “Yes”
- If the score is less than 80, then pass is “No”
# Create a new variable called pass with case_when
df %>%
  mutate(pass = case_when(
    score >= 80 ~ "Yes",
    score < 80 ~ "No"
  ))
It will create a new variable called pass that holds “Yes” or “No” for each row.
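With only two outcomes, dplyr's if_else function is a shorter alternative to case_when. A minimal sketch on a made-up data frame called scores:

```r
library(dplyr)

scores <- data.frame(
  name = c("Alice", "Bob", "Ivy"),
  score = c(85, 76, 80)
)

# One condition, one value when it is TRUE, one value when it is FALSE
with_pass <- scores %>%
  mutate(pass = if_else(score >= 80, "Yes", "No"))

with_pass$pass
# "Yes" "No" "Yes"
```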
How to create a new variable based on a calculation
You can use any arithmetic or mathematical operators or functions to create a new variable based on a calculation, as shown in the previous example of modifying the age variable. Here is another example of creating a new variable called bmi, which shows the body mass index of each person based on the height and weight variables. BMI is the weight in kilograms divided by the square of the height in meters, so the height in centimeters is divided by 100 first:
# Create a new variable called bmi with a calculation
# Set seed for reproducibility
set.seed(123)
# Generate dummy data
num_rows <- 100
weights <- runif(num_rows, min = 50, max = 100)  # Generating random weight values
heights <- runif(num_rows, min = 150, max = 190) # Generating random height values
# Create a data frame with the generated data
df1 <- tibble(weight = weights, height = heights)
# Use the provided code to calculate BMI and create a new column bmi
df1 %>%
  mutate(bmi = weight / (height / 100) ^ 2)
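It will create a new variable called bmi in the returned copy of df1; assign the result back to df1 if you want to keep it. The ^ operator raises a number to a power, and it is evaluated before the division, so the expression divides the weight by the squared height in meters.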
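To check the formula by hand, here is a small worked sketch for one hypothetical person who weighs 70 kg and is 175 cm tall. The intermediate column height_m is added only to make the unit conversion explicit.

```r
library(dplyr)

# One hypothetical person: 70 kg and 175 cm
person <- tibble(weight = 70, height = 175)

with_bmi <- person %>%
  mutate(
    height_m = height / 100,    # 175 cm is 1.75 m
    bmi = weight / height_m^2   # 70 / 3.0625, about 22.86
  )

round(with_bmi$bmi, 2)
# 22.86
```

A later column in the same mutate call can use a column created earlier in that call, which is why bmi can refer to height_m.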
How to create a new variable based on a transformation
You can use any transformation functions, such as log, exp, sqrt, etc., to create a new variable based on a transformation, as shown in the previous example of rounding all the numeric variables. Here is another example of creating a new variable called log_score, which shows the natural logarithm of the score variable using the following code in R:
# Create a new variable called log_score with a transformation
df %>%
  mutate(log_score = log(score))
It will create a new variable called log_score in the returned data frame. The log function calculates the natural logarithm of a number by default; its base argument selects a different base.
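A brief sketch of the other logarithm variants in base R, used inside mutate. The data frame values holds made-up scores of 1, 10, and 100 so that the results are easy to verify by hand.

```r
library(dplyr)

values <- data.frame(score = c(1, 10, 100))

logged <- values %>%
  mutate(
    log_score = log(score),              # natural logarithm
    log10_score = log10(score),          # base 10
    log2_score = log(score, base = 2),   # any other base via the base argument
    back = exp(log_score)                # exp undoes the natural logarithm
  )

logged$log10_score
# 0 1 2
```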
How to create a new variable based on a combination of other variables
You can use any operators or functions that combine or concatenate other variables to create a new variable, as shown in the previous example of creating the rank variable. Here is another example of creating a new variable called id, which provides a unique identifier for each person based on the name and age variables, using the following code in R:
# Create a new variable called id with a combination of other variables
df %>%
  mutate(id = paste0(name, "_", age))
It will create a new variable called id in the returned data frame; assign the result back to df to keep it. The paste0 function concatenates character strings without adding a separator of its own, so the underscore (_) is passed in as an extra string between name and age.
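The sketch below shows the same identifier built with paste and its sep argument, plus a simple running number from dplyr's row_number function. The data frame people and the column names id_sep and row_id are made up for illustration.

```r
library(dplyr)

people <- data.frame(name = c("Alice", "Bob"), age = c(25, 32))

with_id <- people %>%
  mutate(
    id = paste0(name, "_", age),            # separator passed as a string
    id_sep = paste(name, age, sep = "_"),   # separator passed via sep
    row_id = row_number()                   # 1, 2, ... in row order
  )

with_id$id
# "Alice_25" "Bob_32"
```

A combined name and age is only unique as long as no two people share both values; row_number avoids that assumption.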
Visuals and tables
How to use the glimpse function to see the structure of a data frame
You can use the glimpse function to see the structure of a data frame or a tibble, such as the number of rows and columns and the name and type of each variable. You need to use the following syntax:
glimpse(data)
Where data is the name of the data frame or the tibble. For example, if you want to see the structure of the df data frame, you can use the following code in R:
# See the structure of the df data frame with glimpse
glimpse(df)
It will show the number of rows and columns, followed by each variable's name, type, and first few values.
How to use the kable function to display a data frame as a table
You can use the kable function from the knitr package to display a data frame or a tibble as a table, with some formatting options. You need to use the following syntax:
kable(data, format, caption, align, col.names, row.names, digits, etc.)
Where
- data is the name of the data frame or the tibble,
- format is the output format, such as “markdown”, “html”, “latex”, etc.,
- caption is the title of the table,
- align is the alignment of the columns, such as “l” for left, “r” for right, “c” for center, etc.,
- col.names is the vector of column names,
- row.names is a logical value that controls whether row names are included,
- digits is the number of decimal places to show, etc.
# Display the df data frame as a markdown table with kable
library(knitr) # Load the knitr package
kable(df, format = "markdown", caption = "The df data frame", align = "c", digits = 2)
The df data frame will be displayed as a markdown table, with the specified options.
How to use the ggplot2 package to create graphs from a data frame
With some customization options, you can use the ggplot2 package to create graphs from a data frame or a tibble. You need to use the following syntax:
ggplot(data, aes(x, y, color, fill, etc.)) +
  geom_point, geom_line, geom_bar, etc. +
  labs(title, x, y, etc.) +
  theme, scale, etc.
Where
- data is the name of the data frame or the tibble,
- aes is the aesthetic mapping that defines the variables to plot, such as x, y, color, fill, etc.,
- geom_point, geom_line, geom_bar, etc. are the geometric objects that define the type of plot, such as point, line, bar, etc.,
- labs defines the labels for the title, x-axis, y-axis, etc.,
- theme, scale, etc. control the appearance of the plot, such as the overall theme and the axis and color scales.
# Add height and weight to df, taking one row of df1 (created in the BMI example) per person
df <- cbind(df, df1[seq_len(nrow(df)), ])
# Create a scatter plot of the height and weight variables with ggplot2
library(ggplot2) # Load the ggplot2 package
ggplot(df, aes(x = height, y = weight, color = gender, shape = gender, size = score)) + # Define the data and the aesthetic mapping
  geom_point() + # Define the geometric object as point
  labs(title = "Height vs Weight Scatter Plot", x = "Height (cm)", y = "Weight (kg)") + # Define the labels for the title and the axes
  theme_bw() # Define the theme as black and white
Conclusion
In this article, I have introduced the mutate function from the dplyr package. You have learned how to use mutate to create new variables or modify existing ones in a data frame or tibble. You have also learned how to use the variants of mutate, such as mutate_all, mutate_at, and mutate_if. Finally, you have learned how to use visuals and tables to show the input and output of mutate.
To use mutate, follow the steps and syntax in this article. You can also refer to the examples and code snippets. Explore the documentation and vignettes of the dplyr package to learn more about mutate and its variants. Practice using mutate with your own data sets.
Thank you for reading this article. I hope you have enjoyed learning to use mutate to create new variables in R with dplyr.
If you have any questions, comments, or feedback, please leave them below.
Further Reads
- R for Data Science by Hadley Wickham and Garrett Grolemund. This book covers the basics of data transformation with dplyr, including how to use mutate and its variants, with examples and exercises.
- Data Manipulation with dplyr by Hadley Wickham. This vignette provides an overview of the dplyr package, its philosophy, and its main functions, such as mutate, group_by, summarise, etc., with examples and code snippets.
- Introduction to dplyr by Bradley Boehmke. This tutorial introduces the dplyr package, its grammar, and its functions, such as mutate, filter, arrange, etc., with examples and interactive exercises.
- dplyr Cheat Sheet by RStudio. This cheat sheet summarizes the most common and useful functions and options of the dplyr package, such as mutate, select, rename, etc., with examples and diagrams.
Frequently Asked Questions (FAQs)
How do I create a new variable in R?
You can use the case_when function inside mutate to create a new variable based on one or more conditions in R. You need to use the following syntax:
mutate(data, name = case_when(condition1 ~ value1, condition2 ~ value2, etc.))
where data is the name of the data frame or the tibble, name is the name of the new variable, condition1, condition2, etc. are the logical expressions that define the conditions, and value1, value2, etc. are the values for each case.
For example
mutate(df, grade = case_when(score >= 90 ~ "A", score >= 80 ~ "B", score >= 70 ~ "C", TRUE ~ "D"))
What command will create new variables with functions of existing variables using dplyr?
You can use any arithmetic or mathematical operators or functions inside mutate to create new variables with functions of existing variables using dplyr. You need to use the existing variables as arguments for the functions.
For example
mutate(df, log_score = log(score))
What does %>% mean in R?
The %>% operator, or pipe operator, means “then” in R.
It allows you to chain multiple functions together without nesting them or creating intermediate objects. It passes the output of the left-hand side as the first argument of the right-hand side.
For example
df %>% mutate(age = age + 1) %>% filter(age > 30)
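To make the "first argument" rule concrete, this sketch writes the same two steps once with the pipe and once as nested calls. The data frame people is a made-up three-row example.

```r
library(dplyr)

people <- data.frame(
  name = c("Alice", "Bob", "Ivy"),
  age = c(25, 32, 30)
)

# Piped: read top to bottom as "take people, then mutate, then filter"
piped <- people %>%
  mutate(age = age + 1) %>%
  filter(age > 30)

# Nested: the same calls, read from the inside out
nested <- filter(mutate(people, age = age + 1), age > 30)

piped$name
# "Bob" "Ivy"
```

Both versions return the same rows; the pipe only changes how the code reads.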
What is the use of dplyr in R?
The dplyr package is a powerful and user-friendly tool for data manipulation in R. It provides a consistent and intuitive set of functions to perform common data manipulation tasks, such as selecting, filtering, grouping, summarizing, arranging, joining, and mutating data.
It also works well with other packages, such as tidyr, ggplot2, and knitr, to enable a tidy and reproducible data analysis workflow in R.
How do you create a variable?
You can create a variable by assigning a value to a name, using the assignment operator (<- or =) in R. For example
x <- 10
How to add two variables in R?
You can add two variables in R using the addition operator (+).
For example
y <- x + 5
How do you create a new variable in the data step?
The data step is a part of the SAS programming language, which differs from R. However, you can create a new variable in the data step by using the assignment statement, similar to R.
For example, data new; set old; z = x + y; run; will create a new data set called new, based on the old data set, and create a new variable called z, with the value of x plus y.
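For readers coming from SAS, a rough R counterpart of that data step can be sketched with mutate. The names old_data and new_data are hypothetical stand-ins for the SAS data sets old and new.

```r
library(dplyr)

# Stand-in for the SAS data set "old"
old_data <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Counterpart of: data new; set old; z = x + y; run;
new_data <- old_data %>%
  mutate(z = x + y)

new_data$z
# 11 22 33
```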
What function helps create new variables in R?
As explained above, the mutate function from the dplyr package helps create new variables in R. You can also use other functions, such as case_when, paste, log, round, etc., inside mutate to create new variables based on conditions, combinations, transformations, etc.
What is the command used to create a new variable?
The command to create a new variable depends on the programming language and the package you are using. In R, you can use the assignment operator (<- or =) or the mutate function from the dplyr package to create a new variable, as explained above.
How do I rename a variable in R using dplyr?
You can rename a variable in R using dplyr by using the rename function. You need to use the following syntax:
rename(data, new_name = old_name)
For example
rename(df, score_new = score)
To add a new variable rather than rename an existing one, you can use the mutate function from the dplyr package. You need to use the following syntax:
mutate(data, name = value)
For example
mutate(df, score = 100 * runif(n()))
Which dplyr operation is used to add new variables to a data set?
The mutate function is the dplyr operation that adds new variables to a data set. You can also use the variants of mutate, such as mutate_all, mutate_at, and mutate_if, or the across helper inside mutate, to apply functions to all variables, to selected variables, or to variables that meet a condition or match a pattern.