## Key points

- Data normalization is the process of transforming the values of a variable, or a set of variables, so that they share a standard scale or range.
- Different ways to normalize data in R include **z-score normalization, min-max normalization, range normalization, decimal scaling, and max_scale normalization**:
  - **Z-score normalization** transforms each value by subtracting the variable's mean and dividing by its standard deviation. The result is a new variable with a mean of zero and a standard deviation of one.
  - **Min-max normalization** transforms each value by subtracting the variable's minimum and dividing by its range (maximum - minimum). The result is a new variable with a minimum of zero and a maximum of one.
  - **Range normalization** transforms each value by dividing by the variable's range (maximum - minimum). The result is a new variable that has a range of one.
  - **Decimal scaling** transforms each value by dividing by a power of 10 that is equal to or larger than the maximum absolute value of the variable. The result is a new variable whose values lie between -1 and 1.
  - **Max_scale normalization** transforms each value by multiplying by a chosen maximum value and dividing by the actual maximum value of the variable. The result is a new variable whose maximum is the chosen value.
- Normalization can *reduce the effects of outliers*, *improve the performance of statistical tests and machine learning algorithms*, and make the data more comparable and interpretable.
- *Normalization is not a one-size-fits-all solution*: we need to choose the appropriate method based on the type and distribution of the data, the purpose of the analysis, and the desired outcome.

## Data Normalization in R: A Comprehensive Guide

Data normalization is the process of transforming the values of a variable, or a set of variables, so that they share a standard scale or range. Normalization can reduce the effects of outliers, improve the performance of statistical tests and machine learning algorithms, and make the data more comparable and interpretable.

There are different ways to normalize data in R, depending on the type and distribution of the data, the purpose of the analysis, and the desired outcome.

In this article, we will cover some of the most common methods and scenarios for data normalization in R. We will use some examples from the Data analysis website, which provides tutorials related to RStudio, a popular integrated development environment for R.

We will also use some packages from CRAN, the Comprehensive R Archive Network, which hosts thousands of packages for various purposes.

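```
# Install any required packages that are not already installed
pkgs <- c("car", "dummy", "dplyr")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) install.packages(to_install)
# Load packages (dplyr is used later to transform all columns at once)
library(car)
library(dummy)
library(dplyr)
```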

### How to normalize data in R with a dummy data set

One way to learn how to normalize data in R is to practice on a small example data set with known properties and characteristics. For example, we can use the mtcars data set that ships with the base R installation.

This data set contains information about 32 cars from the 1974 Motor Trend magazine, such as miles per gallon (mpg), number of cylinders (cyl), displacement (disp), horsepower (hp), weight (wt), and so on. To see the structure and summary of the mtcars data set, we can use the following commands:

```
# See the structure of mtcars
str(mtcars)
# See the summary of mtcars
summary(mtcars)
```

The output shows that the mtcars data set has 32 observations (rows) and 11 variables (columns), all of which are stored as numeric. Two of them are binary indicators coded 0/1: vs (engine shape: 0 = V-shaped, 1 = straight) and am (transmission: 0 = automatic, 1 = manual). The summary also shows descriptive statistics for each variable, such as the minimum, maximum, mean, median, and quartiles.

We can see that the variables have different scales and ranges. For example, mpg ranges from 10.4 to 33.9, while disp ranges from 71.1 to 472.0. This means that some variables may carry more weight or variance than others when we run statistical tests or machine learning algorithms. To avoid this problem, we can normalize the data so that all variables have a common scale or range.

One of the most common methods for normalizing data is **z-score normalization or standardization**.

This method transforms each variable value by subtracting its mean and dividing it by its standard deviation. The result is a new variable with a mean of zero and a standard deviation of one. Each value represents how many standard deviations it is away from the mean.

To perform z-score normalization in R, we can use the scale() function. This function takes a numeric vector or matrix as input and returns a scaled version.

For example, we can apply z-score normalization to the mpg variable of the mtcars data set as follows:

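```
# Z-score normalize the mpg variable
# (scale() returns a one-column matrix, so as.numeric() converts it back to a vector)
mtcars$mpg_z <- as.numeric(scale(mtcars$mpg))
# See the first six values of the normalized mpg variable
head(mtcars$mpg_z)
```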

The output shows that the mpg variable has been transformed into a new variable called mpg_z, with a mean of zero and a standard deviation of one.

We can verify this by using the mean() and sd() functions:

```
# Check the mean and standard deviation of the normalized mpg variable
mean(mtcars$mpg_z)
sd(mtcars$mpg_z)
```

The output confirms that the normalized mpg variable's mean and standard deviation are close to zero and one, respectively.
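The same transformation can be written out by hand, which makes the formula explicit. The following is a minimal sketch on the built-in mtcars data; the helper name `z_score` is an assumption made for illustration and is not part of base R.

```r
# Hypothetical helper: z-score normalization written out by hand
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

mpg_manual <- z_score(datasets::mtcars$mpg)
mpg_scaled <- as.numeric(scale(datasets::mtcars$mpg))

# Both approaches should agree up to floating-point error
all.equal(mpg_manual, mpg_scaled)
# The mean is zero (up to rounding) and the standard deviation is one
round(mean(mpg_manual), 10)
sd(mpg_manual)
```

Writing the formula by hand is mainly useful for understanding; in practice scale() does the same job and also handles matrices and data frames.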

We can also apply **z-score normalization to all variables** of the mtcars data set at once by passing the entire data frame to the scale() function. This returns a matrix of scaled values, which we can convert to a data frame using the as.data.frame() function. For example:

```
# Z-score normalize all variables of mtcars
# (datasets::mtcars is the untouched original, without helper columns such as mpg_z)
mtcars_z <- as.data.frame(scale(datasets::mtcars))
# See the structure and summary of the normalized mtcars data frame
str(mtcars_z)
summary(mtcars_z)
```

The output shows that the mtcars_z data frame has the same dimensions and variable names as the original mtcars data frame, but all variables have been transformed to have a mean of zero and a standard deviation of one.

We can also visualize the effect of z-score normalization on the data by using some plots. For example, we can use the hist() function to create histograms for the original and normalized mpg variables:

```
# Create histograms for original and normalized mpg variables
par(mfrow = c(1, 2)) # Set up plot layout
hist(mtcars$mpg, main = "Original mpg", xlab = "Miles per gallon")
hist(mtcars_z$mpg, main = "Normalized mpg", xlab = "Z-scores")
```

We can also use the boxplot() function to create boxplots for the original and normalized variables of the mtcars data set:

```
# Create boxplots for original and normalized variables of mtcars
par(mfrow = c(2, 1)) # Set up plot layout
boxplot(datasets::mtcars, main = "Original mtcars", las = 2) # Untouched original (no helper columns); las = 2 rotates labels by 90 degrees
boxplot(mtcars_z, main = "Normalized mtcars", las = 2) # Rotate labels by 90 degrees
```

### How to normalize data by min-max normalization with the R scale() function

Another method for normalizing data is min-max normalization or rescaling. This method transforms each variable value by *subtracting its minimum and dividing by its range*. The result is a new variable with a minimum of zero and a maximum of one. Each value represents its distance from the minimum relative to the range.

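To perform min-max normalization in R, we can use the base R scale() function again, this time with explicit values for its two optional arguments, center and scale.

- The center argument controls centering. If it is TRUE (the default), the column mean is subtracted from each value; if a number is supplied, that number is subtracted instead.
- The scale argument controls scaling. If it is TRUE (the default, together with center = TRUE), each centered value is divided by the column standard deviation; if a number is supplied, each value is divided by that number instead.

Passing the minimum as center and the range (maximum - minimum) as scale therefore yields min-max normalization.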

For example, we can apply min-max normalization to the mpg variable of the mtcars data set as follows:

```
# Min-max normalize the mpg variable
# (as.numeric() converts the one-column matrix returned by scale() back to a vector)
mtcars$mpg_mm <- as.numeric(scale(mtcars$mpg, center = min(mtcars$mpg), scale = max(mtcars$mpg) - min(mtcars$mpg)))
# See the first six rows of the normalized mpg variable
head(mtcars$mpg_mm)
```

The output shows that the mpg variable has been transformed into a new variable called **mpg_mm**, which has a minimum of zero and a maximum of one.

We can verify this by using the min() and max() functions:

```
# Check the minimum and maximum of the normalized mpg variable
min(mtcars$mpg_mm)
max(mtcars$mpg_mm)
```

The output confirms that the minimum and maximum of the normalized mpg variable are zero and one, respectively.
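When the same rescaling is needed in several places, it can be convenient to wrap the formula in a small function. The sketch below is one possible way to do that; the name `min_max` and the choice to return zeros for a constant vector are assumptions made for illustration, not an established convention.

```r
# Hypothetical helper: rescale a numeric vector to the interval [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant vector has zero range; return zeros instead of dividing by zero
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

hp_mm <- min_max(datasets::mtcars$hp)
range(hp_mm)          # minimum 0 and maximum 1
min_max(c(5, 5, 5))   # constant input: all zeros
```

The guard matters because a column with a single repeated value would otherwise produce NaN for every row.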

We can also apply min-max normalization to all variables of the mtcars data set at once by passing the entire data frame to the scale() function, along with per-column vectors for centering and scaling. For example:

```
# Min-max normalize all variables of mtcars
# (datasets::mtcars is the untouched original, without helper columns such as mpg_mm)
col_min <- apply(datasets::mtcars, 2, min)
col_max <- apply(datasets::mtcars, 2, max)
mtcars_mm <- as.data.frame(scale(datasets::mtcars, center = col_min, scale = col_max - col_min))
# See the structure and summary of the normalized mtcars data frame
str(mtcars_mm)
summary(mtcars_mm)
```

The output shows that the mtcars_mm data frame has the same dimensions and variable names as the original mtcars data frame, but all variables have been transformed to have a minimum of zero and a maximum of one.

Using some plots, we can also visualize the effect of min-max normalization on the data. For example, we can use the hist() function to create histograms for the original and normalized mpg variables:

```
# Create histograms for original and normalized mpg variables
par(mfrow = c(1, 2)) # Set up plot layout
hist(mtcars$mpg, main = "Original mpg", xlab = "Miles per gallon")
hist(mtcars_mm$mpg, main = "Normalized mpg", xlab = "Min-max values")
```

The histograms show that min-max normalization does not change the shape of the distribution; it only rescales the values to the range from zero to one. The original mpg variable ranges from 10.4 to 33.9, while the normalized mpg variable ranges from 0 to 1. The mean of the original mpg variable is 20.09, while the mean of the normalized mpg variable is 0.41, that is, (20.09 - 10.4) / (33.9 - 10.4).

We can also use the boxplot() function to create boxplots for the original and normalized variables of the mtcars data set:

```
# Create boxplots for original and normalized variables of mtcars
par(mfrow = c(2, 1)) # Set up plot layout
boxplot(datasets::mtcars, main = "Original mtcars", las = 2) # Untouched original (no helper columns); las = 2 rotates labels by 90 degrees
boxplot(mtcars_mm, main = "Normalized mtcars", las = 2) # Rotate labels by 90 degrees
```

### How to normalize data in R by range normalization

Another method for normalizing data is **range normalization or unitization**. This method transforms each variable value by dividing it by its range. The result is a new variable that has a range of one. Each value represents how far it is from zero relative to the range.

**Range normalization** needs no special package in R: base arithmetic is enough. (The recode() function from the car package recodes values according to rules; it is not designed for this kind of rescaling.) For example, we can apply range normalization to the mpg variable of the mtcars data set as follows:

```
# Range normalize the mpg variable: divide by the range (the minimum is not subtracted)
mtcars$mpg_rn <- mtcars$mpg / (max(mtcars$mpg) - min(mtcars$mpg))
# See the first six rows of the normalized mpg variable
head(mtcars$mpg_rn)
```

The output shows that the mpg variable has been transformed into a new variable called mpg_rn, which has a range of one.

We can verify this by taking the difference between the largest and smallest values returned by the range() function:

```
# Check the range (maximum minus minimum) of the normalized mpg variable
diff(range(mtcars$mpg_rn))
```

The output confirms that the range of the normalized mpg variable is one.
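Range normalization differs from min-max normalization only in that the minimum is not subtracted first. A small sketch on a made-up vector `x` (an assumption for illustration) shows the difference side by side:

```r
# Made-up values for illustration
x <- c(10, 20, 40)

# Range normalization: divide by the range only
x_range <- x / (max(x) - min(x))
# Min-max normalization: subtract the minimum first, then divide by the range
x_minmax <- (x - min(x)) / (max(x) - min(x))

x_range    # 0.3333333 0.6666667 1.3333333
x_minmax   # 0.0000000 0.3333333 1.0000000
# Both results span exactly one unit
diff(range(x_range))
diff(range(x_minmax))
```

Both results have a range of one, but only the min-max version is anchored at zero.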

We can also apply range normalization to all variables of the mtcars data set at once with the mutate() and across() functions from the dplyr package. For example:

```
# Range normalize all variables of mtcars
# (datasets::mtcars is the untouched original, without helper columns such as mpg_rn)
library(dplyr)
mtcars_rn <- datasets::mtcars %>%
  mutate(across(everything(), ~ .x / (max(.x) - min(.x))))
# See the structure and summary of the normalized mtcars data frame
str(mtcars_rn)
summary(mtcars_rn)
```

The output shows that the mtcars_rn data frame has the same dimensions and variable names as the original mtcars data frame, but all variables have been transformed to have a range of one.

We can also visualize the effect of range normalization on the data by using some plots. For example, we can use the hist() function to create histograms for the original and normalized mpg variables:

```
# Create histograms for original and normalized mpg variables
par(mfrow = c(1, 2)) # Set up plot layout
hist(mtcars$mpg, main = "Original mpg", xlab = "Miles per gallon")
hist(mtcars_rn$mpg, main = "Normalized mpg", xlab = "Range values")
```

We can also use the boxplot() function to create boxplots for the original and normalized variables of the mtcars data set:

```
# Create boxplots for original and normalized variables of mtcars
par(mfrow = c(2, 1)) # Set up plot layout
boxplot(datasets::mtcars, main = "Original mtcars", las = 2) # Untouched original (no helper columns); las = 2 rotates labels by 90 degrees
boxplot(mtcars_rn, main = "Normalized mtcars", las = 2) # Rotate labels by 90 degrees
```

### How to normalize data in R that has positive and negative values

Sometimes, we may encounter positive and negative data, such as temperature, elevation, or profit.

Normalizing such data can be tricky, as some methods do not preserve the sign or magnitude of the values. For example, z-score normalization turns every value below the mean into a negative number, even when all of the original values were positive. Min-max normalization maps the most negative value to zero, so a value with a large absolute value can end up close to zero.

If the sign matters, we need sign-preserving and magnitude-preserving methods to normalize data with positive and negative values. Sign-preserving means that the sign of each value remains unchanged after normalization. Magnitude-preserving means that the relative order of the values and the ratios between them remain unchanged after normalization.

One method that is sign-preserving and magnitude-preserving is decimal scaling. This method transforms each variable value by dividing by a power of 10 that is equal to or larger than the maximum absolute value of the variable. The result is a new variable whose values lie between -1 and 1. Because every value is divided by the same positive constant, signs and relative magnitudes are unchanged.

To perform decimal scaling in R, we can use a custom function that takes a numeric vector or matrix as input and returns a scaled version based on decimal scaling.

For example:

```
# Define a function for decimal scaling
decimal_scale <- function(x) {
  # Find the maximum absolute value of x, ignoring missing values
  max_abs <- max(abs(x), na.rm = TRUE)
  # Nothing to scale if every value is zero (log10(0) is -Inf)
  if (max_abs == 0) return(x)
  # Find the smallest power of 10 that is equal to or larger than max_abs
  power <- ceiling(log10(max_abs))
  # Divide x by 10^power
  x / (10^power)
}
# Apply decimal scaling to the mpg variable of mtcars
mtcars$mpg_ds <- decimal_scale(mtcars$mpg)
# See the first six values of the scaled mpg variable
head(mtcars$mpg_ds)
```

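The output shows that the mpg variable has been transformed into a new variable called mpg_ds. Because the largest mpg value is 33.9, every value is divided by 10^2 = 100: the first six values are 0.210, 0.210, 0.228, 0.214, 0.187, and 0.181, and the whole variable ranges from 0.104 to 0.339. The sign and relative magnitude of each value have been preserved.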
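Since mpg contains only positive values, it does not show the sign-preserving property in action. The sketch below applies the same idea to a made-up vector of temperatures (`temps` is an assumption for illustration) that mixes negative and positive values:

```r
# Hypothetical daily temperatures in degrees Celsius (made-up values)
temps <- c(-12.5, -3, 0, 8.4, 27)

# Smallest power of 10 that is equal to or larger than the maximum absolute value
power <- ceiling(log10(max(abs(temps))))
temps_ds <- temps / 10^power

power      # 2, because the largest absolute value is 27
temps_ds   # -0.125 -0.030  0.000  0.084  0.270
# Signs are unchanged after scaling
all(sign(temps_ds) == sign(temps))
```

Every value is divided by the same constant (100 here), so negative values stay negative and the ordering of the values is unchanged.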

We can also apply decimal scaling to all variables of the mtcars data set at once by applying the decimal_scale() function to each column, so that every variable gets its own power of 10. For example:

```
# Apply decimal scaling to each variable of mtcars separately
# (datasets::mtcars is the untouched original, without helper columns such as mpg_ds)
mtcars_ds <- datasets::mtcars
mtcars_ds[] <- lapply(mtcars_ds, decimal_scale)
# See the structure and summary of the scaled mtcars data frame
str(mtcars_ds)
summary(mtcars_ds)
```

We can also visualize the effect of decimal scaling on the data by using some plots. For example, we can use the hist() function to create histograms for the original and scaled mpg variables:

```
# Create histograms for original and scaled mpg variables
par(mfrow = c(1, 2)) # Set up plot layout
hist(mtcars$mpg, main = "Original mpg", xlab = "Miles per gallon")
hist(mtcars_ds$mpg, main = "Scaled mpg", xlab = "Decimal values")
```

The histograms show that decimal scaling does not change the shape of the distribution; it only rescales the values so that they lie between -1 and 1. The original mpg variable ranges from 10.4 to 33.9, while the scaled mpg variable (mpg divided by 100) ranges from 0.104 to 0.339. The mean of the original mpg variable is 20.09, while the mean of the scaled mpg variable is 0.20.

We can also use the boxplot() function to create boxplots for the original and scaled variables of the mtcars data set:

```
# Create boxplots for original and scaled variables of mtcars
par(mfrow = c(2, 1)) # Set up plot layout
boxplot(datasets::mtcars, main = "Original mtcars", las = 2) # Untouched original (no helper columns); las = 2 rotates labels by 90 degrees
boxplot(mtcars_ds, main = "Scaled mtcars", las = 2) # Rotate labels by 90 degrees
```

The boxplots show that decimal scaling brings all variables into the interval from -1 to 1 (for mtcars, where no value is negative, from 0 to 1). The original variables have very different ranges, which makes them difficult to compare and interpret. The scaled variables share a common bounded scale, making them easier to compare and interpret.

## FAQs

**How do I normalize a dataset in R?**

There are different ways to normalize a dataset in R, depending on the type and range of the data and the purpose of the analysis. The common methods covered in this article are z-score normalization, min-max normalization, range normalization, and decimal scaling.

**How do I normalize my data?**

Normalization is a general term that refers to transforming data values to a common scale or distribution. Depending on the context and the type of analysis you want to perform, you may need to use different normalization methods. Some common techniques are:

- **Scaling to a range**: Using a linear transformation, this method converts the data values to a specified range, such as [0,1] or [-1,1]. This method is useful when you want to compare data that have different units or ranges.
- **Clipping**: This method caps all data values above or below a certain threshold to a fixed value. This method is useful when your data contains extreme outliers that may distort the analysis.
- **Log scaling**: This method applies the logarithm function to the data values to compress a wide range to a narrow range. This method is useful when your data follows a power law distribution or has a long tail.
- **Z-score standardization**: This method scales the data values such that they have a mean of zero and a standard deviation of one, using the formula: $$z_i = \frac{x_i - \bar{x}}{s}$$ where $z_i$ is the standardized value of $x_i$, $\bar{x}$ is the mean of the data, and $s$ is the standard deviation of the data. This method is useful when you want to remove the effect of different scales or units on your data and express each value in units of standard deviations from the mean.

**Which R function do we use to normalize a series?**

There is no single R function that can normalize any series of data. Depending on the type and purpose of normalization you want to perform, you may need to use different functions or write your own function. Some examples are:

- To scale a series to a range between 0 and 1, you can use the `scale` function with the `center` and `scale` arguments set to the minimum and range of the series, respectively.
- To clip a series above or below a specific value, use `pmin` (to cap values from above) or `pmax` (to cap values from below). Note that setting the `na.rm` argument to `TRUE` replaces missing values with the threshold.
- To apply log scaling to a series, you can use the `log` function with an appropriate `base` argument, or the shortcuts `log10` and `log2`.
- To standardize a series using z-scores, you can use the `scale` function with no additional arguments (the default behaviour is to center by the mean and scale by the standard deviation).
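Clipping and log scaling are not demonstrated elsewhere in this article, so here is a minimal sketch of both on a made-up vector. The thresholds 2 and 100 are arbitrary assumptions chosen for illustration.

```r
# Made-up values with one extreme outlier
x <- c(1, 5, 10, 50, 1000)

# Clipping: cap values below 2 and above 100
x_clipped <- pmin(pmax(x, 2), 100)
x_clipped   # 2 5 10 50 100

# Log scaling: compress the wide range with a base-10 logarithm
x_log <- log10(x)
x_log       # 0.00000 0.69897 1.00000 1.69897 3.00000
```

Clipping changes only the extreme values, while log scaling changes every value but keeps their order.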

**How do you manually normalize a dataset?**

To manually normalize a dataset, apply an appropriate formula or transformation to each value. The formula depends on the type and purpose of normalization you want to perform, and it is the same formula that the corresponding method uses in the sections above.
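As a minimal sketch, the min-max formula can be applied by hand to every column of a small made-up data frame. The data frame `df` and its column names are assumptions for illustration.

```r
# Small made-up data frame with two numeric columns on different scales
df <- data.frame(height_cm = c(150, 160, 170, 180),
                 income = c(20000, 35000, 50000, 80000))

# Apply the min-max formula to every column by hand
df_norm <- as.data.frame(lapply(df, function(col) {
  (col - min(col)) / (max(col) - min(col))
}))
df_norm
```

Swapping the body of the anonymous function for another formula, such as the z-score formula, normalizes the columns with that method instead.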

**Should I normalize my dataset?**

The answer to this question depends on several factors, such as:

- The type and distribution of your data
- The purpose and goal of your analysis
- The assumptions and requirements of your chosen model or method

In general, normalization can be beneficial for your dataset if it:

- Reduces or eliminates unwanted variability or noise in your data
- Makes your data more comparable or compatible across different sources or scales
- Improves the performance or accuracy of your model or method
- Simplifies or clarifies your data interpretation or presentation

However, normalization can also have some drawbacks or limitations, such as:

- Losing some information or meaning in your data
- Introducing some bias or distortion in your data
- Making your data less intuitive or natural to understand
- Adding some complexity or difficulty to your data processing or manipulation

Therefore, you should carefully consider the pros and cons of normalization for your specific dataset and analysis before applying it. You should also evaluate the normalization results using appropriate metrics or criteria to ensure that normalizing does not harm your data quality or analysis outcome.

**What are the four types of database normalization?**

Database normalization is a different concept from the statistical normalization covered in this article: it is the process of designing a relational database schema that reduces data redundancy and improves data integrity. There are several levels, or forms, of database normalization, each with its own rules and criteria. The four most commonly cited forms are First, Second, and Third Normal Form and Boyce-Codd Normal Form. The first two are defined as follows:

- First Normal Form (1NF): A table is in 1NF if it has no repeating groups or arrays of values within a single record. Each attribute (column) should contain only atomic values (single values that cannot be further decomposed). Each record (row) should be unique and identified by a primary key (one or more attributes that uniquely identify a row).
- Second Normal Form (2NF): A table is in 2NF if it is in 1NF and has no partial dependencies. A partial dependency occurs when a non-key attribute depends on only part of a composite primary key rather than the whole key.

**What is the difference between normalization and standardization?**

Normalization and standardization both transform data to have a common scale or range, but they are not the same thing. Normalization usually refers to rescaling data to a minimum of zero and a maximum of one, while standardization usually refers to z-score normalization, which gives data a mean of zero and a standard deviation of one.

**How do I choose which normalization method to use?**

There is no definitive answer to this question, as different methods may have different advantages and disadvantages depending on the data and the analysis.

Some factors that may influence your choice are:

- **The type and distribution of the data.** For example, if your data is skewed or has outliers, you may prefer z-score normalization or decimal scaling, which are generally less dominated by a single extreme value than min-max normalization. If your data is already bounded between zero and one, you may not need to normalize at all.
- **The purpose and outcome of the analysis.** For example, if you want to compare or combine data from different sources with different scales or units, you may use min-max normalization or max_scale normalization, which give data a common range. If you want to run statistical tests or machine learning algorithms that are sensitive to the scale or variance of the variables, you may use z-score normalization, which gives every variable a standard deviation of one. Note that it does not turn a skewed distribution into a normal one.
- **The readability and interpretability of the data.** For example, if you want to make your data more readable or intuitive for humans, you may use min-max normalization or max_scale normalization, which give data values between zero and one or between zero and a specific value. If you want to preserve the sign and relative magnitude of your data, you should use decimal scaling or max_scale normalization, which rescale by a constant and do not shift the values.

**How do I normalize categorical or binary variables?**

Categorical or binary variables have a finite number of discrete values, such as gender, colour, or yes/no. Normalizing these variables may make little sense, as they do not have a continuous scale or range. However, you may want to transform these variables into numeric values for analysis in some situations. For example:

- Suppose you have a binary variable representing presence or absence (0 or 1). In that case, you can leave it as it is or multiply it by a specific value to change its scale.
- Suppose you have an ordinal variable representing an ordered category (such as low, medium, or high). In that case, you can assign numeric values that reflect the order (1, 2, 3) and then normalize those values.
- Suppose you have a nominal variable representing an unordered category (red, green, or blue). In that case, you can use dummy coding or one-hot encoding to create binary variables for each category (such as red = 1, green = 0, blue = 0) and normalize them as usual.
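The last two cases can be sketched with base R alone. The variables below (`color`, `size`) are made-up examples, and model.matrix() is only one of several ways to build dummy variables.

```r
# Made-up nominal variable
df <- data.frame(color = factor(c("red", "green", "blue", "green")))

# One-hot encode: "- 1" drops the intercept so every level gets its own 0/1 column
one_hot <- model.matrix(~ color - 1, data = df)
one_hot

# Made-up ordinal variable: map ordered levels to 1, 2, 3, then min-max normalize
size <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"), ordered = TRUE)
size_num <- as.integer(size)
(size_num - min(size_num)) / (max(size_num) - min(size_num))   # 0.0 1.0 0.5
```

The one-hot columns already take only the values 0 and 1, so they sit on the same scale as min-max normalized variables.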

**How do I check if my data is normalized?**

There are different ways to check whether your data is normalized, depending on the method you used. For example:

- If you used z-score normalization, you can check if your data has a mean of zero and a standard deviation of one using R's mean() and sd() functions.
- If you used min-max normalization, you can check if your data has a minimum of zero and a maximum of one using R's min() and max() functions.
- If you used range normalization, you can check if your data has a range of one using the range() function in R.
- If you used decimal scaling, you can check if your data has a range from -1 to 1 using the range() function in R.
- If you used max_scale normalization, you can check if your data has a maximum of the desired value using R's max() function.
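Max_scale normalization is described in this article but not demonstrated, so the following is a minimal sketch based on its definition (multiply by the desired maximum and divide by the actual maximum). The helper name `max_scale` and the argument `target_max` are assumptions for illustration; the check uses a tolerance because floating-point results are rarely exact.

```r
# Hypothetical helper: rescale x so that its maximum equals target_max
max_scale <- function(x, target_max = 100) {
  x * target_max / max(x)
}

mpg_ms <- max_scale(datasets::mtcars$mpg, target_max = 100)
# Check: the maximum equals the desired value
max(mpg_ms)
# Compare with a tolerance rather than exact equality
isTRUE(all.equal(max(mpg_ms), 100))
```

The same tolerance-based comparison with all.equal() is a sensible way to run the other checks in the list above.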

**How do I undo normalization?**

If you want to undo normalization and restore your data to its original values, you must apply the inverse of the normalization method you used. For example:

- If you used z-score normalization, multiply each value by the standard deviation and add the mean.
- If you used min-max normalization, multiply each value by the range (maximum-minimum) and add the minimum.
- If you used range normalization, you must multiply each value by the range (maximum-minimum).
- If you used decimal scaling, you need to multiply each value by the same power of 10 that you divided by when scaling.
- If you used max_scale normalization, you must multiply each value by the actual maximum value and divide by the desired maximum value.
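Undoing a normalization requires the parameters that were used to apply it. The sketch below shows a round trip for the first two methods on the mpg variable; it relies on the fact that scale() stores the centering and scaling values in the "scaled:center" and "scaled:scale" attributes of its result.

```r
x <- datasets::mtcars$mpg

# Z-score: scale() stores the mean and standard deviation as attributes
z <- scale(x)
x_back_z <- as.numeric(z) * attr(z, "scaled:scale") + attr(z, "scaled:center")

# Min-max: keep the minimum and range so the transformation can be inverted
x_min <- min(x)
x_rng <- max(x) - min(x)
mm <- (x - x_min) / x_rng
x_back_mm <- mm * x_rng + x_min

# Both round trips recover the original values up to floating-point error
all.equal(x_back_z, x)
all.equal(x_back_mm, x)
```

If the normalized values are stored without these parameters (for example after as.numeric() drops the attributes), the original values cannot be recovered, so it is worth saving them alongside the data.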

## Conclusion

In this article, we have learned how to normalize data in R using different methods and scenarios. We have seen how to use the base R scale() function, plain arithmetic, the dplyr package, and a custom function to perform z-score normalization, min-max normalization, range normalization, and decimal scaling.

Normalization is an essential step in data analysis, as it can help to reduce the effects of outliers, improve the performance of statistical tests and machine learning algorithms, and make the data more comparable and interpretable. However, normalization is not a one-size-fits-all solution, and we need to choose the appropriate method based on the data's type and distribution, the analysis's purpose, and the desired outcome.

We hope this article has helped you understand how to normalize data in R and apply it to your own data sets. If you have any questions or feedback, please get in touch with us at contact@rstudiodatalab.com.
