Key Points
- Outliers are data points that are significantly different from the rest of the data and can affect the results of statistical tests and machine learning models.
- There are different ways to detect outliers, such as graphical methods (boxplots and histograms) and statistical methods (z-scores, interquartile range, Dixon’s test, and Rosner’s test).
- There are different ways to remove outliers from a dataset, such as using logical operators and subsetting, using the subset() function, or using the filter() function from the dplyr package.
- There are different ways to impute missing values in a dataset, such as mean, median, or mode imputation, multiple imputation by chained equations (MICE), or K-nearest neighbours (KNN) imputation.
- There are different ways to encode categorical variables in a dataset, such as label encoding, one-hot encoding, or ordinal encoding.
Description of Functions and Packages
Function/Package | Description |
---|---|
boxplot() | Creates a boxplot for a numeric variable |
hist() | Creates a histogram for a numeric variable |
scale() | Centres and scales a numeric variable, which gives its z-scores |
IQR() | Calculates the interquartile range of a numeric variable |
dixon.test() from the outliers package | Performs Dixon’s test for one outlier in a small dataset (3 to 30 observations) |
rosnerTest() from the EnvStats package | Performs Rosner’s test for multiple outliers in a large dataset |
subset() | Creates a new dataset that contains only the rows that meet a certain criterion |
filter() from the dplyr package | Creates a new dataset that keeps only the rows that match a certain condition |
na_mean() from the imputeTS package | Performs mean, median, or mode imputation for missing values (option = "mean", "median", or "mode") |
mice() from the mice package | Performs multiple imputation by chained equations (MICE) for missing values |
knnImputation() from the DMwR2 package | Performs K-nearest neighbours (KNN) imputation for missing values |
as.numeric() or as.factor() | Converts a variable into numeric or factor type, respectively |
model.matrix() | Creates a design matrix with dummy variables for each category of a factor variable |
factor() | Creates a factor variable with specified levels (ordered when ordered = TRUE) |
I’m Zubair Goraya, a Ph.D. Scholar, Certified data analyst, and Freelancer, and I love to share my knowledge and experience with R programming. In this article, I’m going to show you how to deal with outliers in data using R.
Outliers are data points that are significantly different from the rest of the data and can affect the results of statistical tests and machine learning models. Therefore, it is important to identify and remove outliers before performing any analysis on the data.
Here are the main topics that I will cover in this article:
- What is an outlier, and how do we detect it using boxplots and histograms?
- How to find outliers using z-scores, interquartile range, Dixon’s and Rosner’s test?
- How do we remove outliers from data using R functions?
- How do we impute missing values and handle categorical variables in data?
- How do we check the presence of outliers after data cleaning?
By the end of this article, you will have a clear understanding of how to perform outlier analysis in R and improve the quality of your data. You will also learn some useful tips and tricks for data science and machine learning projects. So, let’s get started!
What is an outlier?
An outlier is a value that is very different from the other values in a dataset. For example, if you have a dataset of heights of people, a value of 2 meters or 50 centimetres would be considered an outlier. Outliers can be caused by various factors such as measurement errors, data entry errors, natural variability, or rare events.
Outlier Detection
One way to detect outliers is to use graphical methods such as boxplots and histograms. A boxplot is a type of plot that shows the distribution of a numeric variable using five summary statistics: minimum, first quartile, median, third quartile, and maximum. The box represents the middle 50% of the data, while the whiskers extend to the most extreme values that lie within 1.5 times the interquartile range (IQR) of the box edges. Any value beyond the whiskers is considered an outlier and is marked with a dot or a circle.
A histogram is another type of plot that shows the frequency of values in a numeric variable using bars. The height of each bar represents the number of observations in each bin or interval. A histogram can also help to identify outliers by showing the shape and spread of the data. If the data is skewed or has long tails, it may indicate the presence of outliers.
How do we detect it using boxplots and histograms?
To create boxplots and histograms in R, you can use the boxplot() and hist() functions, respectively. For example, let’s use the mtcars dataset that comes with R and create boxplots and histograms for the hp variable.
# Load the mtcars dataset
data(mtcars)
# Create boxplots for every variable in mtcars
boxplot(mtcars, main = "Boxplots for the mtcars data set", xlab = "Variable", ylab = "Value")
# Create a boxplot for hp
boxplot(mtcars$hp, main = "Boxplot for hp", ylab = "hp")
# Create a histogram for hp
hist(mtcars$hp, main = "Histogram for hp", xlab = "hp", ylab = "Frequency")
From these plots, we can see that there is one outlier in the hp variable, with a very high value of 335. This value is far away from the rest of the data and may affect the mean and standard deviation of the variable.
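If you prefer to list the flagged points rather than read them off the plot, one option is boxplot.stats(), the base R helper behind boxplot(). The following is a minimal sketch; the object names hp_stats and flagged_cars are only illustrative.

```r
# boxplot.stats() returns the five plotted summary values ($stats)
# and any points that lie beyond the whiskers ($out)
data(mtcars)
hp_stats <- boxplot.stats(mtcars$hp)

hp_stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
hp_stats$out    # values drawn as individual points on the boxplot

# Look up which cars those values belong to
flagged_cars <- rownames(mtcars)[mtcars$hp %in% hp_stats$out]
flagged_cars
```

Knowing which row a flagged value comes from makes it easier to judge whether it is an error or a genuine extreme value.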
How do we find outliers using statistical methods?
Another way to find outliers in R is to use statistical methods that calculate a measure of deviation or distance from the center or average of the data. Some common methods are:
- Z-scores: A z-score is a standardized score that measures how many standard deviations a value is away from the mean. A value with a z-score greater than 3 or less than -3 is usually considered an outlier.
- Interquartile range (IQR): The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of a variable. It represents the middle 50% of the data. A value that lies more than 1.5 times the IQR above Q3 or below Q1 is usually considered an outlier.
- Dixon’s test: It is a method that tests one suspected outlier at a time in a small dataset (3 to 30 observations). It compares the gap between the suspected outlier and its nearest neighbour with the range of the data. If this ratio exceeds a critical value based on the sample size and significance level, then the value is declared an outlier.
- Rosner’s test: It is a method that tests for multiple outliers at a time in a large dataset (more than 30 observations). Also known as the generalized extreme Studentized deviate (ESD) test, it is an iterative procedure: you specify the maximum number of suspected outliers, and at each step it measures how many standard deviations the most extreme value lies from the mean, removes that value, and repeats. Each statistic is then compared with its own critical value.
To apply these methods in R, you can use the following functions:
- scale() to calculate z-scores
- IQR() to calculate the interquartile range
- dixon.test() from the outliers package to perform Dixon’s test
- rosnerTest() from the EnvStats package to perform Rosner’s test
Find Outliers in RStudio
For example, let’s use these functions to find outliers in the hp variable of the mtcars dataset. If the packages are not installed yet, install them first and then load the libraries. If you don’t know how, see How to Import and Install Packages in R: A Comprehensive Guide.
Outlier Detection using Z-scores
Z-scores, also known as standard scores, z-values, normal scores, or standardized values, measure how many standard deviations away a value is from the mean of a distribution. They are useful for comparing data with different units, scales, or ranges. They can also help us test a dataset's normality, find outliers, and calculate probabilities. Read more: Did You Know How to Calculate Z-Score in R?
# Install the packages once if they are missing
# install.packages(c("EnvStats", "outliers"))
# Load the EnvStats package (used later for Rosner's test)
library(EnvStats)
# Calculate z-scores for hp
z_scores <- scale(mtcars$hp)
z_scores
# Find values with z-scores greater than 2 or less than -2
z_outliers <- mtcars$hp[abs(z_scores) > 2]
# Print z_outliers
z_outliers
[1] 335
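This means that the value 335 is an outlier according to the z-score method with a cutoff of 2. With the stricter cutoff of 3 mentioned earlier, no hp value would be flagged, because the largest z-score is about 2.75.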
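To see what scale() is doing, the z-scores can also be computed by hand. This is a small sketch in base R; the names z_manual, z_scaled, flag_at_2, and flag_at_3 are only illustrative.

```r
data(mtcars)
hp <- mtcars$hp

# z = (value - mean) / standard deviation
z_manual <- (hp - mean(hp)) / sd(hp)

# scale() returns a one-column matrix; as.vector() drops the matrix structure
z_scaled <- as.vector(scale(hp))
all.equal(z_manual, z_scaled)

# Largest absolute z-score, and the values flagged at each cutoff
max(abs(z_manual))
flag_at_2 <- hp[abs(z_manual) > 2]
flag_at_3 <- hp[abs(z_manual) > 3]
flag_at_2
flag_at_3
```

Comparing the two cutoffs side by side shows how much the result depends on the threshold you choose.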
Inter Quartile Range (IQR)
# Inter Quartile Range (IQR)
# Calculate IQR scores for hp: distance from the median in units of the IQR
iqr_scores <- (mtcars$hp - median(mtcars$hp)) / IQR(mtcars$hp)
# Find values with IQR scores greater than 1.5 or less than -1.5
iqr_outliers <- mtcars$hp[abs(iqr_scores) > 1.5]
# Print iqr_outliers
iqr_outliers
[1] 264 335
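This means that 264 and 335 are outliers according to this IQR score, which measures distance from the median. It flags one more value than the z-score method. Note that this score is stricter than the boxplot rule, which measures 1.5 times the IQR from Q1 and Q3 and flags only 335.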
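For comparison, here is a sketch of the quartile-based rule described earlier, where the limits (often called fences) sit 1.5 times the IQR below Q1 and above Q3. The variable names are only illustrative.

```r
data(mtcars)
hp <- mtcars$hp

# First and third quartiles and the interquartile range
q <- quantile(hp, probs = c(0.25, 0.75))
iqr <- IQR(hp)

# Fences: 1.5 * IQR below Q1 and above Q3
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

# Values outside the fences
fence_outliers <- hp[hp < lower_fence | hp > upper_fence]
fence_outliers
```

Because the fences are computed from the data, the same code can be reused on any numeric variable without hard-coding a cutoff.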
Outlier Detection Using Dixon’s test
Dixon’s test has a limitation: the dixon.test() function only handles sample sizes from 3 to 30 and does not work on larger samples. So, for this tutorial, we subset the data and take only the first 30 observations.
# Load the outliers package
library(outliers)
# Perform Dixon's test for hp on the first 30 observations
dixon_test <- outliers::dixon.test(mtcars$hp[1:30])
# Print dixon_test
print(dixon_test)
By default, dixon.test() tests the value that lies farthest from the mean, which here is 264, the largest hp among the first 30 rows. That value is an outlier at a significance level of 0.05 only if the reported p-value is below 0.05. These results are not reliable for the full data, because mtcars has 32 rows and the subset leaves out the 335 hp car in row 31.
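To make the idea behind Dixon’s ratio concrete, here is a sketch that computes its simplest form (gap to the nearest neighbour divided by the range) by hand. The sample x is made up for illustration, and dixon_q() is a hypothetical helper, not a function from the outliers package. The package may use a different variant of the ratio depending on sample size, so its statistic can differ from this one.

```r
# Hypothetical small sample, made up for illustration
x <- c(10.2, 10.4, 10.5, 10.6, 10.8, 13.9)

# Simplest form of Dixon's ratio: gap to the nearest neighbour / range
dixon_q <- function(x) {
  s <- sort(x)
  n <- length(s)
  rng <- s[n] - s[1]
  c(lowest  = (s[2] - s[1]) / rng,      # ratio for the smallest value
    highest = (s[n] - s[n - 1]) / rng)  # ratio for the largest value
}

q_ratios <- dixon_q(x)
q_ratios
```

A ratio close to 1 means the suspected value sits far from everything else; whether it counts as an outlier still depends on the tabulated critical value for the sample size, or on the p-value that dixon.test() reports.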
Outlier Detection Using Rosner's test
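# Perform Rosner's test for hp, testing up to k = 2 suspected outliers
rosner_test <- rosnerTest(mtcars$hp, k = 2)
# Print rosner_test
rosner_test
Here k = 2 tells rosnerTest() to test up to two suspected outliers, which are the two most extreme values, 335 and 264. A value is an outlier at a significance level of 0.05 only if its row in rosner_test$all.stats shows TRUE in the Outlier column, so check the printed result before removing anything.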
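The logic of Rosner’s procedure can be sketched in base R. The function below, esd_sketch(), is a hypothetical helper written for this illustration and is not part of EnvStats. It assumes the standard generalized ESD formulas: the statistic at each step is the largest absolute deviation from the mean divided by the standard deviation, and the critical value comes from a t quantile.

```r
data(mtcars)

# Generalized ESD statistics for up to k suspected outliers (illustrative sketch)
esd_sketch <- function(x, k, alpha = 0.05) {
  n <- length(x)
  out <- data.frame(i = seq_len(k), value = NA_real_, R = NA_real_, lambda = NA_real_)
  for (i in seq_len(k)) {
    dev <- abs(x - mean(x))
    idx <- which.max(dev)
    out$value[i] <- x[idx]
    out$R[i] <- dev[idx] / sd(x)                 # test statistic R_i
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    out$lambda[i] <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1)) # critical value lambda_i
    x <- x[-idx]                                 # drop the most extreme value and repeat
  }
  # Number of outliers: the largest i with R_i > lambda_i (0 if none)
  exceed <- which(out$R > out$lambda)
  n_out <- if (length(exceed) > 0) max(exceed) else 0
  out$outlier <- seq_len(k) <= n_out
  out
}

esd <- esd_sketch(mtcars$hp, k = 2)
esd
```

Each row shows the value tested, its statistic R, and the critical value lambda it is compared with, which is the same comparison rosnerTest() reports in its own output.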
How do we remove outliers from data?
Once you have identified the outliers in your dataset, remove them before performing any further analysis of the data. There are different ways to remove outliers from data, such as:
Using logical operators and subsetting
You can use logical operators such as <, >, ==, !=, etc., to create a condition that identifies the rows to keep, and then use subsetting to select only the rows that satisfy the condition.
Using the subset() function
You can use the subset() function to create a new dataset that contains only the rows that meet a certain criterion, so the outliers are left out.
Using the filter() function from the dplyr package
You can use the filter() function from the dplyr package to create a new dataset that keeps only the rows that match a certain condition and drops the rest.
For example, let’s use these methods to remove the outliers from the hp variable of the mtcars dataset.
# Load the dplyr package
library(dplyr)
# Keep rows with hp below 250, which drops the flagged values 264 and 335
# Remove outliers using logical operators and subsetting
mtcars_no_outliers1 <- mtcars[mtcars$hp < 250, ]
mtcars_no_outliers1
# Remove outliers using the subset() function
mtcars_no_outliers2 <- subset(mtcars, hp < 250)
mtcars_no_outliers2
# Remove outliers using the filter() function from the dplyr package
mtcars_no_outliers3 <- filter(mtcars, hp < 250)
mtcars_no_outliers3
The output of each of these commands is a new dataset that contains only 30 rows and excludes the outliers from the hp variable.
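After removing rows, it is worth checking the variable again, because the quartiles shift once the extreme values are gone and new points can end up beyond the whiskers. A minimal sketch, using an illustrative cutoff of 250 for hp (which drops 264 and 335):

```r
data(mtcars)

# Illustrative cutoff: keep rows with hp below 250
mtcars_clean <- mtcars[mtcars$hp < 250, ]

nrow(mtcars)        # rows before cleaning
nrow(mtcars_clean)  # rows after cleaning

# Re-apply the boxplot rule to the cleaned variable
remaining <- boxplot.stats(mtcars_clean$hp)$out
length(remaining)   # 0 means no point lies beyond the whiskers any more
```

You can also redraw the boxplot with boxplot(mtcars_clean$hp) to confirm the result visually.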
How to impute missing values and handle categorical variables in a dataset?
Another common issue that you may encounter in your dataset is missing values. Missing values are values that are not recorded or available for some reason. R represents them as NA, while raw data files may use symbols such as NULL, ?, or blank cells. Missing values can affect the accuracy and validity of your analysis and may introduce bias or errors in your results.
One way to deal with this is to impute them. Imputation is a process of replacing missing values with plausible values based on some criteria or assumptions.
There are different methods, such as:
Mean or median imputation
This method replaces missing values with the mean or median of the variable. It is simple and easy to implement, but it may reduce the variability and distort the distribution of the data.
Mode or most frequent imputation
This method replaces missing values with the mode or most frequent value of the variable. It is suitable for categorical variables, but it may introduce bias and overrepresent some categories.
Regression imputation
This method replaces missing values with predicted values based on a regression model that uses other variables as predictors. It is more sophisticated and realistic, but it may increase the complexity and uncertainty of the model.
K-nearest neighbours (KNN) imputation
This method replaces missing values with the average or weighted average of the k nearest neighbours of the observation based on some distance metric. It is more flexible and adaptive, but it may be computationally expensive and sensitive to outliers.
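As an illustration of regression imputation, here is a small base R sketch. The data frame sim is simulated for this example, and the assumption that z depends linearly on x is built into the simulation; with real data you would need to check that such a relationship exists.

```r
set.seed(123)
# Simulated data, made up for illustration: z depends linearly on x
sim <- data.frame(x = rnorm(100, mean = 50, sd = 10))
sim$z <- 2 * sim$x + rnorm(100, sd = 5)
sim$z[sample(1:100, 10)] <- NA

# Fit the model on the complete rows (lm() drops rows with NA by default)
fit <- lm(z ~ x, data = sim)

# Predict z only where it is missing
missing_rows <- is.na(sim$z)
sim$z[missing_rows] <- predict(fit, newdata = sim[missing_rows, , drop = FALSE])

# Count the missing values that remain
sum(is.na(sim$z))
```

The imputed values fall exactly on the fitted line, which is one reason this approach can understate the variability of the data.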
To perform imputation in R, you can use various functions and packages, such as:
- na_mean() from the imputeTS package to perform mean, median, or mode imputation, selected with option = "mean", "median", or "mode" (older versions of the package called it na.mean())
- mice() from the mice package to perform multiple imputation by chained equations (MICE), which is a general method that can handle different types of variables and models
- knnImputation() from the DMwR2 package to perform KNN imputation
Imputation of Missing Values in R
# Load the imputeTS, mice, and DMwR2 packages
#install.packages("imputeTS")
library(imputeTS)
#install.packages("mice")
library(mice)
#install.packages("DMwR2")
library(DMwR2)
# Create a simulated dataset with numeric and categorical variables
set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = sample(c("A", "B", "C"), 100, replace = TRUE),
  z = runif(100, min = 0, max = 100)
)
# Introduce some missing values randomly
df[sample(1:100, 10), "x"] <- NA
df[sample(1:100, 10), "y"] <- NA
df[sample(1:100, 10), "z"] <- NA
# Check which columns contain missing values
colSums(is.na(df))
Impute Missing values
Let’s use the imputeTS package to perform mean imputation for the numeric variables and a small custom function to perform mode imputation for the categorical variable.
# Load the imputeTS package
library(imputeTS)
# Perform mean imputation for x and z variables
df$x <- na_mean(df$x)
df$z <- na_mean(df$z)
# Perform mode imputation for y variable
# Custom function to impute mode for a vector
impute_mode <- function(x) {
  # table() ignores NA and counts each category
  table_x <- table(x)
  # Take the name of the most frequent category
  mode_val <- names(table_x)[which.max(table_x)]
  x[is.na(x)] <- mode_val
  return(x)
}
# Impute mode for the 'y' variable
df$y <- impute_mode(df$y)
# Print df after imputation
head(df, 10)
# Check which columns contain missing values
colSums(is.na(df))
As you can see, this dataset has no more missing values in any of the variables: x, y, and z.
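Since this article is about outliers, it is worth noting that the mean itself is pulled by extreme values, so median imputation can be the safer choice when outliers are present. Here is a small base R sketch with a made-up vector v; in imputeTS, the equivalent would be na_mean() with option = "median".

```r
# Hypothetical vector with one extreme value (200) and two gaps
v <- c(12, 15, NA, 14, 13, 200, NA, 16)

# Fill the gaps with the mean and with the median of the observed values
mean_filled   <- ifelse(is.na(v), mean(v, na.rm = TRUE), v)
median_filled <- ifelse(is.na(v), median(v, na.rm = TRUE), v)

mean_filled
median_filled
```

The mean-filled gaps are dragged towards the extreme value, while the median-filled gaps stay close to the bulk of the data.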
Handle categorical variables in R.
Another issue that you may face in your dataset is the presence of categorical variables. Categorical variables are variables that have a finite number of possible values or categories, such as gender, colour, or type of car.
Categorical variables can be either nominal or ordinal.
- Nominal variables are variables that have no inherent order or ranking among the categories, such as gender or color.
- Ordinal variables are variables that have a natural order or ranking among the categories, such as education level or satisfaction rating.
One way to deal with categorical variables is to encode them into numeric values that can be used for analysis and modelling.
There are different methods of encoding categorical variables, such as:
- Label encoding: This method assigns a unique integer value to each category of the variable, starting from zero or one.
- One-hot encoding: This method creates a new binary variable for each category of the variable, with a value of one if the observation belongs to that category and zero otherwise.
- Ordinal encoding: This method assigns an integer value to each category of the variable based on the order or ranking of the categories.
Encoding categorical variables in R
To perform encoding in R, you can use various functions and packages, such as:
- as.numeric() or as.factor() to convert a variable into numeric or factor type, respectively.
- model.matrix() to create a design matrix with dummy variables for each category of a factor variable.
- factor() to create an ordered factor variable with specified levels.
For example, let’s use these functions to encode the y variable of the df dataset that we created earlier.
# Encode y variable using label encoding
df$y <- as.factor(df$y)
df$y_label <- as.numeric(df$y) - 1
# Encode y variable using one-hot encoding
df$y_onehot <- model.matrix(~ y - 1, data = df)
# Encode y variable using ordinal encoding
df$y_ordinal <- factor(df$y, levels = c("A", "B", "C"), ordered = TRUE)
# Print df after encoding
df
I have encoded the y variable using three different methods:
- Label Encoding
- One-hot encoding
- Ordinal encoding
Label encoding
It assigns a unique integer value to each category of the variable, starting from zero or one. For example, category A is encoded as zero, B as one, and C as two.
One-hot encoding
It creates a new binary variable for each category of the variable, with a value of one if the observation belongs to that category and zero otherwise. For example, the category A is encoded as a vector of (1,0,0), B as (0,1,0), and C as (0,0,1).
Ordinal encoding
It assigns an integer value to each category of the variable based on the order or ranking of the categories. For example, category A is encoded as one, B as two, and C as three.
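An ordered factor stores the order but still prints its labels, so to get the integer codes you can call as.integer() on it. A minimal sketch with made-up satisfaction ratings:

```r
# Hypothetical satisfaction ratings with a natural order
rating <- c("low", "high", "medium", "low", "high")

# Specify the levels explicitly; the default alphabetical order
# (high, low, medium) would not match the intended ranking
rating_ord <- factor(rating, levels = c("low", "medium", "high"), ordered = TRUE)

# as.integer() returns the position of each value in the levels vector
rating_code <- as.integer(rating_ord)
rating_code

# Ordered factors also support comparisons
at_least_medium <- rating_ord >= "medium"
at_least_medium
```

Setting the levels by hand is the important step here, because the integer codes simply follow the order of the levels.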
You can see the results of each encoding method in the new columns that I have added to the dataset: y_label, y_onehot (a matrix column that prints as y_onehot.yA, y_onehot.yB, and y_onehot.yC), and y_ordinal.
Encoding categorical variables can help to transform them into numeric values that can be used for analysis and modelling. However, you should be careful to choose the appropriate method for your data and your purpose.
Advantages and Disadvantages of each method
Some advantages and disadvantages of each method are:
- Label encoding is simple and easy to implement, but it may imply a false sense of order or magnitude among the categories that may not exist in reality.
- One-hot encoding is more expressive and avoids the problem of order or magnitude, but it may create a large number of new variables that may increase the dimensionality and sparsity of the data.
- Ordinal encoding is suitable for ordinal variables that have a natural order or ranking among the categories, but it may not work well for nominal variables that have no inherent order or ranking.
Conclusion
This article shows you how to perform outlier analysis and imputation in R using various methods and functions. You have learned how to identify and remove outliers, how to replace missing values with plausible values, and how to transform categorical variables into numeric values. These steps can help you to improve the quality of your data and prepare it for further analysis and modelling.
If you want to learn more about R programming and data analysis, you can check out our latest R posts on our website: Data Analysis. You can also contact us at info@rstudiodatalab.com or hire us at Order Now if you need any help with your data science or machine learning projects.
Thank you for reading, and happy coding!
Frequently Asked Questions (FAQs)
What is an outlier?
An outlier is a data point that is significantly different from the rest of the data and can affect the results of statistical tests and machine learning models.
How can I detect outliers using boxplots?
A boxplot is a type of plot that shows the distribution of a numeric variable using five summary statistics: minimum, first quartile, median, third quartile, and maximum. The box represents the middle 50% of the data, while the whiskers extend to the most extreme values that lie within 1.5 times the interquartile range (IQR) of the box edges. Any value beyond the whiskers is considered an outlier and is marked with a dot or a circle.
How can I detect outliers using z-scores?
A z-score is a standardized score that measures how many standard deviations a value is away from the mean. A value with a z-score greater than 3 or less than -3 is usually considered an outlier.
How can I remove outliers from a dataset using logical operators and subsetting?
You can use logical operators such as <, >, ==, !=, etc., to create a condition that identifies the rows to keep, and then use subsetting to select only the rows that satisfy the condition. For example, if you want to remove outliers from the mpg variable of the mtcars dataset, you can use this command (the cutoffs 10 and 34 are only illustrative):
mtcars_no_outliers <- mtcars[mtcars$mpg > 10 & mtcars$mpg < 34, ]
How can I remove outliers from a dataset using the subset() function?
You can use the subset() function to create a new dataset that contains only the rows that meet a certain criterion and exclude the outliers. For example, if you want to remove outliers from the mpg variable of the mtcars dataset, you can use this command:
mtcars_no_outliers <- subset(mtcars, mpg > 10 & mpg < 34)
How can I remove outliers from a dataset using the filter() function from the dplyr package?
You can use the filter() function from the dplyr package to create a new dataset that keeps only the rows that match a certain condition and drops the rest. For example, if you want to remove outliers from the mpg variable of the mtcars dataset, you can use this command:
library(dplyr)
mtcars_no_outliers <- filter(mtcars, mpg > 10, mpg < 34)
What is imputation?
Imputation is a process of replacing missing values with plausible values based on some criteria or assumptions.
How can I impute missing values using mean, median, or mode imputation?
You can use the na_mean() function from the imputeTS package with option = "mean", "median", or "mode" to perform mean, median, or mode imputation for missing values. For example, if you want to impute missing values in the x variable of the df dataset using mean imputation, you can use this command:
library(imputeTS)
df$x <- na_mean(df$x)
How can I impute missing values using multiple imputations by chained equations (MICE)?
You can use the mice() function from the mice package to perform multiple imputation by chained equations (MICE) for missing values. MICE is a general method that can handle different types of variables and models. For example, if you want to impute missing values in the df dataset using MICE, you can use this command:
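library(mice)
# mice() returns an object that holds several imputed datasets
imp <- mice(df)
# complete() extracts one completed dataset (the first by default)
df_imputed <- complete(imp)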
How can I encode categorical variables into numeric values?
You can use various methods to encode categorical variables into numeric values, such as label encoding, one-hot encoding, or ordinal encoding. For example, if you want to encode the y variable of the df dataset using one-hot encoding, you can use this command:
df$y_onehot <- model.matrix(~ y -1, data = df)