Create Stunning Data Visualization in R

In this article, I will share my learning journey of Data Visualization in R and the use of ggplot2 to create amazing plots with minimal code.

Key Points

  • Data visualization is the art and science of presenting data visually, making it easy to understand and explore.
  • R is a powerful and popular programming language for data analysis and visualization, and ggplot2 is one of its most widely used visualization packages.
  • ggplot2 is based on the grammar of graphics, a framework that defines the components and rules of a graphic and allows you to create a wide range of plots with minimal code and high customization.
  • In this article, I will show you how to use ggplot2 to create some common types of data visualizations, such as bar charts, histograms, scatter plots, line charts, and box plots, and how to customize them with titles, labels, legends, themes, and colors.
  • I will also share some tips and tricks I learned along the way and resources that helped me improve my data visualization skills in R.

Packages and Functions Used in the Article

Package     Function                                   Description
tidyverse   install.packages("tidyverse")              Installs the tidyverse package
            library(tidyverse)                         Loads the tidyverse package
ggplot2     qplot(x, y, data, ...)                     Quick plot with x and y variables from a data frame
            ggplot(data, aes(x, y, ...))               Creates a ggplot object with data and aesthetic mappings
            geom_*()                                   Adds a geometric object to a ggplot
            labs(title, subtitle, caption, x, y, …)    Adds titles, labels, and legends to a ggplot
            theme()                                    Modifies the appearance of a ggplot
            scale_color_*()                            Changes the color palette of a ggplot
            facet_wrap(~ var)                          Creates multiple plots based on different subsets of a variable
            ggsave(filename, plot, …)                  Saves a ggplot as an image file, such as PNG, JPEG, PDF, etc.

Hi, I’m Zubair Goraya, a Ph.D. scholar, a certified data analyst, and a freelancer. I have used R for data analysis and visualization for over five years, worked on projects involving data from different domains, and taught R courses and workshops to students and professionals.

Data visualization is the art and science of presenting data visually, making it easier for readers to understand and helping you communicate your findings.

But how do you create effective data visualizations in R?

Many packages are available in R for data visualization, but one of the most popular and powerful libraries is ggplot2. ggplot2 is a part of the tidyverse, based on the grammar of graphics, a framework that defines the components and rules of a graphic.

We will explore how to use ggplot2 to create data visualizations, such as
  • Bar charts
  • Histograms
  • Scatter plots
  • Line charts
  • Box plots.

I will also show you how to add titles, labels, legends, themes, and colors to your plots.

My First Plot Using R

The first time I tried to create a plot in R, I was taking a statistics and data analysis course during my undergraduate studies. The instructor briefly introduced R, showed us how to use the plot function to create a basic plot, and asked us to create a simple scatter plot of two variables from a data set.

I was excited to try it out, so I opened RStudio and typed the following code:

plot(x = data$var1, y = data$var2)

I expected to see a nice scatter plot on the screen, but instead, I got this:

A plot with no title, labels, legend, or color

I was disappointed and confused because the plot looked dull and did not convey useful information. It had no title, labels, legend, or color; it was just a bunch of black dots on a white background. I wondered if I had done something wrong or if there was a better way to create a plot in R.

Then I decided to research online and discovered that there were many ways to create plots in R, such as the base functions hist, barplot, and boxplot, and the packages lattice and ggplot2.

I was curious to learn more about them and experimented with different options and parameters. I soon realized that creating a plot in R was more complex than I thought. 

There were many choices and decisions, such as:

  • What type of plot should I use for my data?
  • How should I map my variables to the plot elements, such as the x-axis, the y-axis, the color, the shape, and the size?
  • How should I customize the appearance of my plot, such as the title, the labels, the legend, the theme, and the color palette?
  • How should I save and export my plot as an image file?

I was overwhelmed by the amount of information and options available, and I often got frustrated by the errors and warnings I encountered. I spent more time debugging and tweaking my code than analyzing and visualizing my data. I wondered whether there was a better and easier way to create plots in R.

That’s when I discovered ggplot2, and everything changed.

What is ggplot2 and Why You Should Use It

ggplot2 is a package for data visualization in R created by Hadley Wickham, and it is part of the tidyverse, a collection of packages that work well together for data analysis and manipulation. It is based on the grammar of graphics, a framework that defines the components and rules of a graphic, proposed by Leland Wilkinson.

The grammar of graphics is a powerful and elegant way to think about and create graphics. It allows you to decompose any graphic into its essential elements, such as:

  • Data: the raw or processed data that you want to visualize
  • Aesthetics: the visual properties that you want to map to your data, such as the x-axis, the y-axis, the color, the shape, and the size
  • Geoms: the geometric objects that you want to use to represent your data, such as points, bars, lines, and boxes
  • Scales: the functions that transform your data values into aesthetic values, such as linear, logarithmic, categorical, and continuous scales
  • Facets: the ways that you want to split your data into subsets and display them as multiple plots
  • Coordinates: the systems that you want to use to define the position of your geoms, such as cartesian, polar, and map coordinates
  • Themes: the elements that you want to use to modify the appearance of your plot, such as the background, the grid lines, the text, and the legend

With ggplot2, you can create any plot by combining these elements with a simple and consistent syntax. You can also easily modify and customize your plot by adding or changing any element. ggplot2 takes care of the details and produces high-quality graphics ready for publication or presentation.
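
As a rough sketch of how these elements fit together in code, the example below builds one plot and spells out each component on its own line. It uses the mpg data set that ships with ggplot2. The particular choices (a log10 x scale, panels by drv, the classic theme, the object name p_grammar) are my own illustrative assumptions, not recommendations.

```r
library(ggplot2)

p_grammar <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +  # data + aesthetics
  geom_point() +                       # geom: one point per car
  scale_x_log10() +                    # scale: log-transform the x values
  facet_wrap(~ drv) +                  # facets: one panel per drive type
  coord_cartesian(ylim = c(10, 45)) +  # coordinates: zoom without dropping data
  theme_classic()                      # theme: non-data appearance

p_grammar
```

Removing or swapping any single line changes only that component, which is what makes the grammar convenient for step-by-step refinement.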

ggplot2 has many advantages over the base R plotting functions, such as:

  • Expressive and concise: you can create complex, customized plots with minimal code.
  • Consistent and logical: you use the same syntax and structure for any plot.
  • Flexible and powerful: you can create a wide range of plots with high customization.
  • Extensible to interactive graphics: ggplot2 itself draws static plots, but you can turn them into interactive ones with the plotly package.

How to Create Basic Plots with ggplot2

I will show you how to create basic plots with ggplot2. I will use a sample data set called mpg, which is included in the ggplot2 package. 

Overview of the data

The mpg data set contains information about the fuel economy of 234 cars from different manufacturers, models, and years. The data set has 11 variables, including manufacturer, model, year, class, cty, and hwy.

Load the data set and Required Packages.

To use the mpg data set, you must load the ggplot2 package first. You can do that by running the following command in your R console:

library(ggplot2)

You can also load the entire tidyverse package, which includes ggplot2 and other packages, by running the following command:

library(tidyverse)

To see the first few rows and dimensions of the mpg data set,  you can run the following command:

head(mpg)
dim(mpg)

The output will look like this:

first few rows and dimension of the mpg data set

To see the structure and summary of the mpg data set, you can run the following commands:

str(mpg)
summary(mpg)

The structure and summary statistics of the mpg data set are shown below:

Create a plot with ggplot2

To create a plot with ggplot2, you need to use two main functions. 

  • The ggplot function creates a ggplot object with the data and aesthetic mappings you want for your plot.
  • A geom function, such as geom_point or geom_bar, adds a geometric object to the ggplot object, such as points, bars, lines, and boxes. You can create a plot by combining these functions with the + operator.

Create a Scatter Plot using ggplot2

For example, to create a scatter plot of the city miles per gallon (cty) and the highway miles per gallon (hwy) of the cars in the mpg data set, you can use the following code:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

A scatter plot of cty and hwy with black points

  1. The first argument of the ggplot function is the data frame you want to use for your plot, in this case, mpg. 
  2. The second argument is the aesthetic mapping you want to use for your plot, which the aes function specifies. The aes function takes the variables you want to map to the plot elements, such as the x-axis, the y-axis, the color, the shape, and the size. In this case, we map the cty variable to the x-axis and the hwy variable to the y-axis.

The + operator adds a geom function to the ggplot object, specifying the plot type you want to create. In this case, we use the geom_point function, which adds points to the plot. The geom_point function can also take additional arguments, such as the color, the shape, and the size of the points, which can be either fixed values or mapped to variables.
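
To make the difference between fixed and mapped values concrete, here is a small sketch. The color "steelblue", the size and alpha values, and the use of drv for the mapping are my own illustrative choices.

```r
library(ggplot2)

# Fixed values: set outside aes(), applied to every point, no legend
p_fixed <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(color = "steelblue", size = 2, alpha = 0.6)

# Mapped values: set inside aes(), vary with the data, legend added
p_mapped <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = drv, shape = drv), size = 2)

p_fixed
p_mapped
```

A common mistake is writing aes(color = "steelblue"), which treats the text as a one-level variable instead of a color name.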

It is a basic scatter plot that shows the relationship between the city and highway miles per gallon of cars. You can see a positive correlation between the two variables, meaning that the cars with higher city miles per gallon also have higher highway miles per gallon. You can also see some variation in the data, meaning that some cars have higher or lower miles per gallon than others.

However, the plot still needs to be more informative and attractive. It has no title, label, legend, or color. It is just a bunch of black points on a white background. How can we improve it? Let’s see how we can add titles, labels, legends, and colors to our plot.

How to Customize Your Plots with ggplot2

I will show you how to customize your plots with ggplot2, such as adding titles, labels, legends, themes, and colors. I will use the same scatter plot of the city and highway miles per gallon of the cars in the mpg data set I created in the previous section.

To add titles, labels, and legends to your plot, you can use the labs function in the ggplot2 package. The labs function takes the following arguments:

  • title: the main title of the plot
  • subtitle: the subtitle of the plot
  • caption: the caption or source of the plot
  • x: the label of the x-axis
  • y: the label of the y-axis
  • fill: the label of the fill legend
  • color: the label of the color legend
  • shape: the label of the shape legend
  • size: the label of the size legend

For example, to add a title, a subtitle, a caption, and an x-axis and y-axis label to our plot, we can use the following code:

A scatter plot of cty and hwy with black points and titles, labels, and caption

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  labs(title = "City vs Highway Miles per Gallon of Cars",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package, rstudiodatalab.com",
       x = "City miles per gallon",
       y = "Highway miles per gallon")

The plot looks much better than the previous one. It has a title, a subtitle, a caption, and an x-axis and y-axis label that describe the plot and the data. However, it still has no legend and no color. How can we add them?

How to add Colors in the ggplot2 plot

To add a legend and a color to our plot, we need to map another variable to the color aesthetic in the aes function. For example, if we want to color the points by the manufacturer of the cars, we can use the following code:

A scatter plot of cty and hwy with points colored by manufacturer and a legend

ggplot(mpg, aes(x = cty, y = hwy, color = manufacturer)) +
  geom_point() +
  labs(title = "City vs Highway Miles per Gallon of Cars",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package, rstudiodatalab.com",
       x = "City miles per gallon",
       y = "Highway miles per gallon")

The plot looks even better than the previous one. It has a legend and a color that shows the car's manufacturer. You can see 15 manufacturers in the data set, each with a different color. You can also see that some manufacturers have higher or lower miles per gallon than others and that there is some variation within each manufacturer.

However, you may prefer a different color palette than ggplot2 uses for the plot. How can we change it?

How to Change color palette in ggplot2

To change the color palette of the plot, we can use one of the scale_color_*() functions in the ggplot2 package. These functions let us change the plot's color palette or assign specific colors to specific values. Commonly used ones include:

  • scale_color_viridis_d and scale_color_viridis_c: use a discrete or continuous palette from the viridis family, a set of perceptually uniform and colorblind-friendly palettes
  • scale_color_brewer: uses a color palette from the RColorBrewer package, which is a set of palettes designed by Cynthia Brewer for thematic maps
  • scale_color_manual: uses a color palette that we specify manually by providing a vector of colors or color names
  • scale_color_gradient: uses a color palette that interpolates between two colors
  • scale_color_gradient2: uses a color palette that interpolates between three colors, with a midpoint
  • scale_color_gradientn: uses a color palette that interpolates between n colors

For example, if we want to use a color palette from the viridis family, we can use the following code:

A scatter plot of cty and hwy with points colored by manufacturer and a legend, using a viridis color palette

ggplot(mpg, aes(x = cty, y = hwy, color = manufacturer)) +
  geom_point() +
  labs(title = "City vs Highway Miles per Gallon of Cars",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package, redesigned by rstudiodatalab.com",
       x = "City miles per gallon",
       y = "Highway miles per gallon") +
  scale_color_viridis_d()

The scale_color_viridis_d function uses a discrete color palette from the viridis package, which is suitable for categorical variables. The plot looks different than the previous one. It has a different color palette that is more pleasing to the eye and more friendly to colorblind people. You can also see that the legend has changed accordingly.
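
The other scale functions follow the same pattern. As a sketch, here is how scale_color_manual and scale_color_gradient might be used. The hex colors, the choice of drv and displ, and the object names are my own assumptions for illustration.

```r
library(ggplot2)

# Hypothetical color choices for the three drive types in mpg$drv
drv_colors <- c("4" = "#1b9e77", "f" = "#d95f02", "r" = "#7570b3")

# Categorical variable: assign one color per value by name
p_manual <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point() +
  scale_color_manual(values = drv_colors) +
  labs(color = "Drive type")

# Continuous variable: interpolate between two colors
p_gradient <- ggplot(mpg, aes(x = cty, y = hwy, color = displ)) +
  geom_point() +
  scale_color_gradient(low = "lightblue", high = "darkblue")

p_manual
p_gradient
```

Naming the elements of the values vector ties each color to a specific level, so the assignment does not depend on the order of the levels.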

How to change the theme of the ggplot2 graph

We can use the theme function in the ggplot2 package to change the appearance of individual plot elements, such as the background, the grid lines, the text, and the legend. ggplot2 also ships with complete themes that set all of these elements at once, such as:

  • theme_minimal: a minimal theme with no background fill or panel border, keeping only light gray grid lines
  • theme_classic: a classic theme with a white background, axis lines, and no grid lines
  • theme_dark: a dark theme with a dark gray background, meant to make thin colored lines stand out
  • theme_bw: a black and white theme with a white background, a dark panel border, and light gray grid lines
  • theme_light: a light theme with a white background and light gray grid lines and axes
  • theme_gray: the default theme, with a gray background and white grid lines

For example, if we want to use a minimal theme for our plot, we can use the following code:

A scatter plot of cty and hwy with points colored by manufacturer and a legend, using a viridis color palette and a minimal theme

ggplot(mpg, aes(x = cty, y = hwy, color = manufacturer)) +
  geom_point() +
  labs(title = "City vs Highway Miles per Gallon of Cars",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package, redesigned by rstudiodatalab.com",
       x = "City miles per gallon",
       y = "Highway miles per gallon") +
  scale_color_viridis_d() +
  theme_minimal()

The theme_minimal function applies a minimal theme with no background fill and no panel border. The plot looks cleaner and simpler than the previous one: only light gray grid lines remain behind the data, making the points and the colors stand out more. You can also see that the text and the legend have changed accordingly.
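
The theme function itself adjusts individual elements, unlike the complete theme_*() themes. Here is a minimal sketch, assuming we want the legend at the bottom, a bold title, and no minor grid lines. All three choices, and the use of drv for color, are illustrative only.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "City vs Highway Miles per Gallon of Cars") +
  theme_minimal() +                             # start from a complete theme
  theme(
    legend.position = "bottom",                 # move the legend below the plot
    plot.title = element_text(face = "bold"),   # make the title bold
    panel.grid.minor = element_blank()          # drop the minor grid lines
  )

p_theme
```

The order matters: theme() comes after theme_minimal(), because a complete theme added later would overwrite the individual tweaks.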

By now, you should know how to create and customize a scatter plot with ggplot2. You can apply the same principles and functions to create other types of plots, such as bar charts, histograms, line charts, and box plots. Let’s see how we can do that in the next section.

How to Create Other Types of Plots with ggplot2

I will show you how to create other types of plots with ggplot2, such as bar charts, histograms, line charts, and box plots. I will use the same mpg data set that I used in the previous section.

Bar Charts

A bar chart shows the counts or proportions of a categorical variable, or compares a continuous variable across different categories. To create a bar chart with ggplot2, you can use the geom_bar function. It works with the following aesthetics and arguments:

  • x: the variable that you want to map to the x-axis, which is usually a categorical variable
  • y: the variable that you want to map to the y-axis; leave it out to let geom_bar count the cases in each group, or supply it together with stat = "identity" when your data already contains the bar heights
  • stat: the statistical transformation applied to the data, which is "count" by default (count the number of cases in each group) and can be set to "identity" (no transformation). There is no built-in "proportion" stat; proportions are computed from the counts, for example with after_stat()
  • position: the position adjustment that you want to apply to the bars, which can be "stack" (stack the bars on top of each other), "dodge" (place the bars side by side), or "fill" (stack the bars and normalize them to have constant height)

For example, to create a bar chart of the number of cars by manufacturer in the mpg data set, you can use the following code:

A bar chart of the number of cars by manufacturer

ggplot(mpg, aes(x = manufacturer)) +
  geom_bar()

  • The first argument of the ggplot function is the data frame you want to use for your plot, in this case, mpg. 
  • The second argument is the aesthetic mapping you want to use for your plot, which the aes function specifies. The aes function takes the variable you want to map to the x-axis, in this case, the manufacturer.
The + operator adds a geom function to the ggplot object, specifying the plot type you want to create. In this case, we use the geom_bar function, which adds bars to the plot. The geom_bar function can also take additional arguments, such as the y variable, the stat, and the position, but we leave them as default for now.
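
To see what the default "count" stat does behind the scenes, here is a sketch that counts the cars by hand and then draws the precomputed values with geom_col, which behaves like geom_bar(stat = "identity"). The names counts, n, and p_col are my own.

```r
library(ggplot2)

# Count the cars per manufacturer by hand (base R, no extra packages)
counts <- as.data.frame(table(manufacturer = mpg$manufacturer))
names(counts)[2] <- "n"

# geom_col() draws the values exactly as given in the data
p_col <- ggplot(counts, aes(x = manufacturer, y = n)) +
  geom_col()

p_col
```

Both routes should give the same picture; geom_bar is simply the shortcut when the data has one row per observation.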

This basic bar chart shows the number of cars by manufacturer in the mpg data set. You can see 15 manufacturers in the data set, and the number of cars differs considerably between them. Each manufacturer gets a single bar, and the y-axis shows the count of the x variable.

However, this plot still needs to be more informative and attractive. It has no title, label, legend, or color. It is just a bunch of gray bars on a white background. How can we improve it?

How to add title, label, legend, or color in bar Chart using ggplot2

We can improve it by adding titles, labels, legends, themes, and colors to our plot, just like we did for the scatter plot in the previous section. For example, to add a title, a subtitle, a caption, an x-axis and y-axis label, a color, and a minimal theme to our plot, we can use the following code:

ggplot(mpg, aes(x = manufacturer, fill = manufacturer)) +
  geom_bar() +
  labs(title = "Number of Cars by Manufacturer",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package",
       x = "Manufacturer",
       y = "Number of cars",
       fill = "Manufacturer") +
  scale_fill_viridis_d() +
  theme_minimal()

The code above is similar to the code that we used for the scatter plot, except for two changes:

  • We map the manufacturer variable to the fill aesthetic in the aes function, which fills the bars with different colors according to the manufacturer.
  • We add a fill argument to the labs function, which adds a label to the fill legend.

Because the bars are colored through fill rather than color, the palette function also becomes scale_fill_viridis_d.

A bar chart of the number of cars by manufacturer, with titles, labels, legend, color, and minimal theme

The plot looks much better than the previous one. It has a title, a subtitle, a caption, an x-axis and y-axis label, a legend, and a color that describes the plot and the data. It also has a minimal theme that makes the plot cleaner and simpler.

How to Change the stat and position of the bar chart

To change the stat and position of the bar chart, we can use the stat and position arguments in the geom_bar function, or compute a new value from the default stat with after_stat(). For example, if we want to show the proportion of cars by manufacturer instead of the number of cars, we can use the following code:

ggplot(mpg, aes(x = manufacturer, fill = manufacturer)) +
  # divide each bar's count by the total number of cars
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(title = "Proportion of Cars by Manufacturer",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package",
       x = "Manufacturer",
       y = "Proportion of cars",
       fill = "Manufacturer") +
  scale_fill_viridis_d() +
  theme_minimal()

A bar chart of the proportion of cars by manufacturer, with titles, labels, legend, color, and minimal theme

geom_bar keeps its default "count" stat, and after_stat(count / sum(count)) divides the count of each manufacturer by the total number of cars. Setting position = "fill" would not work here: it rescales every bar to a height of 1, which is only informative when a second variable is mapped to fill.

The plot shows the proportion of cars by manufacturer in the mpg data set. The y-axis now shows proportions instead of counts, and the heights of all bars add up to 1. The bars keep the same relative heights as in the count chart.
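
The position argument becomes visible once a second variable is mapped to fill. The sketch below is an illustration only; the choice of class and drv is my own assumption.

```r
library(ggplot2)

# Stacked counts: each bar is split by drive type
p_stack <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "stack")

# Side-by-side bars for each drive type
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

# Shares within each class: every bar is rescaled to a height of 1
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of cars")

p_stack
p_dodge
p_fill
```

"stack" is good for totals, "dodge" for comparing groups directly, and "fill" for comparing shares when the totals differ a lot.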

Histograms

A histogram is a type of plot that shows the distribution of a continuous variable. To create a histogram with ggplot2, you can use the geom_histogram function. It works with the following aesthetics and arguments:

  • x: the variable that you want to map to the x-axis, which is usually a continuous variable
  • y: the variable mapped to the y-axis; by default it is the count of cases in each bin, and it can be switched to the density with after_stat(density)
  • stat: the statistical transformation applied to the data, which is "bin" by default (split the x variable into bins and count the cases in each bin)
  • binwidth: the width of the bins that you want to use for the histogram, which can be either a fixed value or a function that calculates the bin width from the data
  • position: the position adjustment that you want to apply to the bars, which can be "stack" (stack the bars on top of each other), "dodge" (place the bars side by side), or "identity" (place the bars at their original position)

For example, to create a histogram of the city miles per gallon (cty) of the cars in the mpg data set, you can use the following code:

ggplot(mpg, aes(x = cty)) +
  geom_histogram()

A histogram of the city miles per gallon of the cars

  • The first argument of the ggplot function is the data frame you want to use for your plot, in this case, mpg.
  • The second argument is the aesthetic mapping you want to use for your plot, which the aes function specifies. The aes function takes the variable you want to map to the x-axis, in this case, cty.

The + operator adds a geom function to the ggplot object, specifying the plot type you want to create. In this case, we use the geom_histogram function, which adds bars to the plot.

The geom_histogram function can also take additional arguments, such as the y variable, the stat, the bin width, and the position, but we leave them as default for now.

It is a basic histogram that shows the distribution of the city miles per gallon of the cars in the mpg data set. You can see that the x-axis shows the range of the cty variable, and the y-axis shows the number of cars in each bin. Because we did not set a bin width, ggplot2 falls back to its default of 30 bins and prints a message suggesting that you pick a better value with binwidth.

How to add title, label, legend, or color in the histogram using ggplot2

We can improve it by adding titles, labels, legends, themes, and colors to our plot, just like we did for the scatter plot and the bar chart in the previous sections. For example, to add a title, a subtitle, a caption, an x-axis and y-axis label, a color, and a minimal theme to our plot, we can use the following code:

# after_stat(x) is the center of each bin, computed by the histogram stat
ggplot(mpg, aes(x = cty, fill = after_stat(x))) +
  geom_histogram() +
  labs(title = "Histogram of City Miles per Gallon of Cars",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package",
       x = "City miles per gallon",
       y = "Number of cars",
       fill = "City miles per gallon") +
  scale_fill_viridis_c() +
  theme_minimal()

A histogram of the city miles per gallon of the cars, with titles, labels, legend, color, and minimal theme

The code above is similar to the code that we used for the scatter plot and the bar chart, except for two changes:

  • We map after_stat(x), the center of each bin, to the fill aesthetic in the aes function, which fills the bars with different colors according to the cty value they represent. Mapping fill = cty directly does not work, because the raw values are replaced by bin counts before the bars are drawn and the fill is dropped.
  • We use the scale_fill_viridis_c function instead of the scale_fill_viridis_d function, which uses a continuous color palette from the viridis family, suitable for continuous variables.

The plot looks much better than the previous one. It has a title, a subtitle, a caption, an x-axis and y-axis label, a legend, and a color that describes the plot and the data. It also has a minimal theme that makes the plot cleaner and simpler. You can also see that the bars have different colors according to the cty value, and the legend shows the range of the cty variable.

How to Change the stat and binwidth of the Histogram 

To change what the histogram shows, we can map a different computed value to the y-axis and set the binwidth argument in the geom_histogram function. For example, if we want to show the density of the city miles per gallon instead of the count and use a smaller bin width, we can use the following code:

# density and x are values computed by the histogram stat for each bin
ggplot(mpg, aes(x = cty, y = after_stat(density), fill = after_stat(x))) +
  geom_histogram(binwidth = 1) +
  labs(title = "Histogram of City Miles per Gallon of Cars",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package",
       x = "City miles per gallon",
       y = "Density of cars",
       fill = "City miles per gallon") +
  scale_fill_viridis_c() +
  theme_minimal()

A histogram of the city miles per gallon of the cars, with titles, labels, legend, color, and minimal theme, showing the density and using a smaller bin width

Mapping after_stat(density) to the y-axis replaces the count of each bin with its density, so that the areas of all bars add up to 1. Passing stat = "density" to geom_histogram is not the way to do this: it switches to a kernel density estimate and ignores binwidth.

The binwidth argument in the geom_histogram function specifies the width of the bins we want to use for the histogram. In this case, we use a fixed value of 1, meaning each bin has a width of 1 unit.

The plot shows the density of the city miles per gallon of the cars in the mpg data set using a smaller bin width. The y-axis now shows the density instead of the count, and the bars have different heights according to the density of each bin.
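
One way to draw a density-scaled histogram is to map after_stat(density) to the y-axis. As a quick sanity check, the sketch below pulls the computed bar values out of the plot with layer_data and confirms that the bar areas add up to 1. The object names p_density and bars are my own.

```r
library(ggplot2)

p_density <- ggplot(mpg, aes(x = cty, y = after_stat(density))) +
  geom_histogram(binwidth = 1)

# Inspect the values ggplot2 computed for each bar
bars <- layer_data(p_density)

# Area of a bar = density * bin width; the areas should total 1
sum(bars$density * (bars$xmax - bars$xmin))
```

Looking at layer_data output is a handy habit whenever a plot does not look the way you expected.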

Line Charts

A line chart is a type of plot that shows a continuous variable's trend over time or compares the trends of different categories over time. To create a line chart with ggplot2, you can use the geom_line function. The geom_line function takes the following arguments:

  • x: the variable that you want to map to the x-axis, which is usually a time variable
  • y: the variable that you want to map to the y-axis, which is usually a continuous variable
  • group: the variable that you want to use to group the lines, which is usually a categorical variable
  • color: the variable that you want to use to color the lines, which is usually the same as the group variable

For example, to create a line chart of the highway miles per gallon (hwy) of the cars in the mpg data set over the years, grouped and colored by the class of the cars, you can use the following code:

ggplot(mpg, aes(x = year, y = hwy, group = class, color = class)) +
  geom_line()

A line chart of the highway miles per gallon of the cars over the years, grouped and colored by the class of the cars

  • The first argument of the ggplot function is the data frame you want to use for your plot, in this case, mpg.
  • The second argument is the aesthetic mapping you want to use for your plot, which the aes function specifies. The aes function takes the variables you want to map to the plot elements, such as the x-axis, the y-axis, the group, and the color. In this case, we map the year variable to the x-axis, the hwy variable to the y-axis, the class variable to the group and the color.

The + operator adds a geom function to the ggplot object, specifying the plot type you want to create. In this case, we use the geom_line function, which adds lines to the plot. The geom_line function can also take additional arguments, such as the size, the line type, and the alpha of the lines, which can be either fixed values or mapped to variables.

It is a basic line chart of the highway miles per gallon of the cars in the mpg data set over the years, grouped and colored by the class of the cars. You can see that the x-axis shows the range of the year variable, and the y-axis shows the range of the hwy variable.

You can also see seven different classes of cars in the data set, each with a different color and line. Note that mpg contains only two model years, 1999 and 2008, with several cars per class in each year, so each line zigzags vertically at those two years before crossing the plot. You can still see that some classes have higher or lower miles per gallon than others; for a cleaner trend, you would average hwy for each class and year before plotting.
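
Because mpg has many cars per class in each of its two model years (1999 and 2008), a line through the raw rows is hard to read. One possible way to get a cleaner trend, sketched here under my own assumptions (averaging with base R's aggregate; the names hwy_means and p_trend are mine), is to plot one mean value per class and year:

```r
library(ggplot2)

# Average hwy for every class/year combination (base R)
hwy_means <- aggregate(hwy ~ class + year, data = mpg, FUN = mean)

p_trend <- ggplot(hwy_means, aes(x = year, y = hwy, group = class, color = class)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = unique(hwy_means$year))  # label only the years present

p_trend
```

With only two time points, each line is a single segment, so this is closer to a slope chart than to a time series.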

How to add titles, labels, legends, themes, and colors to a line chart

We can improve it by adding titles, labels, legends, themes, and colors to our plot, just like we did for the scatter plot, bar chart, and histogram in the previous sections. For example, to add a title, a subtitle, a caption, an x-axis and y-axis label, and a minimal theme to our plot, we can use the following code:

ggplot(mpg, aes(x = year, y = hwy, group = class, color = class)) +
  geom_line() +
  labs(title = "Line Chart of Highway Miles per Gallon of Cars over the Years",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package",
       x = "Year",
       y = "Highway miles per gallon",
       color = "Class of car") +
  theme_minimal()

The code above is similar to the code that we used for the scatter plot, the bar chart, and the histogram, except for one change: we add a color argument to the labs function, which adds a label to the color legend.

A line chart of the highway miles per gallon of the cars over the years, grouped and colored by the class of the cars, with titles, labels, legend, and minimal theme

The plot looks much better than the previous one. It has a title, a subtitle, a caption, an x-axis and y-axis label, and a legend that describes the plot and the data. It also has a minimal theme that makes the plot cleaner and simpler.

To change the color palette of the plot, we can use one of the scale_color_*() functions in the ggplot2 package, just like we did for the scatter plot, the bar chart, and the histogram. For example, if we want to use a color palette from the RColorBrewer package, we can use the following code:

ggplot(mpg, aes(x = year, y = hwy, group = class, color = class)) +
  geom_line() +
  labs(title = "Line Chart of Highway Miles per Gallon of Cars over the Years",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package",
       x = "Year",
       y = "Highway miles per gallon",
       color = "Class of car") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal()

A line chart of the highway miles per gallon of the cars over the years, grouped and colored by the class of the cars, with titles, labels, legend, color, and minimal theme, using an RColorBrewer color palette

The scale_color_brewer function uses a color palette from the RColorBrewer package, a set of palettes designed by Cynthia Brewer for thematic maps. The palette argument specifies the name of the palette that we want to use, which can be either a character string or a number. We use the "Set1" palette of nine distinct colors in this case. 

The plot now uses a palette with more distinct, higher-contrast colors than the default, and the legend has changed accordingly.
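To make the two ways of choosing a palette concrete, here is a small sketch that picks one palette by name and one by number. The "Dark2" palette and the object names are my own choices for illustration; the last line reads the maximum size of "Set1" from RColorBrewer's metadata table.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = year, y = hwy, group = class, color = class)) +
  geom_line() +
  theme_minimal()

# Palette chosen by name.
p_named <- base_plot + scale_color_brewer(palette = "Dark2")

# Palette chosen by number within a type ("seq", "div", or "qual").
p_indexed <- base_plot + scale_color_brewer(type = "qual", palette = 2)

# Maximum number of colors in Set1, from RColorBrewer's own metadata.
set1_max <- RColorBrewer::brewer.pal.info["Set1", "maxcolors"]
```

Checking the palette size matters here because the class variable has seven levels, and a palette with fewer colors would leave some classes uncolored.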

Box Plots

A box plot shows the summary statistics of a continuous variable or compares those summary statistics across different categories. To create a box plot with ggplot2, you can use the geom_boxplot function. The geom_boxplot function understands the following aesthetics, which you map inside the aes function:

  • x: the variable that you want to map to the x-axis, which is usually a categorical variable
  • y: the variable that you want to map to the y-axis, which is usually a continuous variable
  • color: the variable that you want to use to color the outlines of the boxes, which is usually the same as the x variable
  • fill: the variable that you want to use to fill the inside of the boxes, which is usually the same as the x variable

For example, to create a box plot of the highway miles per gallon (hwy) of the cars in the mpg data set, grouped and colored by the class of the cars, you can use the following code:

ggplot(mpg, aes(x = class, y = hwy, color = class, fill = class)) +
  geom_boxplot()

A box plot of the highway miles per gallon of the cars, grouped and colored by the class of the cars

  • The first argument of the ggplot function is the data frame you want to use for your plot, in this case, mpg. 
  • The second argument is the aesthetic mapping you want to use for your plot, which the aes function specifies. The aes function takes the variables you want to map to the plot elements, such as the x-axis, the y-axis, the color, and the fill. In this case, we map the class variable to the x-axis, the color, and the fill, and the hwy variable to the y-axis.

The + operator adds a geom function to the ggplot object, specifying the plot type you want to create. In this case, we use the geom_boxplot function, which adds boxes to the plot. The geom_boxplot function can also take additional arguments, such as the width, the notch, and the outlier.shape of the boxes. These are fixed settings that you pass directly to geom_boxplot rather than variables that you map inside aes.
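As a small sketch of those extra arguments, the code below passes a few fixed settings to geom_boxplot. The specific values are arbitrary choices for illustration, and the alpha setting is my own addition: with color and fill mapped to the same variable, a semi-transparent fill keeps the median line visible.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x = class, y = hwy, color = class, fill = class)) +
  geom_boxplot(
    width = 0.5,        # narrower boxes
    notch = FALSE,      # set to TRUE to draw notches around the median
    alpha = 0.4,        # semi-transparent fill so the median line stays visible
    outlier.shape = 1   # hollow circles for the outliers
  )
```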

It is a basic box plot that shows the summary statistics of the highway miles per gallon of the cars in the mpg data set, grouped and colored by the class of the cars. You can see that the x-axis shows the seven different classes of cars in the data set, and the y-axis shows the range of the hwy variable. You can also see that each class has a different color and box.

A box plot consists of five components:

  1. The lower whisker: extends from the lower hinge down to the smallest value that lies no more than 1.5 times the interquartile range (IQR) below it
  2. The lower hinge: the 25th percentile or the lower quartile of the data
  3. The middle line: the 50th percentile or the median of the data
  4. The upper hinge: the 75th percentile or the upper quartile of the data
  5. The upper whisker: extends from the upper hinge up to the largest value that lies no more than 1.5 times the IQR above it

You can also see that some classes have outliers, which are values beyond the whiskers. Dots represent the outliers.
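The five components can be worked out by hand for a single class. The sketch below follows the definitions in the list above using base R's quantile with its default settings; the variable names are illustrative, and ggplot2's own computation may differ in edge cases.

```r
library(ggplot2)  # loaded only for the mpg data set

# Highway mpg for one class.
x <- mpg$hwy[mpg$class == "minivan"]

# Lower hinge, median, and upper hinge.
hinges <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- hinges[3] - hinges[1]

# Limits beyond which a value counts as an outlier.
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[3] + 1.5 * iqr

box_parts <- c(
  lower_whisker = min(x[x >= lower_fence]),
  lower_hinge   = hinges[1],
  median        = hinges[2],
  upper_hinge   = hinges[3],
  upper_whisker = max(x[x <= upper_fence])
)

# Values drawn as dots beyond the whiskers.
outliers <- x[x < lower_fence | x > upper_fence]
```

Printing box_parts and outliers shows the numbers behind one box in the plot.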

A box plot can help you to compare the distribution of a continuous variable across different categories and to identify the outliers, the skewness, and the variability of the data. For example, from the plot above, you can see that:

  • The compact and midsize classes have the highest median highway miles per gallon, followed by the subcompact and 2seater classes, while the pickup and suv classes have the lowest.
  • The suv and compact classes show several outliers above the upper whisker, while the 2seater class shows none.
  • The compact and subcompact classes have right-skewed distributions, with upper tails that reach 44 miles per gallon, while the midsize class looks roughly symmetric.
  • The subcompact class has the tallest box (the largest IQR), while the values of the 2seater class span the narrowest range.
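Readings taken from a box plot are easy to get wrong by eye, so it can help to check them against the numbers. The sketch below tabulates the median, the IQR, and a count of outliers per class using base R; count_outliers is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)  # loaded only for the mpg data set

# Hypothetical helper: count values beyond 1.5 * IQR from the quartiles.
count_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Highway mpg split into one vector per class.
by_class <- split(mpg$hwy, mpg$class)

class_summary <- data.frame(
  class      = names(by_class),
  median     = sapply(by_class, median),
  iqr        = sapply(by_class, IQR),
  n_outliers = sapply(by_class, count_outliers),
  row.names  = NULL
)

# Sort from the highest to the lowest median.
class_summary <- class_summary[order(-class_summary$median), ]
```

Printing class_summary gives one row per class, which can be compared directly with the boxes in the plot.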

However, the plot could still be more informative and attractive. We can improve it by adding titles, labels, legends, themes, and colors to our plot, just like we did for the previous sections' scatter plot, bar chart, and histogram. For example, to add a title, a subtitle, a caption, an x-axis and y-axis label, and a minimal theme to our plot, we can use the following code:

ggplot(mpg, aes(x = class, y = hwy, color = class, fill = class)) +
  geom_boxplot() +
  labs(title = "Box Plot of Highway Miles per Gallon of Cars by Class",
       subtitle = "Data from the mpg dataset",
       caption = "Source: ggplot2 package",
       x = "Class of car",
       y = "Highway miles per gallon",
       color = "Class of car",
       fill = "Class of car") +
  theme_minimal()

A box plot of the highway miles per gallon of the cars, grouped and colored by the class of the cars, with titles, labels, legend, and minimal theme

The code above is similar to the code that we used for the scatter plot, the bar chart, and the histogram, except for two changes. First, we add a color and a fill argument to the labs function, which set the titles of the color and fill legends. Because both legends share the same title and labels, ggplot2 merges them into one. Second, we do not change the color palette of the plot because the default color palette of ggplot2 is suitable for box plots.

The plot looks much better than the previous one. It has a title, a subtitle, a caption, an x-axis and y-axis label, and a legend that describes the plot and the data. It also has a minimal theme that makes the plot cleaner and simpler.

Conclusion

In this article, I have shown you how to create and customize various plots with ggplot2, such as scatter plots, bar charts, histograms, line charts, and box plots. 

I hope this article has helped you learn more about data visualization in R and appreciate the power and flexibility of ggplot2. ggplot2 is a great package for creating beautiful and informative graphics to help you explore, understand, and communicate your data.

Frequently Asked Questions (FAQs)

What is data visualization?

Data visualization is creating graphical representations of data, such as charts, graphs, maps, and diagrams, to communicate information, patterns, and insights effectively and efficiently.

What is ggplot2?

ggplot2 is a popular and powerful package for data visualization in R, which is based on the grammar of graphics, a framework that describes the components and rules of creating graphics.

What is a geom in ggplot2?

A geom is a function that specifies the type of plot element that you want to create, such as points, lines, bars, or boxes. Each geom has its own arguments and aesthetics that control the appearance and behaviour of the plot element.

What is an aesthetic in ggplot2?

An aesthetic is a property of a plot element that can be mapped to a variable in the data, such as x, y, colour, size, shape, etc. An aesthetic mapping is specified by the aes function, which takes the aesthetic's name and the variable's name as arguments.

What is a stat in ggplot2?

A stat is a function that specifies the statistical transformation that you want to apply to the data before plotting, such as count, density, mean, median, etc. Each stat has arguments and outputs that control the transformation's calculation and output.

What is a layer in ggplot2?

A layer combines a geom, a stat, a position, and a set of mappings and parameters that define a single plot element. The layer function creates a layer from the data, mapping, stat, geom, position, and other arguments. In practice, you rarely call it directly, because each geom function creates a layer for you.
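To show how a geom, a stat, and a position fit together in one layer, the sketch below builds the same scatter plot twice: once by calling layer explicitly and once with the usual geom_point shorthand. The choice of the displ and hwy variables is only for illustration.

```r
library(ggplot2)

# Spelled out with layer(): geom, stat, and position are named explicitly.
p_layer <- ggplot(mpg, aes(x = displ, y = hwy)) +
  layer(geom = "point", stat = "identity", position = "identity")

# The usual shorthand, which fills in the same defaults.
p_geom <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
```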

What is a position in ggplot2?

A position specifies the position adjustment you want to apply to the plot elements when they overlap, such as stack, dodge, fill, jitter, etc. Each position has arguments and effects that control the plot elements' adjustment and appearance.

What is a theme in ggplot2?

A theme is a function that specifies the non-data elements of the plot, such as the title, the labels, the legend, the axes, the margins, and the background. Each theme has arguments and values controlling the plot's appearance and style.

What is Plotly?

plotly is a package for creating interactive web-based graphics with R. It is a wrapper for plotly.js, a JavaScript library for interactive graphics. plotly provides two main functions to create interactive plots: ggplotly, which converts an existing ggplot2 object, and plot_ly, which builds a plot directly from the data.
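A minimal sketch of both functions follows, assuming the plotly package is installed; the guard skips the interactive part if it is not. The variables and object names are chosen only for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# plotly is optional here; this part only runs if the package is installed.
if (requireNamespace("plotly", quietly = TRUE)) {
  # Convert an existing ggplot2 object into an interactive plot.
  interactive_from_ggplot <- plotly::ggplotly(p)

  # Build a comparable plot directly from the data, without ggplot2.
  interactive_direct <- plotly::plot_ly(
    mpg, x = ~displ, y = ~hwy, color = ~class,
    type = "scatter", mode = "markers"
  )
}
```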

What is sf?

sf is a package for handling and manipulating spatial data in R, which is compatible with the tidyverse and ggplot2. Two functions cover the basic workflow for creating maps: st_read from sf reads spatial data into R, and geom_sf from ggplot2 draws it.
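As a brief sketch of that workflow, the code below reads the North Carolina counties shapefile that ships with the sf package and draws it with geom_sf. It assumes sf is installed, and filling by the AREA column is only an illustrative choice.

```r
library(ggplot2)

# sf is optional here; this part only runs if the package is installed.
if (requireNamespace("sf", quietly = TRUE)) {
  # Example shapefile shipped with the sf package (North Carolina counties).
  nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

  # Draw the county polygons, filled by their AREA column.
  p_map <- ggplot(nc) +
    geom_sf(aes(fill = AREA)) +
    theme_minimal()
}
```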





About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.
