Key points
 A contingency table is a way to show how often different categories of two or more variables occur together.
 You can make a two-way contingency table in R with the table() function, or with xtabs() when the counts are already stored in a frequency column. You can also add totals and proportions with the addmargins() and prop.table() functions.
 You can make a mosaic plot in R with the plot() function. A mosaic plot is a picture of a contingency table in which the area of each tile is proportional to the frequency of that combination of categories.
 You can make a three-way contingency table in R with the ftable() function. A three-way contingency table shows how three categorical variables are jointly distributed.
 You can test whether two variables are independent in R with the chisq.test() function. It compares how often each combination actually occurs with how often it would be expected to occur if the variables were independent.
 You can measure how strong the association between two variables is with the phi coefficient or Cramer’s V, both computed from the chi-squared statistic. Each gives a number between 0 and 1, where 0 means no association and 1 means a perfect association.
Function            Description
data()              Loads a built-in dataset in R
str()               Displays the structure of an object
table()             Creates a contingency table by counting combinations of factor levels
addmargins()        Adds row and column sums to a contingency table
prop.table()        Converts frequencies in a contingency table into proportions
plot()              Creates a mosaic plot from a table object
ftable()            Creates a flat contingency table from a table or a set of factors
as.data.frame()     Converts a table or array into a data frame
chisq.test()        Performs a chi-squared test of independence on a table object
cor()               Computes correlations between numeric variables; the phi coefficient and Cramer’s V are instead derived from the chisq.test() statistic
Hi, I am Zubair Goraya, a PhD Scholar, Certified data analyst and freelancer with 5 years of experience. I’m also a contributor to Data Analysis, a website that provides tutorials related to Rstudio.
I am writing this article based on my PhD research paper, where I faced many challenges in analysing categorical data and found contingency tables useful for exploring and testing the relationship between variables. In this article, I’ll show you how to create and interpret a contingency table in R.
What is a Contingency table?
A contingency table, also known as a cross-tabulation or crosstab, is a table that displays the frequency distribution of the categories of two or more variables. It can help you to summarize and compare the proportions of different groups, test hypotheses about the independence or association of the variables, and measure the strength of the relationship.
Why use contingency tables?
Contingency tables are a simple and effective way to summarize and compare the frequency distribution of two or more categorical variables. They:
 Help you to identify patterns and trends in the data
 Help you to test hypotheses about the independence or association of the variables
 Help you to measure the strength of the relationship between the variables
 Help you to visualise the data using mosaic plots or other graphical methods
How to Create a One-Way Contingency Table in R
For this tutorial, I’ll use the Titanic dataset that is available in R. This dataset summarizes the people on board the Titanic by four categorical variables: class (1st, 2nd, 3rd, Crew), sex, age (Child, Adult), and survival status. You can load this dataset by using the data() function. It will create a four-dimensional table (array) named Titanic in your workspace. You can view its structure by typing str(Titanic).
data(Titanic)
str(Titanic)
The output shows that Titanic is a 4 x 2 x 2 x 2 table of counts whose dimensions are Class, Sex, Age, and Survived.
You can access any slice of this array using square brackets and specifying your desired levels. You can also convert it to a data frame and cross-tabulate the counts with xtabs(). For example, if you want to see the frequencies of survival by sex, class, and age, you can type:
xtabs(Freq ~ Sex + Class + Age + Survived, data = as.data.frame(Titanic))
The output is printed as four Sex-by-Class sections, one for each combination of age and survival status: Child/No, Adult/No, Child/Yes, and Adult/Yes.
In the "Child, Survived = No" section, no children in the 1st or 2nd class died, while 35 male and 17 female children in the 3rd class did not survive. There were no children among the crew.
In the "Adult, Survived = No" section, adult males account for most of the deaths in every class, with the largest count among the crew (670). For adult females, the numbers are much lower, especially in the 1st class (4) and 2nd class (13).
In the "Child, Survived = Yes" section, some children survived in every passenger class. The counts are lower than for adults because there were far fewer children on board, and the highest count is in the 3rd class (13 males and 14 females).
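A one-way table for a single variable can be obtained from the same object by summing over the other dimensions. The following is a minimal sketch rather than the only way to do it; the object names (titanic_tab, survived_counts, and so on) are illustrative, and datasets::Titanic is used so the example does not depend on what is currently stored under the name Titanic in your workspace.

```r
# Work from the built-in 4-D table of counts.
titanic_tab <- datasets::Titanic

# One-way table: totals for a single variable (Survived, dimension 4),
# summing over Class, Sex, and Age.
survived_counts <- margin.table(titanic_tab, margin = 4)
print(survived_counts)

# The same one-way table from the data-frame form, weighting by Freq.
titanic_df <- as.data.frame(titanic_tab)
survived_xtabs <- xtabs(Freq ~ Survived, data = titanic_df)

# Square brackets select a slice by level name:
# adult women in 1st class, split by survival status.
first_class_women <- titanic_tab["1st", "Female", "Adult", ]
print(first_class_women)
```

Both approaches give the same totals; margin.table() works on the array directly, while xtabs() works on the data frame.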
How to Create a Two-Way Contingency Table in R
A two-way contingency table shows the joint frequency distribution of two categorical variables. For example, to see how sex and survival status are related in the Titanic dataset, first convert the array to a data frame and then cross-tabulate it. Because the data frame has one row per combination of levels, with the number of people stored in the Freq column, use xtabs() with Freq on the left-hand side of the formula. Calling table() on the Sex and Survived columns would only count rows and return 8 in every cell.
Titanic <- as.data.frame(Titanic)
xtabs(Freq ~ Sex + Survived, data = Titanic)
The output will look something like this:
        Survived
Sex        No  Yes
  Male   1364  367
  Female  126  344
Of the 1,731 males on board, 1,364 did not survive and 367 survived. Of the 470 females, 126 did not survive and 344 survived. Survival was therefore far more common among females than among males.
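When the data are stored in aggregated form (one row per combination of levels plus a Freq column, as returned by as.data.frame() on the Titanic table), it is worth checking whether a function counts rows or sums frequencies. The sketch below compares the two; the object names are illustrative, not part of the dataset.

```r
titanic_df <- as.data.frame(datasets::Titanic)

# Weighted counts: each row contributes its Freq value.
weighted_tab <- xtabs(Freq ~ Sex + Survived, data = titanic_df)

# Equivalent result straight from the 4-D table:
# keep dimensions 2 (Sex) and 4 (Survived) and sum over the rest.
array_tab <- margin.table(datasets::Titanic, margin = c(2, 4))

# Unweighted counts: table() only counts rows, so every cell is 8
# (32 rows spread over 4 Sex-by-Survived combinations), not people.
row_counts <- table(titanic_df[, c("Sex", "Survived")])

print(weighted_tab)
print(row_counts)
```

table() is the right tool when each row of the data frame is one individual; with a frequency column, xtabs() or margin.table() gives the actual counts.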
How to Add Margins and Proportions to a Contingency Table
You can add the row and column sums to a contingency table using the addmargins() function. This can help you to compare the frequencies of different groups more easily. For example, if you want to add margins to the previous contingency table, you can type:
addmargins(xtabs(Freq ~ Sex + Survived, data = Titanic))
The output will look something like this:
        Survived
Sex        No  Yes  Sum
  Male   1364  367 1731
  Female  126  344  470
  Sum    1490  711 2201
The table breaks survival down by sex and adds the totals. Of the 2,201 people on board, 1,490 did not survive and 711 survived. Males make up 1,731 of the total and females 470.
The margins make the imbalance easy to see: most of the people on board were male, and most of those who did not survive were male (1,364 of 1,490). However, this analysis does not consider other variables that affected survival, such as class and age.
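addmargins() also takes a margin argument when only one set of totals is wanted. A short sketch, assuming the counts come from the Freq column of the data-frame form (the object names are illustrative):

```r
titanic_df <- as.data.frame(datasets::Titanic)
sex_survival <- xtabs(Freq ~ Sex + Survived, data = titanic_df)

# margin = 1 appends a "Sum" row (column totals only).
with_col_totals <- addmargins(sex_survival, margin = 1)

# margin = 2 appends a "Sum" column (row totals only).
with_row_totals <- addmargins(sex_survival, margin = 2)

print(with_col_totals)
print(with_row_totals)
```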
Proportions or Percentages in a Contingency Table
You can also convert the frequencies in a contingency table into proportions or percentages by using the prop.table() function. This can help you to compare the relative frequencies of different groups more easily. For example, if you want to see the proportions of survival by sex, you can type:
prop.table(xtabs(Freq ~ Sex + Survived, data = Titanic))
The output gives each cell as a proportion of all 2,201 people on board: about 62.0% were males who did not survive, 16.7% were males who survived, 5.7% were females who did not survive, and 15.6% were females who survived.
Proportions by Row, Column, or Total using prop.table()
You can also specify whether you want to see the proportions by row, column, or total by using the margin argument in the prop.table() function. For example, if you want to see the proportions of survival within each sex group, you can type:
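a <- prop.table(xtabs(Freq ~ Sex + Survived, data = Titanic), margin = 1)
addmargins(a, margin = 2)
The output shows proportions that sum to 1 within each row: roughly 0.79 (No) and 0.21 (Yes) for males, and 0.27 (No) and 0.73 (Yes) for females. Passing margin = 2 to addmargins() appends only the row totals, because column sums of row proportions are not meaningful.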
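Column proportions and rounded percentages follow the same pattern. This is a small sketch under the same assumption as before (counts taken from the Freq column); col_props and row_pct are illustrative names.

```r
titanic_df <- as.data.frame(datasets::Titanic)
sex_survival <- xtabs(Freq ~ Sex + Survived, data = titanic_df)

# margin = 2: proportions within each Survived column (columns sum to 1).
col_props <- prop.table(sex_survival, margin = 2)

# Row percentages rounded to one decimal place.
row_pct <- round(100 * prop.table(sex_survival, margin = 1), 1)

print(col_props)
print(row_pct)
```

Row proportions answer "what share of each sex survived?", while column proportions answer "what share of survivors were male or female?".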
How to Create a Mosaic Plot of a Contingency Table
A mosaic plot is a graphical representation of a contingency table that shows the frequencies or proportions of each combination of levels as rectangles with areas proportional to their values. It can help you to visualize the relationship between two or more categorical variables more easily.
To create a mosaic plot in R, you can use the plot() function with a table object as an argument. For example, if you want to create a mosaic plot of survival by sex, you can type:
plot(xtabs(Freq ~ Sex + Survived, data = Titanic), main = "Mosaic plot of a Contingency table")
Customize the Appearance of the Mosaic plot
You can also customize the appearance of your mosaic plot by using additional arguments in the plot() function, such as main (title), xlab (x-axis label), ylab (y-axis label), color (fill colors), border (border color), and shade (shading based on the chi-squared residuals, which is used instead of color when set to TRUE). For example, if you want to create a mosaic plot with a title, axis labels, different colors, and no borders, you can type:
plot(xtabs(Freq ~ Sex + Survived, data = Titanic), main = "Survival by Sex on Titanic", xlab = "Sex", ylab = "Survived", color = c("pink", "lightblue"), border = NA)
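Residual-based shading can be sketched by calling mosaicplot() directly, which is the function plot() dispatches to for a two-way table. The example assumes the counts are taken from the Freq column of the data-frame form; the title text is illustrative.

```r
titanic_df <- as.data.frame(datasets::Titanic)
sex_survival <- xtabs(Freq ~ Sex + Survived, data = titanic_df)

# shade = TRUE colours each tile by its residual from the independence
# model, so it takes the place of a fill given through `color`.
mosaicplot(sex_survival,
           main = "Survival by Sex on Titanic (residual shading)",
           xlab = "Sex", ylab = "Survived",
           shade = TRUE)
```

Tiles with large positive or negative residuals mark the combinations that occur more or less often than independence would predict.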
How to Create a Three-Way Contingency Table in R
A three-way contingency table shows the joint frequency distribution of three categorical variables. For example, if you want to see how survival status, sex, and class are related in the Titanic dataset, you can build the table with xtabs() and flatten it with the ftable() function:
ftable(xtabs(Freq ~ Class + Sex + Survived, data = Titanic))
The output will look something like this:
             Survived  No Yes
Class Sex
1st   Male            118  62
      Female            4 141
2nd   Male            154  25
      Female           13  93
3rd   Male            422  88
      Female          106  90
Crew  Male            670 192
      Female            3  20
The table presents the survival counts on the Titanic, categorized by class and sex. Survival clearly differed across these groups. Among women, 141 of 145 in the 1st class and 93 of 106 in the 2nd class survived, compared with 90 of 196 in the 3rd class.
Among men, survival was much lower in every class: 62 of 180 in the 1st class, 25 of 179 in the 2nd class, 88 of 510 in the 3rd class, and 192 of 862 among the crew.
These differences suggest that both sex and class were related to the chance of survival. The table itself shows counts only, so the pattern should be interpreted with caution and in context. Age is not shown here, and a formal test is needed before drawing conclusions about the association.
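ftable() also lets you choose which variables go in the rows and which in the columns through its row.vars and col.vars arguments. A minimal sketch, assuming the counts are taken from the Freq column of the data-frame form (three_way, flat, and long are illustrative names):

```r
titanic_df <- as.data.frame(datasets::Titanic)
three_way <- xtabs(Freq ~ Class + Sex + Survived, data = titanic_df)

# Put Class in the rows and every Sex-by-Survived combination in the columns.
flat <- ftable(three_way, row.vars = "Class", col.vars = c("Sex", "Survived"))
print(flat)

# Convert the three-way table back to long format
# (one row per combination of levels, counts in Freq).
long <- as.data.frame(three_way)
head(long)
```

The layout changes only the presentation; the underlying counts are the same in every arrangement.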
How to Test the Independence of Two Categorical Variables
One of the most common questions that arise when analyzing a contingency table is whether the two categorical variables are independent or not. Independence means that there is no relationship between the variables and that the frequency distribution of one variable does not depend on the value of the other variable. For example, if sex and survival status are independent on the Titanic, then we would expect that the proportion of survivors would be the same for males and females.
To test the independence of two categorical variables, we can use the chi-squared test of independence. It compares the observed frequencies in a contingency table with the expected frequencies under the assumption of independence and calculates a chi-squared statistic and a p-value.
The chi-squared statistic measures how much the observed frequencies deviate from the expected frequencies, and the p-value is the probability of observing a deviation at least this large if the variables were truly independent. If the p-value is less than a significance level (usually 0.05), then we can reject the null hypothesis of independence and conclude that there is a significant association between the variables.
To perform a chi-squared test of independence in R, you can use the chisq.test() function with a table object as an argument. For example, if you want to test whether sex and survival status are independent on the Titanic, you can type:
chisq.test(xtabs(Freq ~ Sex + Survived, data = Titanic))
The key line of the output will look something like this:
X-squared = 454.5, df = 1, p-value < 2.2e-16
This shows that the chi-squared statistic is very large (about 454.5 on 1 degree of freedom, with the continuity correction that R applies to 2 x 2 tables by default) and the p-value is far below 0.05. It means we reject the null hypothesis of independence and conclude that there is a significant association between sex and survival status on the Titanic.
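To see where the statistic comes from, the expected counts and the chi-squared sum can be reproduced by hand. This is a sketch for illustration, with the continuity correction switched off so that the result matches the textbook formula; expected_manual and chi2_manual are illustrative names.

```r
titanic_df <- as.data.frame(datasets::Titanic)
sex_survival <- xtabs(Freq ~ Sex + Survived, data = titanic_df)

# correct = FALSE turns off Yates' continuity correction so the statistic
# equals sum((observed - expected)^2 / expected).
test <- chisq.test(sex_survival, correct = FALSE)

# Expected counts under independence: row total * column total / grand total.
expected_manual <- outer(rowSums(sex_survival), colSums(sex_survival)) /
  sum(sex_survival)

# Chi-squared statistic computed from observed and expected counts.
chi2_manual <- sum((sex_survival - expected_manual)^2 / expected_manual)

print(test$expected)
print(chi2_manual)
```

Comparing test$expected with the observed table shows which cells drive the result: far more females survived, and far more males died, than independence would predict.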
How to Measure the Strength and Direction of the Association Between Two Categorical Variables
Another question when analyzing a contingency table is how strong the relationship between two categorical variables is and, where it makes sense, in what direction it runs. Strength means how closely the categories of one variable are tied to the categories of the other. Direction is only meaningful when the categories can be ordered or coded as 0 and 1.
For example, if sex and survival status are strongly associated on the Titanic, then knowing a person’s sex helps us predict their survival status better than guessing at random. With 0/1 coding, a positive association means that the higher-coded category of one variable (e.g., female) goes with the higher-coded category of the other (e.g., survived), and a negative association means the opposite.
To measure the strength of the association between two categorical variables, we can use measures of association such as the phi coefficient and Cramer’s V. These measures are based on the chi-squared statistic and range from 0 to 1, where 0 indicates no association, and 1 indicates a perfect association. Because they are never negative, the direction of the association has to be read from the contingency table itself, for example by comparing row proportions.
The cor() function is meant for numeric variables and does not return the phi coefficient or Cramer’s V when given a table. Instead, you can compute the phi coefficient from the uncorrected chi-squared statistic as the square root of the statistic divided by the total count. For sex and survival status on the Titanic, you can type:
tab <- xtabs(Freq ~ Sex + Survived, data = Titanic)
sqrt(chisq.test(tab, correct = FALSE)$statistic / sum(tab))
The result is about 0.46, which indicates a moderately strong association. For a 2 x 2 table, Cramer’s V has the same value.
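For tables larger than 2 x 2, Cramer’s V divides the chi-squared statistic by the total count times min(rows - 1, columns - 1). The helper below, cramers_v(), is a hypothetical function written for this sketch and is not part of base R; the vcd package’s assocstats() reports the same measures if that package is available.

```r
# Hypothetical helper: Cramer's V from a two-way table of counts.
cramers_v <- function(tab) {
  # Uncorrected chi-squared statistic, stripped of its name.
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(dim(tab)) - 1   # min(rows - 1, columns - 1)
  sqrt(chi2 / (n * k))
}

titanic_df <- as.data.frame(datasets::Titanic)

# 2 x 2 table: V equals the phi coefficient.
v_sex <- cramers_v(xtabs(Freq ~ Sex + Survived, data = titanic_df))

# 4 x 2 table: class by survival.
v_class <- cramers_v(xtabs(Freq ~ Class + Survived, data = titanic_df))

print(c(sex = v_sex, class = v_class))
```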
What analysis can we do after the contingency table?
Some of the analyses that you can perform after building a contingency table in R are:
 You can calculate descriptive statistics, such as mean, median, standard deviation, variance, range, etc., for any numeric variables within each group using the summary() function or describe() (from the psych package).
 You can perform inferential statistics, such as t-tests, ANOVA, chi-squared tests, correlation tests, etc., to compare the means or proportions of different groups or test the significance of the relationship between variables using the t.test(), aov(), chisq.test(), cor.test(), etc., functions.
 You can perform multivariate analysis, such as logistic regression, decision trees, cluster analysis, etc., to model the relationship between one or more dependent variables and one or more independent variables using the glm(), rpart(), kmeans(), etc., functions.
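As one illustration of the multivariate step, a logistic regression can be fitted to the aggregated Titanic counts by passing the Freq column as weights. This is a sketch of one reasonable model under that assumption, not a recommended final analysis; the choice of predictors is illustrative.

```r
titanic_df <- as.data.frame(datasets::Titanic)

# Logistic regression of survival on sex and class.
# weights = Freq makes each row count as Freq individuals.
fit <- glm(Survived ~ Sex + Class, family = binomial,
           weights = Freq, data = titanic_df)

# Exponentiated coefficients are odds ratios relative to the
# reference levels (Male and 1st class).
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

An odds ratio above 1 means higher odds of survival than the reference group, holding the other variable fixed; below 1 means lower odds.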
Tips and best practices
Here are some tips and best practices for creating and interpreting a contingency table in R:
 Always check the quality and validity of your data before creating a contingency table. Look for missing values, outliers, errors, inconsistencies, etc., and deal with them appropriately using the na.omit(), boxplot(), is.na(), etc., functions.
 Always choose the appropriate level of measurement for your variables. Use nominal variables for categories with no inherent order or rank, such as sex or colour. Use ordinal variables for categories with a natural order or rank, such as class or age group. Keep variables measured on a numerical scale, such as income or height, as numeric, and bin them into categories (for example with cut()) before tabulating. Use the factor() and ordered() functions to create nominal and ordinal variables.
 Always choose the appropriate type and size of contingency table for your analysis. Use a two-way contingency table for two categorical variables, a three-way contingency table for three categorical variables, and so on. Avoid tables with so many cells that they become difficult to read or interpret. Use the ftable() function to create flat contingency tables that are easier to display and manipulate.
 Always interpret your contingency table with caution and context. Do not make causal claims based on association alone. Consider other factors that may influence or confound the relationship between variables. Use appropriate measures of association and significance tests to support your conclusions. Report your results clearly and accurately using proper notation and terminology.
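The measurement-level advice can be sketched with a few made-up vectors. All values below are hypothetical and only illustrate factor(), ordered(), and cut(); they are not taken from the Titanic data.

```r
# Nominal variable: categories without an order.
sex <- factor(c("Male", "Female", "Female", "Male"))

# Ordinal variable: levels listed from lowest to highest.
travel_class <- ordered(c("3rd", "1st", "2nd", "3rd"),
                        levels = c("1st", "2nd", "3rd"))

# Numeric variable binned into categories (hypothetical ages).
# right = FALSE gives intervals [0, 18), [18, 60), [60, Inf).
age <- c(4, 35, 61, 17)
age_group <- cut(age, breaks = c(0, 18, 60, Inf),
                 labels = c("Child", "Adult", "Senior"), right = FALSE)

# useNA = "ifany" keeps missing values visible in the table.
table(age_group, useNA = "ifany")
```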
Conclusion
In this article, you learned how to:
 Create a two-way contingency table in R using the table() function, or xtabs() when the counts are stored in a frequency column
 Add margins and proportions to a contingency table using the addmargins() and prop.table() functions
 Create a mosaic plot of a contingency table using the plot() function
 Create a three-way contingency table in R using the ftable() function
 Test the independence of two categorical variables using the chisq.test() function
 Measure the strength of the association between two categorical variables using the phi coefficient and Cramer’s V
Frequently Asked Questions (FAQs)
What is the difference between a two-way and a three-way contingency table?
A two-way contingency table shows the frequency distribution of two categorical variables, while a three-way contingency table shows the frequency distribution of three categorical variables.
What is the difference between the phi coefficient and Cramer’s V?
The phi coefficient and Cramer’s V are both measures of association between two categorical variables based on the chi-squared statistic, but they differ in how they adjust for the size of the contingency table. The phi coefficient is the square root of the chi-squared statistic divided by the total count, while Cramer’s V also divides by the smaller of (rows - 1) and (columns - 1). The two are equal when the table has only two rows or only two columns, and Cramer’s V is smaller than the phi coefficient when the table has more than two rows and more than two columns.
How can I interpret the p-value of the chi-squared test of independence?
The p-value of the chi-squared test of independence is the probability of observing a deviation from the expected frequencies at least as large as the one in your data if the variables were truly independent. If the p-value is less than a significance level (usually 0.05), then you can reject the null hypothesis of independence and conclude that there is a significant association between the variables. If the p-value is greater than or equal to the significance level, then you cannot reject the null hypothesis of independence, which means the data do not provide enough evidence of an association. It does not prove that the variables are independent.
How can I interpret the sign and magnitude of the correlation coefficient?
Cramer’s V, and the phi coefficient when it is computed from the chi-squared statistic, are never negative, so they describe only the strength of the association between two categorical variables: 0 indicates no association, and 1 indicates a perfect association. A sign is meaningful only for a 2 x 2 table or for ordered categories. In that case a signed phi coefficient ranges from -1 to 1, and its sign depends on how the categories are ordered: positive means that the higher-coded category of one variable goes with the higher-coded category of the other, and negative means the opposite. To see the direction, compare the row or column proportions in the contingency table.
How can I create and interpret a contingency table with more than three categorical variables?
You can create a contingency table with more than three categorical variables by passing all of them to xtabs() or table() and flattening the result with ftable(), using its row.vars and col.vars arguments to control the layout, or by using other packages such as gmodels or vcd. However, these tables can become very large and difficult to read or visualize. Consider other ways to analyze your data, such as using logistic regression or decision trees.
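As a minimal sketch of the flattening approach, the built-in Titanic object is already a four-way table, so it can be passed to ftable() directly. The choice of row and column variables below is one option among several.

```r
# The built-in Titanic object is a four-way table of counts.
four_way <- datasets::Titanic

# Class and Sex define the rows; Age and Survived define the columns.
flat4 <- ftable(four_way, row.vars = c("Class", "Sex"),
                col.vars = c("Age", "Survived"))
print(flat4)
```

This gives 8 rows (4 classes by 2 sexes) and 4 columns (2 age groups by 2 survival outcomes) on a single page.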