Key points
- A KS test compares the distribution of a sample with a reference distribution, or compares the distributions of two samples, without assuming a particular form for the underlying distribution.
- You can use the ks.test function in R to perform a KS test with different arguments and additional parameters.
- You can select a reference distribution for a KS test based on your understanding of the data or research question. Alternatively, you can use an empirical CDF from a different sample or perform a bootstrap resampling.
- To understand the result of a KS test, compare the p-value to your significance level. This will help you determine if the sample matches the reference distribution or if two samples have the same distribution.
A Kolmogorov-Smirnov test (KS test) is a nonparametric test that compares a sample's cumulative distribution function (CDF) with a reference CDF or the CDFs of two samples.
It checks whether a sample follows a specific distribution, such as the normal, exponential, or uniform, or whether two samples have the same distribution.
In R, the ks.test function can perform a KS test. The syntax of the function is:
ks.test(x, y, ...)
# x: a numeric vector of data values
# y: a numeric vector, a character string naming a continuous CDF (such as "pnorm"), or an actual CDF function
# ...: parameters of the distribution specified by y
The ks.test function returns an object of class "htest" that contains the following components:
- statistic: the value of the KS test statistic
- p.value: the p-value of the test
- alternative: the alternative hypothesis
- method: a character string indicating what type of test was performed
- data.name: a character string giving the name(s) of the data
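Since the result is a standard "htest" object, you can store it in a variable and access these components by name. A minimal sketch:
# Run a test on simulated data and extract individual components
res <- ks.test(rnorm(100), y = "pnorm")
res$statistic  # the KS test statistic D
res$p.value    # the p-value of the test
res$method     # a description of the test performed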
Example 1: Test whether a sample follows a normal distribution
Suppose we have a sample of 100 observations and want to test whether it follows a normal distribution. We can use the ks.test function with the argument y = "pnorm" to specify the reference CDF as the standard normal CDF.
# Generate a sample of 100 observations from a normal distribution with a mean of 0 and a standard deviation of 1
set.seed(123)
x <- rnorm(100)
# Perform a KS test with the standard normal CDF as the reference
ks.test(x, y = "pnorm")
The output shows that the KS test statistic is 0.093034, and the p-value is 0.3522. Since the p-value is greater than 0.05, we fail to reject the null hypothesis that the sample follows a normal distribution.
Example 2: Test whether a sample follows an exponential distribution with estimated parameters
Suppose we have another sample of 100 observations and want to test whether it follows an exponential distribution. However, we do not know the rate parameter of the exponential distribution, so we need to estimate it from the sample data.
We can use the ks.test function with the argument y = "pexp" to specify the reference CDF as the exponential CDF and pass the estimated rate parameter as an additional argument.
# Generate a sample of 100 observations from an exponential distribution with a rate of 0.5
set.seed(456)
x <- rexp(100, rate = 0.5)
# Estimate the rate parameter from the sample data
lambda <- 1 / mean(x)
# Perform a KS test with the exponential CDF as the reference and the estimated rate parameter as an additional argument
ks.test(x, y = "pexp", rate = lambda)
Example 3: Test whether two samples have the same distribution
Suppose we have two samples of different sizes and want to test whether they have the same distribution. We can use the ks.test function with both samples as arguments.
# Generate two samples of different sizes from different distributions
set.seed(789)
x <- runif(50, min = 0, max = 1) # uniform distribution on [0, 1]
y <- rbeta(100, shape1 = 2, shape2 = 2) # beta distribution with parameters 2 and 2
# Perform a KS test with the two samples as arguments
ks.test(x, y)
The output shows that the KS test statistic is 0.23 and the p-value is 0.05551. Since the p-value is slightly greater than 0.05, we fail to reject the null hypothesis that the two samples have the same distribution, although the result is borderline. Note that the samples were in fact drawn from different distributions (uniform and beta), so this illustrates that the test may lack power to detect moderate differences at these sample sizes.
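To see what the statistic measures here, you can overlay the two empirical CDFs; the KS statistic D is the largest vertical gap between them. A quick sketch using the x and y generated above:
# Overlay the empirical CDFs of the two samples
plot(ecdf(x), main = "Empirical CDFs of x and y", xlab = "value")
lines(ecdf(y), col = "red")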
How to choose a reference distribution for a KS test
You can choose a reference distribution for a KS test based on your knowledge of the data or research question. For example, to test whether your data follow a normal distribution, use the y = "pnorm" argument to specify the standard normal CDF as the reference. If you have prior information about the mean and standard deviation, pass them as additional arguments, such as y = "pnorm", mean = 10, sd = 2. Alternatively, estimate them from the sample data with the mean and sd functions and pass the estimates as additional arguments.
Similarly, to test whether your data follow an exponential distribution, use the y = "pexp" argument to specify the exponential CDF as the reference. If you know the rate parameter, pass it as an additional argument, such as y = "pexp", rate = 0.5; otherwise, estimate it from the sample data as 1 / mean(x) and pass the estimate.
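For instance, the calls look roughly like this (the parameter values 10, 2, and 0.5 are the illustrative values mentioned above, not estimates from real data):
# Normal reference with known or estimated parameters
ks.test(x, y = "pnorm", mean = 10, sd = 2)
ks.test(x, y = "pnorm", mean = mean(x), sd = sd(x))
# Exponential reference with a known or estimated rate
ks.test(x, y = "pexp", rate = 0.5)
ks.test(x, y = "pexp", rate = 1 / mean(x))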
You can also choose other continuous distributions as the reference, such as uniform (y = "punif"), gamma (y = "pgamma"), beta (y = "pbeta"), etc.
Suppose you do not have a specific theoretical distribution in mind but still want a reference that fits the data well. You can use the y = ecdf(z) argument to specify an empirical CDF based on another sample, z.
The sample z can come from another group or treatment, or from bootstrap resampling of your original sample x. Bootstrap resampling draws random samples with replacement from x and computes the empirical CDF of each resample; you can then use the average of these empirical CDFs as the reference CDF for your KS test.
This approach is one way to reflect the uncertainty in estimating the reference CDF from the data, although the resulting p-value should be treated as approximate because the reference is itself built from a sample.
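A rough sketch of both ideas follows; here z is a hypothetical second sample, and because the bootstrap reference is built from x itself, the resulting p-value should be read as approximate rather than exact:
# Option 1: use the empirical CDF of another sample z as the reference
set.seed(321)
z <- rnorm(80)  # hypothetical second sample for illustration
ks.test(x, ecdf(z))
# Option 2: average the empirical CDFs of bootstrap resamples of x
B <- 200
grid <- sort(unique(x))
boot_cdfs <- replicate(B, ecdf(sample(x, replace = TRUE))(grid))
avg_cdf <- approxfun(grid, rowMeans(boot_cdfs), yleft = 0, yright = 1)
ks.test(x, avg_cdf)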
Conclusion
In this article, we have learned how to use the ks.test function in R to perform a Kolmogorov-Smirnov test. We have seen how to test whether a sample follows a specific theoretical distribution or whether two samples have the same distribution using different arguments and additional parameters. We have also seen how to interpret the output of the ks.test function, which includes the test statistic, the p-value, the alternative hypothesis, and the method.
If you want to learn more about data analysis in R, check out our website, Data Analysis, where we provide tutorials related to RStudio, scientific articles, and books. You can also hire us for your data analysis projects by visiting our order page or contacting us at info@rstudiodatalab.com. We hope you enjoyed this article and found it helpful. Thank you for reading!
FAQs
What is a Kolmogorov-Smirnov test?
A Kolmogorov-Smirnov test is a nonparametric test that compares the cumulative distribution function of a sample with a reference CDF or the CDFs of two samples.
What is the null hypothesis of a KS test?
The null hypothesis of a KS test is that the sample follows the reference distribution or that the two samples have the same distribution.
What is the alternative hypothesis of a KS test?
The alternative hypothesis of a KS test is that the sample does not follow the reference distribution or that the two samples have different distributions.
How do I perform a KS test in R?
You can use the ks.test function in R to perform a KS test. The function takes two main arguments: x, a numeric vector of data values, and y, which is either a character string naming a continuous distribution function (such as "pnorm"), an actual distribution function, or another numeric vector of data values. You can also pass additional arguments for the distribution specified by y.
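In outline, the three call forms look like this (x and y are assumed to be numeric vectors, and the rate value is illustrative):
ks.test(x, y = "pnorm")             # one-sample test against a named distribution
ks.test(x, y = "pexp", rate = 0.5)  # distribution parameters passed as extra arguments
ks.test(x, y)                       # two-sample test with a second numeric vector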
How do I interpret the output of a KS test in R?
The output of a KS test in R includes the following components:
- statistic: the value of the KS test statistic
- p.value: the p-value of the test
- alternative: the alternative hypothesis
- method: a character string indicating what type of test was performed
- data.name: a character string giving the name(s) of the data
You reject the null hypothesis if the p-value is less than the significance level, and fail to reject it if the p-value is greater than or equal to the significance level.
What are some applications of a KS test?
A KS test can test whether a sample follows a specific theoretical distribution, such as normal, exponential, or uniform. This can be useful for checking the assumptions of parametric tests or models, or for exploring data characteristics. A KS test can also test whether two samples have the same distribution, which is useful for comparing groups or treatments or for testing homogeneity.
What are some limitations of a KS test?
A KS test has some limitations, such as:
- It is sensitive to outliers and ties in the data.
- It may not have enough power to detect slight differences in distributions.
- It may not be appropriate for discrete or categorical data (see the sketch after this list).
- It may not account for parameters estimated from the data.
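For example, a discrete or heavily tied sample typically triggers a warning, as in this small sketch:
# Ties usually produce a warning because the test assumes a continuous distribution
x_discrete <- c(1, 1, 2, 2, 3, 3)
ks.test(x_discrete, y = "pnorm")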
How do I choose a reference distribution for a KS test?
You can choose a reference distribution for a KS test based on your knowledge of the data or research question.