Test for Normality in R – beyond graphical analysis


Introduction:

While working with linear models, chances are that one has tested the residuals for normality. Some of the common checks I have come across are the Jarque–Bera test, histograms and QQ plots. But recently I came across an article by B. W. Yap and C.H. Sim which discusses different types of normality tests at length and classifies them into the following classes:

  1. chi-squared
  2. Moments
    1. Jarque–Bera (JB) test
    2. D’Agostino test
    3. Skewness test
    4. Kurtosis test
  3. Empirical distribution tests
    1. Kolmogorov-Smirnov (KS) test
    2. Lilliefors (LL) test
    3. Cramer–von Mises (CVM) test
    4. Anderson–Darling (AD) test
  4. Spacings
    1. Rao’s test
    2. Greenwood test
  5. Regression and Correlation
    1. Shapiro–Wilk (SW) test
    2. Shapiro–Francia test
    3. Ryan–Joiner tests
  6. Other special tests

The paper rightly points out that normality can be assessed using graphical methods such as a quantile–quantile (QQ) plot, a histogram or a box plot (a quick base R sketch of these appears at the end of this introduction). However, the authors state that for conclusive evidence, statistical tests are used to check whether the normality assumption holds. Along with providing a quick description of the normality tests, the paper also studies the power of each test using a Monte Carlo simulation. The paper concludes the following:

Our simulation results show that for symmetric short-tailed distributions, D’Agostino and Shapiro–Wilk tests have better power. For symmetric long-tailed distributions, the power of Jarque–Bera and D’Agostino tests is quite comparable with the Shapiro–Wilk test. As for asymmetric distributions, the Shapiro–Wilk test is the most powerful test followed by the Anderson–Darling test.

The main aim of this post is to follow the B. W. Yap and C.H. Sim paper and use R functions to perform these normality tests. My aim is not to replicate the results but to introduce R packages and functions that allow us to execute these tests in R. Along the way we will also link to useful resources for each test, to keep this post short and precise.

I do observe some overlap in terms of R packages. For example, both the tseries package and the moments package have functions to perform the JB test. Similarly, the KS test can be performed in base R, while the nortest package provides its Lilliefors variant. I am only providing an introduction to these functions and in no way testing which is more efficient or superior. Hope you enjoy this!
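Since the graphical methods come up repeatedly, below is a minimal base R sketch of the three checks mentioned above (QQ plot, histogram, box plot); the data are simulated, so the exact pictures will vary:

set.seed(1234)
x <- rnorm(1000)
qqnorm(x)   # points should lie close to a straight line for normal data
qqline(x)   # adds a reference line through the first and third quartiles
hist(x)     # should look roughly bell-shaped and symmetric
boxplot(x)  # should be roughly symmetric with few outliers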

Pearson’s chi-squared (CSQ) goodness-of-fit test:

The B. W. Yap and C.H. Sim paper discusses the CSQ test under section 2.3. The pearson.test() function from the nortest package can be used to perform this test in R. The test is defined as follows:

Null (H0): The data are normally distributed
Alternate (HA): The data are not normally distributed

The code below executes the pearson.test() function in R using a random sample of 1000 observations drawn from a normal distribution with mean 0 and standard deviation 1. The high p-value indicates that we fail to reject the null hypothesis.

library(nortest)
pearson.test(rnorm(1000, mean = 0, sd = 1))
#Pearson chi-square normality test

#data:  rnorm(1000, mean = 0, sd = 1)
#P = 15.424, p-value = 0.9814

In the code below we perform the same test using a lognormal distribution. The low p-value (<0.05) indicates that we reject the null hypothesis and conclude that the distribution is not normal.

library(nortest)
pearson.test(rlnorm(1000, meanlog = 0, sdlog = 1))

#Pearson chi-square normality test

#data:  rlnorm(1000, meanlog = 0, sdlog = 1)
#P = 1184.4, p-value < 2.2e-16

Jarque–Bera (JB) test

The B. W. Yap and C.H. Sim paper describes the JB test under section 2.4.1. The jarque.bera.test() function from the tseries package in R can be used to check if the data are normally distributed. The test is defined as follows:

Null (H0): The data are normally distributed
Alternate (HA): The data are not normally distributed

The code below shows the output from the jarque.bera.test() function using random normal data. We can always substitute the random data with residuals from a linear regression to test for normality (see the sketch at the end of this section). The high p-value (>0.05) indicates that we fail to reject the null and the data are normally distributed.

library(tseries)
x <- rnorm(100)
jarque.bera.test(x)

#Jarque Bera Test

#data:  x
#X-squared = 0.76041, df = 2, p-value = 0.6837

Similarly, the code below uses randomly generated data from a uniform distribution. The low p-value (<0.05) indicates that we reject the null and the data are not normally distributed.

library(tseries)
x <- runif(100) # alternative
jarque.bera.test(x)

#Jarque Bera Test

#data:  x
#X-squared = 6.1917, df = 2, p-value = 0.04524
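As a quick illustration of the point about residuals above, here is a minimal sketch with simulated data and hypothetical variable names; the errors are normal by construction, so we expect a high p-value:

library(tseries)
set.seed(1234)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)       # simulated linear relationship with normal errors
fit <- lm(y ~ x)
jarque.bera.test(residuals(fit))  # test the regression residuals for normality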

D'Agostino (DP) test

The B. W. Yap and C.H. Sim paper describes the DP test under section 2.4.2. The agostino.test() function from the moments package in R can be used to check if the data are normally distributed. The test is defined as follows:

Null (H0): The data are normally distributed
Alternate (HA): The data are not normally distributed

The code and output below show the D'Agostino test for normality. The high p-value (>0.05) indicates that we fail to reject the null hypothesis and the data are normally distributed.

library(moments)
x <- rnorm(100, mean = 5, sd = 3)
agostino.test(x)

#D'Agostino skewness test

#data:  x
#skew = 0.31247, z = 1.33227, p-value = 0.1828
#alternative hypothesis: data have a skewness

The code and output below show the D'Agostino test using random data generated from a lognormal distribution. The low p-value (<0.05) indicates that we reject the null and the data are not normally distributed.

library(moments)
set.seed(1234)
rln <- rlnorm(1000, meanlog = 0, sdlog = 1)
agostino.test(rln)
#######################################
# Output
########################################
#D'Agostino skewness test

#data:  rln
#skew = 4.4011, z = 25.5048, p-value < 2.2e-16
#alternative hypothesis: data have a skewness

Skewness Test:

The paper we have been discussing in this post does not explicitly define this test but classifies it under the moments tests. The Wikipedia article on skewness states that:

  • Normal distribution has a skewness of 0
  • Half-normal distribution has a skewness of just below 1
  • Exponential distribution has a skewness of 2
  • Lognormal distribution can have a skewness of any positive value, depending on its parameters

The skewness() function from the moments package is used to calculate the skewness in the code below:

library(moments)
set.seed(1234)
nl <- rnorm(1000) # Normal distribution
rln <- rlnorm(1000) # Lognormal distribution
el <- rexp(1000) # exponential distribution
skewness(nl)  # Normal distribution
#[1] -0.005202026
skewness(rln) # Lognormal distribution
#[1] 4.114297
skewness(el) # exponential distribution
#[1] 1.729093

We see that the normally distributed data has skewness close to 0 and the exponential data has skewness close to its theoretical value of 2.

Kurtosis test:

The Wikipedia article on kurtosis states the following:

The kurtosis of any univariate normal distribution is 3. It is common to compare the kurtosis of a distribution to this value. Distributions with kurtosis less than 3 are said to be platykurtic, although this does not imply the distribution is “flat-topped” as sometimes reported. Rather, it means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers. Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian, and therefore produces more outliers than the normal distribution.

Hence, we can calculate the excess kurtosis to determine if the data is normally distributed. The excess kurtosis is defined as kurtosis minus 3. In R we can use the kurtosis() function from the moments package to calculate the kurtosis and then the excess kurtosis, as shown in the code below:

library(moments)
set.seed(123)
nl <- rnorm(1000, mean = 5, sd = 3)
rln <- rlnorm(1000, meanlog = 0, sdlog = 1)
ex1 <- kurtosis(nl) - 3   # excess kurtosis of the normal sample
ex2 <- kurtosis(rln) - 3  # excess kurtosis of the lognormal sample
print(ex1)
#[1] -0.07425344
print(ex2)
#[1] 46.9249

We note that the lognormal data has large excess kurtosis whereas the excess kurtosis of the normally distributed data is close to zero.
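To illustrate the platykurtic case from the quote above, here is a quick check on uniform data; the theoretical kurtosis of a uniform distribution is 9/5 = 1.8, so the excess kurtosis should be clearly negative:

library(moments)
set.seed(123)
ul <- runif(1000)
kurtosis(ul)      # expect a value near 1.8
kurtosis(ul) - 3  # negative excess kurtosis: platykurtic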

Cramer–von Mises (CVM) test:

The B. W. Yap and C.H. Sim paper describes the CVM test under section 2.2.3. The nortest package in R provides the cvm.test() function to check if the data are normal. The null hypothesis is that the data are normally distributed.

library(nortest)
cvm.test(rnorm(100, mean = 5, sd = 3))

#######################################
# Output
########################################
#Cramer-von Mises normality test

#data:  rnorm(100, mean = 5, sd = 3)
#W = 0.041433, p-value = 0.6515

As shown in the output above, the large p-value indicates that we fail to reject the null hypothesis. The following code and output are generated using uniformly distributed data. The small p-value indicates that we reject the null hypothesis.

library(nortest)
cvm.test(runif(100, min = 2, max = 4))

#######################################
# Output
########################################
# Cramer-von Mises normality test

# data:  runif(100, min = 2, max = 4)
# W = 0.17564, p-value = 0.010

Anderson–Darling (AD) Test:

The B. W. Yap and C.H. Sim paper describes the AD test under section 2.2.4. The ad.test() function from the nortest package in R checks whether the data are normal. The test is defined as follows:

Null (H0): The data are normally distributed
Alternate (HA): The data are not normally distributed

The R code and output are shown below. The p-value of the AD statistic is small, hence we reject the null hypothesis that the data are normally distributed.

library(nortest)
ad.test(runif(100, min = 2, max = 4))

#Anderson-Darling normality test

#data:  runif(100, min = 2, max = 4)
# A = 1.3459, p-value = 0.001636

In the next example, the p-value of the AD statistic is large, hence we fail to reject the null hypothesis that the data are normally distributed.

library(nortest)
ad.test(rnorm(100, mean = 5, sd = 3))

#Anderson-Darling normality test

#data:  rnorm(100, mean = 5, sd = 3)
#A = 0.62472, p-value = 0.101

Kolmogorov–Smirnov (KS) Test:

The B. W. Yap and C.H. Sim paper describes the KS test under section 2.2.1. The ks.test() function from base R is used to check if the data are normally distributed. The test is defined as follows:

Null (H0): The data are normally distributed
Alternate (HA): The data are not normally distributed

The code below is an example of performing the one-sample KS test in R, comparing the sample against a normal distribution with the same mean and standard deviation. The high p-value indicates that we fail to reject the null hypothesis that the data are normally distributed.

ks.test(rnorm(100, mean = 5, sd = 3), "pnorm", mean = 5, sd = 3)

#One-sample Kolmogorov-Smirnov test

#data:  rnorm(100, mean = 5, sd = 3)
#D = ..., p-value = ... (values vary with the random sample; expect a high p-value)
#alternative hypothesis: two-sided

Lilliefors (LL) Test

The B. W. Yap and C.H. Sim paper describes the LL test under section 2.2.2. As pointed out in the paper, the LL test is a modification of the KS test described above. Hence the nortest package lists the function as "Lilliefors (Kolmogorov-Smirnov) test for normality". The LL test is defined as follows:

Null (H0): The data are normally distributed Alternate (HA): The data are not normally distributed

The lillie.test() function from the nortest package in R is used to check if the data are normally distributed. The high p-value indicates that we fail to reject the null hypothesis and the data are normally distributed.

library(nortest)
lillie.test(rnorm(100, mean = 5, sd = 3))

#Lilliefors (Kolmogorov-Smirnov) normality test

#data:  rnorm(100, mean = 5, sd = 3)
#D = 0.063961, p-value = 0.4009

Now we generate random data from a uniform distribution. The code below shows the LL test output when the data are uniformly distributed. The low p-value (<0.05) indicates that we reject the null hypothesis and the data are not normally distributed.

library(nortest)
lillie.test(runif(100, min = 2, max = 4))

#Lilliefors (Kolmogorov-Smirnov) normality test

#data:  runif(100, min = 2, max = 4)
#D = 0.10985, p-value = 0.00466

Rao’s test and Greenwood test:

Unfortunately I was not able to perform these tests. However, if you are curious, feel free to dive into the circular package in R, which has examples for Rao's test (see the sketch below). If you find material related to running these tests in R, feel free to add a comment with links or articles to this post.
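For the curious, here is a minimal sketch using the circular package; note that its rao.spacing.test() function applies Rao's spacing statistic to test uniformity of circular data, which is not a normality test as such:

library(circular)
set.seed(1234)
theta <- circular(runif(100, 0, 2 * pi))  # random angles on the circle
rao.spacing.test(theta)                   # Rao's spacing test for circular uniformity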

Shapiro–Wilk (SW) test:

The B. W. Yap and C.H. Sim paper describes the SW test under section 2.1. The SW test can be performed using the base R function shapiro.test(). The test is defined as follows:

Null (H0): The data are normally distributed Alternate (HA): The data are not normally distributed

The code below is an example of performing the SW test in R. The high p-value indicates that we fail to reject the null hypothesis that the data are normally distributed.

shapiro.test(rnorm(100, mean = 5, sd = 3))

#Shapiro-Wilk normality test

#data:  rnorm(100, mean = 5, sd = 3)
#W = 0.99496, p-value = 0.974

In the next example, the low p-value (<0.05) indicates that we reject the null hypothesis that the data are normally distributed. We note that the W statistic is further from 1 than in the normal example above.

shapiro.test(runif(100, min = 2, max = 4))

#Shapiro-Wilk normality test

#data:  runif(100, min = 2, max = 4)
#W = 0.94662, p-value = 0.0005002

Shapiro–Francia (SF) Test:

As highlighted in B. W. Yap and C.H. Sim under section 2.1, the SF test is a modified version of the SW test. The sf.test() function from the nortest package can be used to test for normality. The paper highlights that the SW test was originally designed for a sample size of less than 50, while the SF test handles larger samples; the help section of the sf.test() function in the nortest package states that the sample size must be between 5 and 5000. Since the test is very similar to the SW test performed above, a quick sketch is left below for the reader to try.
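A minimal sketch mirroring the SW examples above; we expect a high p-value for the normal sample and a low one for the uniform sample:

library(nortest)
set.seed(1234)
sf.test(rnorm(100, mean = 5, sd = 3))  # normal data: expect a high p-value
sf.test(runif(100, min = 2, max = 4))  # uniform data: expect a low p-value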

References:

  1. A quick guide to CSQ test: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
  2. A quick guide to AD test: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm
  3. A quick guide to KS test: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
  4. A quick guide to SW test: https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
