Simulating Multivariate Random Normal Data using Statistical Computing Platform R

Faculty members and students in educational research methodology often need to generate data for simulation and computation purposes, for demonstrating multivariate analysis techniques, or for constructing student projects and assignments. Simulated data are a great teaching tool that helps us understand the intricacies of statistical concepts and techniques. Generating multivariate normal data is a nontrivial process, and practical guides without dense mathematics are limited in the literature (Nissen and Saft, 2014). Hence, the purpose of this paper is to offer researchers a practical, readily accessible guide to generating multivariate random data with a given mean and variance-covariance structure. A detailed outline of simulating multivariate normal data with a given mean and variance-covariance matrix using the Eigen (or spectral) and Cholesky decompositions is presented and implemented in the statistical computing platform R version 3.4.4 (R Core Team, 2018).


INTRODUCTION
Many multivariate statistical techniques, such as multiple regression and multivariate analysis of variance, require an assumption of multivariate normality of the continuous variables.
Although some of these techniques are robust to violations of the multivariate normality assumption, such violations increase the chances of researchers committing Type I and/or Type II errors. Additionally, the correlations among variables can be distorted as a result of non-normality (Courtney and Chang, 2018). Hence, a major portion of the data screening efforts for our statistical analyses needs to focus on assessing and correcting for non-normality.
However, in many cases it can be better to provide our audiences with a multivariate normal data set so that they understand what such data look like before they try to assess or identify non-normality. Furthermore, many teachers may recall instances in which they wished they had a multivariate raw data set with a specified set of means, standard deviations, and a correlation matrix for the variables. It is for these reasons that a brief discussion and an accessible illustration of how to simulate multivariate normal data with a given mean and variance-covariance matrix using the Eigen and Cholesky decompositions may be a necessary topic.

Generating Multivariate Normal Data
Although there are several studies reporting the generation of multivariate normal data using various software packages and computing platforms, the process is nontrivial, and practical guides without dense mathematics are limited in the literature (Demirtas, 2004; Goldman and McKenzie, 2009; Hunt, 2001; Nissen and Saft, 2014). One can decompose a given data matrix in several different ways. Two widely used methods are the QR decomposition, which factors a matrix into an orthogonal matrix and an upper triangular matrix, and the singular value decomposition (SVD). There are also the Cholesky and Eigen decompositions, which can be viewed as special cases of the former two for suitable square matrices. The QR and the SVD differ from the Cholesky and Eigen decompositions in that the latter require the input to be a square matrix, whereas the QR and SVD can be applied to any m × n matrix. For generating multivariate normal data, the Eigen and Cholesky decompositions are two of the most commonly used methods.
Although the Eigen value decomposition method is more stable, the Cholesky decomposition method is faster, though not by a considerable amount (Venables and Ripley, 2002). The statistical computing platform R version 3.4.4 (R Core Team, 2018) was used to implement the above-mentioned decompositions to generate several multivariate random normal data sets.

Eigen Decomposition
The Eigen (or spectral) decomposition factors the symmetric variance-covariance matrix Σ as Σ = V Λ V^T, where V is the orthogonal matrix whose columns are the eigenvectors of Σ and Λ is the diagonal matrix of its eigenvalues. A multivariate normal sample can then be generated as X = MU + V Λ^(1/2) Z, where the elements of Z are a random sample from a normal distribution with mean 0 and variance 1. The partial R script for generating the multivariate normal data for three variables and 5,000 cases is given in Figure 1.

Figure 1. Partial R script for implementing the Eigen decomposition for generating multivariate normal data
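Since only a fragment of the Figure 1 script survives in this excerpt, a minimal sketch of an Eigen-based generator is given below. The seed, mean vector, variance-covariance matrix, and object name dataeigen are illustrative assumptions, not the paper's exact values.

```r
# Sketch (assumed values): generate 5,000 trivariate normal cases via the
# Eigen decomposition, Sigma = V Lambda V^T, with X = MU + V Lambda^(1/2) Z.
set.seed(1234)                              # illustrative seed
n  <- 5000
mu <- c(10, 20, 30)                         # assumed mean vector
Sigma <- matrix(c(9,  3,  2,
                  3, 16,  5,
                  2,  5, 25), nrow = 3)     # assumed variance-covariance matrix
es <- eigen(Sigma, symmetric = TRUE)
A  <- es$vectors %*% diag(sqrt(es$values))  # V Lambda^(1/2)
Z  <- matrix(rnorm(3 * n), nrow = 3)        # 3 x n standard normal deviates
dataeigen <- as.data.frame(t(mu + A %*% Z)) # n x 3 data set
names(dataeigen) <- c("x1", "x2", "x3")
# write.csv(dataeigen, "dataeigen.csv", row.names = FALSE)  # save to a file
```

The empirical means and covariances of dataeigen should closely match mu and Sigma, up to sampling error.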

Cholesky Decomposition
Alternatively, one can generate a multivariate normal random sample by using a matrix operation called the Cholesky decomposition, which is considered to provide an efficient method for the case of a symmetric positive definite variance-covariance matrix Σ. The probability density function, pdf, of the multivariate normal distribution for the m-dimensional vector X is given by the following formula.

pdf(X) = (2π)^(-m/2) * det(Σ)^(-1/2) * exp(-0.5 * (X-MU)^T * Σ^(-1) * (X-MU))

In this equation, MU is the mean vector, and sigma, Σ, is a positive definite symmetric matrix called the variance-covariance matrix. To create X, an m × n matrix containing n samples from this distribution, it is only necessary to create an m × n matrix Z, each of whose elements is a sample from the one-dimensional normal distribution with mean 0 and variance 1. Then, one can determine the upper triangular Cholesky factor A of the matrix Σ, so that Σ = A^T A. As a final step in the process, one computes X = MU + A^T Z.
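As a sanity check, the density formula above can be coded directly in R; the function name mvn_pdf and the bivariate example values are illustrative, not from the paper.

```r
# Direct evaluation of the multivariate normal pdf given above.
mvn_pdf <- function(x, mu, Sigma) {
  m <- length(mu)
  d <- x - mu
  drop((2 * pi)^(-m / 2) * det(Sigma)^(-1 / 2) *
         exp(-0.5 * t(d) %*% solve(Sigma) %*% d))
}
# At the mean of a standard bivariate normal, the density is 1/(2*pi):
mvn_pdf(c(0, 0), mu = c(0, 0), Sigma = diag(2))  # 0.1591549
```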
In order to simulate from a multivariate normal distribution with a mean vector μ and variance-covariance matrix Σ, one needs to be able to express the variance-covariance matrix as Σ = AA^T for some matrix A. This is the case if and only if Σ is a positive semi-definite or positive definite matrix, that is, a symmetric matrix with non-negative eigenvalues (Golub and Van Loan, 1996). The Cholesky decomposition of Σ yields a lower triangular matrix A such that A times its transpose, A^T, gives Σ back again; in R, the chol() function returns the upper triangular factor, whose transpose is this lower triangular matrix. This matrix is used in generating the multivariate random normal data in the following manner. If one generates a vector Z of standard normal random numbers with length equal to the dimension of A, then multiplying A, the Cholesky factor of Σ, by Z and adding the desired mean yields the desired random samples.
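Only a fragment of the Figure 2 script appears in this excerpt, so a minimal sketch of a Cholesky-based generator follows; the seed, mean vector, variance-covariance matrix, and object name datachol are illustrative assumptions.

```r
# Sketch (assumed values): generate 5,000 trivariate normal cases via the
# Cholesky decomposition, Sigma = A^T A with A upper triangular.
set.seed(1234)                              # illustrative seed
n  <- 5000
mu <- c(10, 20, 30)                         # assumed mean vector
Sigma <- matrix(c(9,  3,  2,
                  3, 16,  5,
                  2,  5, 25), nrow = 3)     # assumed variance-covariance matrix
A <- chol(Sigma)                            # upper triangular: t(A) %*% A == Sigma
Z <- matrix(rnorm(3 * n), nrow = 3)         # 3 x n standard normal deviates
datachol <- as.data.frame(t(mu + t(A) %*% Z))  # X = MU + A^T Z, as n x 3 data
names(datachol) <- c("x1", "x2", "x3")
# write.csv(datachol, "datachol.csv", row.names = FALSE)  # save to a file
```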
Without detailed mathematical proofs using matrix algebra, one can reason through this process conceptually by making the following observations. Var(AZ) = A Var(Z) A^T, since A is a constant matrix. Because standard normal random numbers have a variance of 1, the variance of Z is the identity matrix I. Hence Var(AZ) = A I A^T = A A^T = Σ, so the random data set generated aligns with the desired variance-covariance structure. The partial R script for generating the multivariate normal data for three variables and 5,000 cases is given in Figure 2.

Figure 2. Partial R script for implementing the Cholesky decomposition for generating multivariate normal data

A comparison of the results of the multivariate normal data generated by using the Eigen and Cholesky decompositions is summarized in Table 1. Throughout the table, the comparisons are based on a multivariate normal data set generated for 5,000 observations and three variables, x1, x2, and x3. In addition to the theoretical and empirical means, standard deviations, correlation matrices, and residual correlation matrices, the root mean square error (RMSE) values, which are obtained by squaring the residuals, averaging the squares, and taking the square root, are also given in Table 1. As presented in Table 2, the results of the Anderson-Darling (A-D) test for univariate normality revealed that each of the variables x1, x2, and x3 generated by either the Eigen or the Cholesky decomposition follows a normal distribution. Additionally, Mardia's multivariate normality test results confirm that the simulated data generated by both decompositions satisfy the multivariate normality assumption.

A generic algorithm
A generic algorithm for simulating a multivariate normal distribution with a given mean and variance-covariance structure can be described in six steps. It is worth mentioning that the ability to tolerate positive semi-definite variance-covariance matrices can be hugely beneficial in preventing the algorithm from failing outright. When it is necessary to address potential singularity issues, it may be advisable to revise the variance-covariance matrix by eliminating linearly dependent columns.
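The paper's six steps are not reproduced in this excerpt; the sketch below is one plausible implementation of such a generic generator, not the paper's exact algorithm. The function name rmvn_generic and its tolerance handling are assumptions: it uses the Eigen decomposition and clamps tiny negative eigenvalues to zero so that positive semi-definite matrices are tolerated rather than causing a failure.

```r
# One possible generic generator (illustrative): 1) validate inputs,
# 2) eigendecompose Sigma, 3) clamp tiny negative eigenvalues (PSD tolerance),
# 4) form A = V Lambda^(1/2), 5) draw standard normals Z, 6) return
# X = MU + A Z, transposed to an n x m data matrix.
rmvn_generic <- function(n, mu, Sigma, tol = 1e-8) {
  m <- length(mu)
  stopifnot(all(dim(Sigma) == c(m, m)))
  es <- eigen(Sigma, symmetric = TRUE)
  if (any(es$values < -tol * abs(es$values[1])))
    stop("Sigma must be positive (semi-)definite")
  A <- es$vectors %*% diag(sqrt(pmax(es$values, 0)), m)
  Z <- matrix(rnorm(m * n), nrow = m)
  t(mu + A %*% Z)
}
# A singular (positive semi-definite) Sigma is tolerated rather than fatal:
x <- rmvn_generic(1000, mu = c(0, 5), Sigma = matrix(c(4, 2, 2, 1), 2))
```

With the singular Sigma shown, the two generated columns are perfectly correlated, as the rank-one covariance structure requires.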

Using functions in R packages
Another way to generate multivariate normal data is to take advantage of the functions currently available in various R packages. Since Demirtas (2004) provided some R scripts for generating pseudo-random numbers from different multivariate distributions in the absence of R packages and functions, several R functions in various packages have become available to simulate multivariate normal data. Only two such packages, and one function from each, are considered and illustrated here.
In R, one can use the mvrnorm() function from the MASS package to produce one or more samples from a specified multivariate normal distribution. The usage of the mvrnorm() function is based on user-provided arguments, such as the number of samples, a vector of means for the variables, a variance-covariance matrix for the variables, and a tolerance value for the lack of positive definiteness. Based on its source code, the mvrnorm() function uses the Eigen decomposition to generate the multivariate random normal samples. The partial R script used to simulate multivariate normal data for three variables and 5,000 cases using the mvrnorm() function is given in Figure 3.

In addition to the mvrnorm() function from the MASS package, the rmvnorm() function from the mvtnorm package can be used to generate multivariate normal data. The usage of the rmvnorm() function is based on user-provided arguments, such as the number of observations, a vector of means for the variables, a variance-covariance matrix for the variables, and a choice of method. The method argument of the rmvnorm() function offers three generation algorithms: the Eigen value, Cholesky, and singular value decompositions, selected with the eigen, chol, and svd choices, respectively. The partial R script used to simulate multivariate normal data for three variables and 5,000 cases using the rmvnorm() function is given in Figure 4.

A comparative summary of the results of the multivariate normal data generated by the mvrnorm() and rmvnorm() functions from the R packages MASS and mvtnorm, respectively, is presented in Table 3. Throughout the table, the comparisons are based on a multivariate normal data set generated for 5,000 observations and three variables, x1, x2, and x3.
In addition to the theoretical and empirical means, standard deviations, correlation matrices, residual correlation matrices, and the RMSE values, the results of the multivariate normality tests for the simulated data are also given in Table 3. As presented in Table 3, the results of the A-D test for univariate normality revealed that each of the variables x1, x2, and x3 generated by either the mvrnorm() or the rmvnorm() function follows a normal distribution. Additionally, Mardia's multivariate normality test results confirm that the simulated data generated by both R functions satisfy the multivariate normality assumption.
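Since the Figure 3 and Figure 4 scripts are not reproduced in this excerpt, the two package-based calls can be sketched as follows, assuming the MASS and mvtnorm packages are installed; the seed, mean vector, and variance-covariance matrix are illustrative.

```r
library(MASS)      # provides mvrnorm()
library(mvtnorm)   # provides rmvnorm()
set.seed(1234)                              # illustrative seed
mu <- c(10, 20, 30)                         # assumed mean vector
Sigma <- matrix(c(9,  3,  2,
                  3, 16,  5,
                  2,  5, 25), nrow = 3)     # assumed variance-covariance matrix
# MASS: Eigen-based generator; tol guards against near-non-positive-definiteness
d1 <- mvrnorm(n = 5000, mu = mu, Sigma = Sigma, tol = 1e-6)
# mvtnorm: method may be "eigen" (the default), "chol", or "svd"
d2 <- rmvnorm(n = 5000, mean = mu, sigma = Sigma, method = "chol")
```

Both calls return a 5,000 × 3 matrix whose empirical means and covariances approximate mu and Sigma.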

Conclusions and Discussion
In this paper, I illustrated several different approaches to generating or simulating multivariate normal samples with detailed comparisons of two decomposition techniques, Eigen and Cholesky, and of two functions, mvrnorm() and rmvnorm(), from two R packages, MASS and mvtnorm, respectively. A six-step algorithm is presented and implemented in R to simulate data from a multivariate normal distribution. Even though there are various functions in several R packages, not every software package offers multivariate data generators. Hence, an understanding of the algorithm presented here might be helpful to some readers.
The results of these simulations can be used in a number of different ways. First, one can create raw data when one has access only to summarized results from published journal articles or textbook exercises in which only summarized data are provided. The simulated raw data sets can then be used to assess the assumptions graphically, numerically, and inferentially for such results and exercises. Second, one may create one's own exercise problems with a set of specified characteristics or a specific set of outcomes, such as a data set for which a null hypothesis is rejected, or fails to be rejected, at a given significance level.
Additionally, these simulations may provide instructors with a fairly straightforward procedure for generating different data sets for different students, preserving the integrity of exams, quizzes, and homework assignments, or for producing demonstrations for lectures. However, one needs to understand that simulated data are not real data and should not be presented as such. It is more appropriate to view simulated data as realistic look-alikes of real data in a given context.
Finally, a word of caution is in order. The six-step algorithm and the procedures presented here may not be the most efficient implementations of the two decompositions, Eigen and Cholesky; more efficient implementations may exist elsewhere. Also, one needs to keep in mind that the quality of simulated data ultimately depends on the quality of the built-in random number generation process of the software used.