# Zhang, Wang, & Tong (2014)

## Supplementary Material for Zhang, Wang, and Tong (2014)

Mediation Analysis with Missing Data through Multiple Imputation and Bootstrap

## Use of R package bmem

To use bmem, first install it using the code on Line 1 if you haven't done so. The code on Line 2 loads the package. One can then read the data into R using the read.table function which reads data in text format separated by white space. After than, one can specify the model using equations as shown on Line 13-15. Note that each regression path is labelled in the equation. The labels can be used to specify the indirect effect or other effect to estimate. For example, on Line 18, both the mediation effect and the total effect are specified. The function bmem can be used to conducted the needed mediation analysis. Note that boot= specifies the number of bootstrap to use and m= specifies the number of imputations. One can also plot the bootstrap distribution of a direct or indirect effect.

install.packages('bmem')
library(bmem)

### Step 1. Prepare data  ###
# Read data into R and store them in an object called dset

### Step 2. Specify the model  ###
M1<-specifyEquations(exog.variances=T)
math = b*HE + c*ME
HE = a*ME

### Step 3. Run the analysis  ###
indirect<-c('a*b', 'a*b+c')
M1.res<-bmem(dset, M1, indirect, boot=1000, method='mi', m=100)

### plot the bootstrap distribution
plot(M1.res, 'a*b')
plot(M1.res, 'a*b+c')

## Use of SAS macros

The SAS macros below can be used for mediation analysis with missing data using multiple imputation and bootstrap. Now we briefly explain the functioning of each part of the SAS program.

Lines 3-9 of the SAS program specifies all global parameters that control multiple imputation and bootstrap for mediation analysis. This part is the one that a user needs to modify according to his/her data analysis environment. Line 3 specifies the directory and name of the data file to be used. Line 4 lists the names of the variables in the data file. Line 5 specifies the missing data value indicator. For example, 99999 in the data file represents a missing datum. Line 6 specifies the number of imputations for imputing missing data. Line 7 defines the number of bootstrap samples. A number larger than 1000 is usually recommended. Line 8 and Line 9 specify the confidence level and the random number generator seed, respectively.

Lines 15-22 first read data into SAS from the data file specified on line 3 and then change missing data to the SAS missing data format - a dot. Lines 26-28 impute missing data for the original data set with auxiliary variables and generate K imputed data sets. Lines 30-34 estimate the mediation model parameters for each imputed data set. Lines 37-74 collect the results from the multiple imputed data sets and save the point estimates of model parameters and mediation effect in a SAS data set called “pointest”. I think maybe we can rename this data set and I would like to call it “point estimates”.The SAS codes in this section produce point parameter estimates for the model parameters and the mediation effect based on the original data after multiple imputation.

Lines 77-88 generate B bootstrap samples from the original data set with the same sample size. Lines 91-95 impute each bootstrap sample independently for K times. Lines 98-143 produce point estimates of mediation model parameters and mediation effect for each bootstrap sample and collect the point estimates for all bootstrap samples in the SAS data set named “bootest”.

The last part of the SAS program from Line 146 to Line 195 calculates the bootstrap standard errors and the bias-corrected confidence intervals for mediation model parameters and mediation effect. It also generates a table containing the point estimates, standard errors, and confidence intervals in the SAS output window.

To use the SAS program, one only needs to first change the global parameters in Lines 3-9, usually only lines 3 and 4, and then run the whole SAS program from the beginning to the end.

/*** Setup the global parameters ***/
/*The parameters below should be changed accordingly*/
%LET filename="c:\mnarmediation\dataname.txt"; * data file directory and name;
%LET varname=x m y a1 a2;  *specify variable names in the data file. Please use x for the input variable, m for the mediation variable, and y for the output variable. a1 and a2 are two auxiliary variables in the data file. You can use any names except for x, m, and y for naming the auxiliary variables;
%LET missing=99999; *specify the missing data value;
%LET nimpute = 100; *define the number of imputations K;
%LET nboot = 1000;  *define the number of bootstraps B;
%LET alpha = 0.95;  *define the confidence level;
%LET seed = 2010;   *random number seed;
/*** End of setup of global parameters ***/

/*In general, there is no need to change the codes below*/
/*Read data into sas*/
DATA dset;
INFILE &filename;
INPUT &varname;
ARRAY nvarlist &varname;
DO OVER nvarlist;
IF nvarlist = &missing THEN nvarlist = .;
END;
RUN;

/*Use multiple imputation to obtain point estimates of the model parameters  based on the original data set*/
/*Imputing the original data set multipe times*/
PROC MI DATA=dset SEED=&seed NIMPUTE=&nimpute OUT=imputed NOPRINT;
VAR &varname;
RUN; QUIT;
/*Estimating model parameters for each imputed data set*/
PROC REG DATA=imputed OUTEST= est NOPRINT;
MODEL y = x m;
MODEL m = x;
BY _Imputation_;
RUN; QUIT;

/*Collecting results from mutiple imputations*/
DATA temp;
SET est;
id =INT((_N_-.1)/2)+1;
modelnum = MOD(_N_+1, 2)+1;
RUN;

DATA temp1;
SET temp;
ARRAY int iY iM;
ARRAY xpar c a;
ARRAY mpar b tmp1;
ARRAY sigma sy sm;
RETAIN  a b c iY iM sy sm;
BY id;
IF FIRST.id THEN DO I = 1 to 2;
int[I] = .;
xpar[I] = .;
mpar[I]=.;
sigma[I]=.;
END;
int[modelnum] = intercept;
xpar[modelnum] = x;
mpar[modelnum] = m;
sigma[modelnum] = _RMSE_;
IF LAST.id THEN OUTPUT;
KEEP  _imputation_ a b c iY iM sy sm;
RUN;
/*Calcuating mediation effects*/
DATA temp2;
SET temp1;
ab=a*b;
RUN;

/*Saving the point estimates of model parameters and mediation effect from multiple imputation into a data set named 'pointest'*/
PROC MEANS DATA=temp2 NOPRINT;
VAR a b c ab iY iM sy sm;
OUTPUT OUT=pointest MEAN(a b c ab iY iM sy sm)=a b c ab iY iM sy sm;
RUN;

/*** Bootstraping data to obtain standard errors and confidence intervals ***/
DATA bootsamp;
DO sampnum = 1 to &nboot;
DO i = 1 TO nobs;
ran = ROUND(RANUNI(&seed) * nobs);
SET dset
nobs = nobs
point = ran;
OUTPUT;
END;
END;
STOP;
RUN; QUIT;

/*** Imputing K data sets for each bootstrap sample ***/
PROC MI DATA=bootsamp SEED=&seed NIMPUTE=&nimpute OUT=imputed NOPRINT;
EM MAXITER = 500;
VAR &varname;
BY sampnum;
RUN; QUIT;

/*Estimate model parameters for each imputed data set (in total, there are B*K imputed data sets.)*/
PROC REG DATA=imputed OUTEST= est NOPRINT;
MODEL y = x m;
MODEL m = x;
BY sampnum _Imputation_;
RUN; QUIT;

/*Collecting results from different imputed data sets*/
DATA temp;
SET est;
id =INT((_N_-.1)/2)+1;
modelnum = MOD(_N_+1, 2)+1;
RUN;

DATA temp1;
SET temp;
ARRAY int iY iM;
ARRAY xpar c a;
ARRAY mpar b tmp1;
ARRAY sigma sy sm;
RETAIN  a b c iY iM sy sm;
BY id;
IF FIRST.id THEN DO I = 1 to 2;
int[I] = .;
xpar[I] = .;
mpar[I]=.;
sigma[I]=.;
END;
int[modelnum] = intercept;
xpar[modelnum] = x;
mpar[modelnum] = m;
sigma[modelnum] = _RMSE_;
IF LAST.id THEN OUTPUT;
KEEP  sampnum _imputation_ a b c iY iM sy sm;
RUN;

DATA temp2;
SET temp1;
ab=a*b;
RUN;

/*Compute point estimates of model parameters and mediation effect for each bootstrap sample and the results are saved in the data file named 'bootest'. */
PROC MEANS DATA=temp2 NOPRINT;
BY sampnum;
VAR a b c ab iY iM sy sm;
OUTPUT OUT=bootest MEAN(a b c ab iY iM sy sm)=a b c ab iY iM sy sm;
RUN;

/*** Calculate the BC intervals based on the point estimates from different bootstrap samples and produce a table containing the points estimates, standard errors, confidence intervals in the output window.***/
PROC IML;
START main;
USE pointst;
READ ALL INTO Y;
USE bootest;
READ ALL INTO X;

n=NROW(X);
m=NCOL(X);

bc_lo=J(1,m-3,0);
bc_up=J(1,m-3,0);
se=J(1,m-3,0);

alphas=1-(1-&alpha)/2;
zcrit = PROBIT(alphas);

DO j=1 TO m-3;
se[j]=SQRT((SSQ(X[,j+3]) -(SUM(X[,j+3]))**2/n)/(n-1));
number=0;
DO i=1 TO n;
IF X[i,j+3]<Y[j+2] THEN number=number+1;
END;
p=number/n;
z0hat=PROBIT(p);

q1=z0hat+(z0hat-zcrit);
q2=z0hat+(z0hat+zcrit);
alpha1=PROBNORM(q1);
alpha2=PROBNORM(q2);

vec=X[,j+3];
CALL SORT(vec,{1});

low=int(alpha1*(n+1));
up=int(alpha2*(n+1));
IF low<1 THEN low=1;
IF up>n THEN up=n;
bc_lo[j]=vec[low];
bc_up[j]=vec[up];
END;

result=Y[3:10]||se`||(bc_lo`)||(bc_up`);
MATTRIB result ROWNAME=({a, b, c, ab, iy, im, sy, sm})
COLNAME=({estiamtes se CI_lo CI_up})
LABEL='MEDIATION ANALYSIS RESULTS' FORMAT=f10.5;
PRINT result;
FINISH;
RUN main;
QUIT;