Hello All,

Recently, I was asked to help out with an SEM cross-validation analysis. 
Initially, the project was based on "sample splitting," where half of the 
cases were randomly assigned to a training sample and half to a testing 
sample. Attempts to replicate a model developed in the training sample using 
the testing sample were not entirely successful: a number of parameter 
estimates were substantially different, and multiple-group analyses using 
cross-group constraints and a chi-square difference test subsequently showed 
those differences to be significant.

There is a discussion starting on page 90 of Frank Harrell's book Regression 
Modeling Strategies that seems to shed light on why this might be the case. 
In essence, the results are largely the luck of the draw: choose one random 
seed when splitting the sample and the results cross-validate; choose another 
and they might not. 

The book then goes on to suggest some improvements on data splitting, the 
most promising of which appears to be bootstrapping. In the book, this 
typically involves fitting, say, a regression model to one's entire dataset, 
refitting the model in a series of bootstrap datasets, and then applying each 
bootstrap model back to the original data, in order to estimate the optimism 
in a statistic such as R^2 or MSE. 

Our SEM would likely require something slightly different. That is, we would 
need to develop a model based on the entire sample, run the same model on a 
series of bootstrap datasets, obtain the average (as well as the SD and 95% 
CI) of each model parameter across the bootstrap samples, and then compare 
those with what we got running the model on the original sample. Some of my 
other books show something like this for regression (e.g., An R Companion to 
Applied Regression, page 187; The R Book, page 418). 
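
If the SEM were moved to R, that parameter-level comparison might look 
something like the following sketch, assuming the lavaan package is 
installed. The model and variables here are lavaan's built-in 
HolzingerSwineford1939 toy example, standing in for the actual four-factor 
model:

```r
library(lavaan)  # assumed available; not part of base R

## Placeholder two-factor model on lavaan's built-in example dataset
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'
fit <- sem(model, data = HolzingerSwineford1939)

## Refit the same model in R bootstrap samples, collecting the free
## parameter estimates from each replicate (one row per bootstrap sample)
boot_est <- bootstrapLavaan(fit, R = 100, FUN = coef)

## Mean, SD, and percentile 95% CI for each parameter across the bootstrap
## samples, side by side with the original-sample estimates
summary_tab <- cbind(
  original  = coef(fit),
  boot_mean = colMeans(boot_est),
  boot_sd   = apply(boot_est, 2, sd),
  t(apply(boot_est, 2, quantile, probs = c(0.025, 0.975)))
)
round(summary_tab, 3)
```

Original-sample estimates falling outside their bootstrap intervals, or 
drifting far from the bootstrap means, would be the ones to worry about.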

So now having provided quite a bit of background, let me ask a few questions:

1. Is there any general agreement that the approach I've suggested is the way 
to go? Are there authors besides Dr. Harrell whom I could cite in pursuing 
this approach?

2. Does anyone know of some substantial published applications of this approach 
using SEM?

3. Would any of the available R packages for SEM (e.g., lavaan, sem, OpenMx) 
be particularly straightforward to use for the bootstrapping? Thus far, the 
SEM has been done in Mplus. I've not tried SEM in R yet, but I would be 
interested in giving it a shot. The SEM itself is relatively straightforward: 
four latent variables, one with 7 indicators and the others with 4 indicators 
each, plus a couple of indirect paths involving mediation. The data are 
pretty non-normal, though, and there is a lot of missingness that might need 
to be dealt with using multiple imputation. 

Thanks,

Paul 

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
