Some comments from Mark Taper.  I have PDFs of the papers he mentions and
will forward them along if asked.  Thanks.

Tim
[email protected]

Thanks for forwarding this dialogue to me.  I am excited to see it
happening.  I think I can make a few useful comments.  

1) Small a priori model sets and all possible subsets are both useful
approaches to model identification.  They do, however, serve different
functions and have different costs and benefits.  Small sets allow you to
compare complex models with a lot of fine structure.  This is good when you
know a lot about your system; the risk is missing important effects that
were not considered.  All possible subsets is extremely useful at the stage
where background information leads you to suspect that any of a (possibly
large) number of factors might be important.  It will find cool stuff.  The
cost, of course, is model selection bias: models with spuriously good fits
will pop up by chance.  To compensate for model selection bias in large
model sets, the complexity penalty needs to be increased.  I discuss the
problem in:

Taper, M. L. (2004). Model identification from many candidates. Pages
448-524 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific
Evidence: Statistical, Philosophical and Empirical Considerations. The
University of Chicago Press, Chicago.

ABSTRACT:    Model identification is a necessary component of modern
science. Model misspecification is a major, if not the dominant, source of
error in the quantification of most scientific evidence. Hypothesis tests
have become the de facto standard for evidence in the bulk of scientific
work. Consequently, because hypothesis tests require a single null and a
single alternative hypothesis, there has been a very strong tendency to
restrict the number of models considered in an analysis to two. I discuss
the information criteria approach to model identification. The information
criteria approach can be thought of as an extension of the likelihood ratio
approach to the case of multiple alternatives. However, it has been claimed
that information criteria are "confused" by too many alternative models and
that selection should occur among a limited set of models. I demonstrate
that the information criteria approach can be extended to large sets of
models. There is a tradeoff between the amount of model detail that can be
accurately captured and the number of models that can be considered. This
tradeoff can be incorporated in modifications of the parameter penalty term.

This paper gives a simple procedure that will allow one to sort through
large model sets using canned all-possible-subsets software.
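
To make the penalty-adjustment idea concrete, here is a rough sketch (in
Python, with invented variable names) of an all-possible-subsets search
scored by -2*logL + penalty*k.  Setting penalty = 2 gives ordinary AIC;
larger values impose the stiffer complexity penalty argued for with large
candidate sets.  The 4.0 in the usage comment is purely illustrative, not a
value taken from the paper.

    import itertools
    import numpy as np

    def gaussian_loglik(y, X):
        """Maximized log-likelihood of an ordinary least squares fit."""
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        n = len(y)
        sigma2 = resid @ resid / n
        return -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)

    def all_subsets_ic(y, X, names, penalty=2.0):
        """Score every subset of predictors with -2*logL + penalty*k."""
        n = len(y)
        scored = []
        for r in range(len(names) + 1):
            for cols in itertools.combinations(range(len(names)), r):
                # design matrix: intercept plus the chosen columns
                Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
                k = Xs.shape[1] + 1  # coefficients plus the error variance
                ic = -2.0 * gaussian_loglik(y, Xs) + penalty * k
                scored.append((ic, [names[j] for j in cols]))
        return sorted(scored)

    # e.g. all_subsets_ic(y, X, ["rain", "temp", "density"], penalty=4.0)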

The Type II error problem (perhaps a better term is false detection) is real
when working with multiple models, but much more manageable than one might
imagine.  The evidential paradigm gives researchers much finer control of
error than does the classical error frequency paradigm (sometimes called
frequentist statistics).  My colleague Subhash Lele and I discuss this in a
forthcoming paper:

Taper, M. L. and S. R. Lele (2010, expected). Evidence, Evidence Functions,
and Error Probabilities. In M. R. Forster and P. S. Bandyopadhyay, editors.
Handbook for Philosophy of Statistics. Elsevier.

Abstract:  We discuss the evidential paradigm as we see it currently
developing. We characterize evidential statistics as an epistemological tool
and provide a list of qualities we feel would make this tool most effective.
Evidentialism is often equated with likelihoodism, but we see likelihoodism
as only an important special case of a broader class of evidential statistics.
Our approach gives evidentialism a theoretical foundation which
likelihoodism lacked and allows extensions which solve a number of
statistical problems. We discuss the role of error probabilities in
evidential statistics, and develop several new error probability measures.
These measures are likely to prove useful in practice and they certainly
help to clarify the relationship between evidentialism and Neyman-Pearson
style error statistics.

While I am shamelessly shilling my own papers, another good read is:

Taper, M. L., D. F. Staples, et al. (2008). "Model structure adequacy
analysis: selecting models on the basis of their ability to answer
scientific questions." Synthese 163(3): 357-370.

ABSTRACT:    Models carry the meaning of science. This puts a tremendous
burden on the process of model selection. In general practice, models are
selected on the basis of their relative goodness of fit to data penalized by
model complexity. However, this may not be the most effective approach for
selecting models to answer a specific scientific question because model fit
is sensitive to all aspects of a model, not just those relevant to the
question. Model Structural Adequacy analysis is proposed as a means to
select models based on their ability to answer specific scientific questions
given the current understanding of the relevant aspects of the real world.

2) Summing AIC scores:
I agree with other commentators that this is generally bogus.  However, be
ready for a fight if you express that opinion to an editor.   Dave Hewit's
comments point out the obvious problems with validating preconceptions, but
similar problems can arise even without preconceptions.  Consider the all
possible subsets scenario discussed above.  A variable with a real but weak
effect may be included in many supported models simply because it is
uncorrelated with other variables, and so nothing else will act as a
surrogate.  It is interesting to realize that summing AIC scores is just a
special case
of the marginal inference so popular with Bayesians.  Here again, you have a
vast army of supporters, but it is still bogus for the same reasons that
summing AIC scores is.  Marginalization can create the appearance of support
for parameter values and models where only weak support exists in the
unmarginalized support surfaces.  Subhash Lele and I plan on including some
discussion of this in an upcoming paper on AIC in Ecology - so more details
later.
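
For readers who have not seen the practice in question, a minimal sketch of
"summed AIC weights" as a variable-importance measure follows (Python; the
models and AIC values are invented purely for illustration).  The summed
weight for a variable is just its marginal, model-averaged support, which is
why the same caution applies to Bayesian marginalization.

    import numpy as np

    # Hypothetical candidate models (terms included) and their AIC scores.
    models = {
        ("rain",):           210.3,
        ("rain", "temp"):    210.9,
        ("temp",):           212.4,
        ("rain", "density"): 211.1,
    }

    aic = np.array(list(models.values()))
    delta = aic - aic.min()
    weights = np.exp(-0.5 * delta)
    weights /= weights.sum()              # Akaike weights

    # Summed weight ("importance") of each variable: the quantity that can
    # overstate support for a weak effect that appears in many models.
    importance = {}
    for terms, w in zip(models.keys(), weights):
        for v in terms:
            importance[v] = importance.get(v, 0.0) + w
    print(importance)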

3) Mixing metaphors:
I disagree with the statistician who said that mixing statistical measures
is inappropriate and misleading.  This only holds if you take a behavioral
decision approach to statistics (many do), but I think in science it is much
more fruitful to seek to draw conclusions (sensu Tukey, J. W. (1960).
"Conclusions vs Decisions." Technometrics 2(4): 423-433).  Statistical
metrics are not cabalistic charms placed in your paper to ward off
reviewers, they are, or rather should be, tools to help you think about a
scientific problem.  As such, I have no problem with multiple measures; they
are not incompatible, they are just different.  Each gives you a different
tool: p-values are error statistics that measure the reliability of a
procedure, information criteria comparisons measure the relative support in
the data for different models, and R2 values are statistical adequacy
measures telling you how much of the potential information a given model
captures (see Lindsay, B. G. 2004. Statistical distances as loss functions
in assessing model adequacy. Pages 439-488 in M. L. Taper and S. R. Lele,
editors. The nature of scientific evidence: Statistical, philosophical and
empirical considerations. The University of Chicago Press, Chicago.). The
problem is that this is hard stuff; misinterpretation of statistics is
rampant among writers in the scientific literature, and it is even worse
among readers of the scientific literature.  I think the onus is on the
writer to do more than cite a statistical metric; he or she should also
interpret that metric.  What does it mean in the context of your thinking
about your study?  This will serve the dual function of helping educate
newbies and aiding greybeards in assessing whether there are any problems
with your thinking.  We all need to beware of the "I know you think you
understand what you thought I said, but I am not sure you realize that what
I said is not what I meant" syndrome.
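
As a small illustration of that "different tools" point, here is a toy
regression (Python, assuming statsmodels is available; the data are
simulated) with the three numbers printed side by side.  None replaces the
others: the p-value speaks to the reliability of the testing procedure, the
AIC only means something relative to a rival model's AIC, and the R2 says
how much of the variation this particular model captures.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 0.5 * x + rng.normal(size=50)      # toy data, purely illustrative

    fit = sm.OLS(y, sm.add_constant(x)).fit()

    print(fit.pvalues[1])   # error statistic: reliability of the procedure
    print(fit.aic)          # relative support: compare with a rival model
    print(fit.rsquared)     # adequacy: share of variation captured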

Best,
Mark L. Taper
