Some comments from Mark Taper. I have PDFs of the papers he mentions and will forward them along if asked. Thanks.
Tim [email protected]

Thanks for forwarding this dialogue to me. I am excited to see it happening. I think I can make a few useful comments.

1) Small a priori model sets and all-possible-subsets are both useful approaches to model identification. They do, however, serve different functions and have different costs and benefits. Small sets allow you to compare complex models with a lot of fine structure. This is good when you know a lot about your system; it risks missing important effects not considered. All possible subsets is extremely useful at the stage where background information leads you to suspect that any of a (possibly large) number of factors might be important. It will find cool stuff. The cost, of course, is model selection bias: models with spuriously good fits will pop up by chance. To compensate for model selection bias in large model sets, the complexity penalty needs to be increased. I discuss the problem in:

Taper, M. L. (2004). Model identification from many candidates. In M. L. Taper and S. R. Lele (eds.), The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: The University of Chicago Press, 448-524.

ABSTRACT: Model identification is a necessary component of modern science. Model misspecification is a major, if not the dominant, source of error in the quantification of most scientific evidence. Hypothesis tests have become the de facto standard for evidence in the bulk of scientific work. Consequently, because hypothesis tests require a single null and a single alternative hypothesis, there has been a very strong tendency to restrict the number of models considered in an analysis to two. I discuss the information criteria approach to model identification. The information criteria approach can be thought of as an extension of the likelihood ratio approach to the case of multiple alternatives.
However, it has been claimed that information criteria are "confused" by too many alternative models and that selection should occur among a limited set of models. I demonstrate that the information criteria approach can be extended to large sets of models. There is a tradeoff between the amount of model detail that can be accurately captured and the number of models that can be considered. This tradeoff can be incorporated in modifications of the parameter penalty term. This paper gives a simple procedure that will allow one to sort through large model sets using canned all-possible-subsets software.

The Type II error problem (perhaps a better term is false detection) is real when working with multiple models, but it is much more manageable than one might imagine. The evidential paradigm gives researchers much finer control of error than does the classical error frequency paradigm (sometimes called frequentist statistics). My colleague Subhash Lele and I discuss this in a forthcoming paper:

Taper, M. L. and S. R. Lele (2010, expected). Evidence, Evidence Functions, and Error Probabilities. In M. R. Forster and P. S. Bandyopadhyay (eds.), Handbook for Philosophy of Statistics. Elsevier.

ABSTRACT: We discuss the evidential paradigm as we see it currently developing. We characterize evidential statistics as an epistemological tool and provide a list of qualities we feel would make this tool most effective. Evidentialism is often equated with likelihoodism, but we see likelihoodism as only an important special case of a broader class of evidential statistics. Our approach gives evidentialism a theoretical foundation which likelihoodism lacked and allows extensions which solve a number of statistical problems. We discuss the role of error probabilities in evidential statistics, and develop several new error probability measures.
These measures are likely to prove useful in practice, and they certainly help to clarify the relationship between evidentialism and Neyman-Pearson style error statistics.

While I am shamelessly shilling my own papers, another good read is:

Taper, M. L., D. F. Staples, et al. (2008). "Model structure adequacy analysis: selecting models on the basis of their ability to answer scientific questions." Synthese 163(3): 357-370.

ABSTRACT: Models carry the meaning of science. This puts a tremendous burden on the process of model selection. In general practice, models are selected on the basis of their relative goodness of fit to data penalized by model complexity. However, this may not be the most effective approach for selecting models to answer a specific scientific question, because model fit is sensitive to all aspects of a model, not just those relevant to the question. Model Structural Adequacy analysis is proposed as a means to select models based on their ability to answer specific scientific questions given the current understanding of the relevant aspects of the real world.

2) Summing AIC scores: I agree with other commentators that this is generally bogus. However, be ready for a fight if you express that opinion to an editor. Dave Hewit's comments point out the obvious problems with validating preconceptions, but similar problems can arise even without preconceptions. Consider the all-possible-subsets scenario discussed above: a variable with a real but weak effect may be included in many supported models simply because it is uncorrelated with the other variables, so nothing else will act as a surrogate for it. It is interesting to realize that summing AIC scores is just a special case of the marginal inference so popular with Bayesians. Here again, you have a vast army of supporters, but it is still bogus for the same reasons that summing AIC scores is.
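To make the summing concrete: a toy sketch of the procedure being criticized, fitting every subset of candidate predictors by ordinary least squares, converting AIC differences into Akaike weights, and summing each variable's weight over the models that contain it. The data, variable count, and AIC formula here are my own illustrative choices (pure-noise predictors, plain-numpy OLS), not anything from the discussion; the point is only that even noise variables accumulate nontrivial summed weights.

```python
# Hypothetical illustration: summed Akaike weights over an all-subsets scan.
# All predictors are pure noise, yet each still collects summed "importance".
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 6
X = rng.normal(size=(n, p))   # candidate predictors: pure noise
y = rng.normal(size=n)        # response: unrelated to every predictor

def aic(subset):
    """AIC for an OLS fit of y on an intercept plus the given columns."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    rss = resid @ resid
    k = Z.shape[1] + 1        # regression coefficients + error variance
    return n * np.log(rss / n) + 2 * k

# Every subset of the p predictors (2^p models), Akaike weights from
# AIC differences, and each variable's weight summed over its models.
subsets = [s for r in range(p + 1)
           for s in itertools.combinations(range(p), r)]
aics = np.array([aic(s) for s in subsets])
w = np.exp(-0.5 * (aics - aics.min()))
w /= w.sum()
summed = [sum(w[i] for i, s in enumerate(subsets) if j in s)
          for j in range(p)]
print(np.round(summed, 2))
```

With everything noise, none of these summed weights deserves a substantive interpretation, which is exactly the worry above: the marginal sum manufactures apparent support.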
Marginalization can create the appearance of support for parameter values and models where only weak support exists in the unmarginalized support surfaces. Subhash Lele and I plan on including some discussion of this in an upcoming paper on AIC in Ecology -- so more details later.

3) Mixing metaphors: I disagree with the statistician who said that mixing statistical measures is inappropriate and misleading. This only holds if you take a behavioral decision approach to statistics (many do), but I think in science it is much more fruitful to seek to draw conclusions (sensu Tukey, J. W. (1960). "Conclusions vs Decisions." Technometrics 2(4): 423-433). Statistical metrics are not cabalistic charms placed in your paper to ward off reviewers; they are, or rather should be, tools to help you think about a scientific problem. As such, I have no problem with multiple measures; they are not incompatible, they are just different. Each gives you a different tool: p-values are error statistics that measure the reliability of a procedure, information criteria comparisons measure the relative support in the data for different models, and R2 values are statistical adequacy measures telling you how much of the potential information a given model captures (see Lindsay, B. G. 2004. Statistical distances as loss functions in assessing model adequacy. Pages 439-488 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. The University of Chicago Press, Chicago.).

The problem is that this is hard stuff. Misinterpretation of statistics is rampant among writers in the scientific literature -- and it is even worse among readers. I think the onus is on the writer to do more than cite a statistical metric; he or she should also interpret that metric. What does it mean in the context of your thinking about your study?
This will serve the dual function of helping educate newbies and aiding greybeards in assessing whether there are any problems with your thinking. We all need to beware of the "I know you think you understand what you thought I said, but I am not sure you realize that what I said is not what I meant" syndrome.

Best,
Mark L. Taper
