Re: normality

DeSolla,Shane [Burlington] Fri, 25 Aug 2006 09:58:55 -0700

Hi,

I obviously was not clear enough. I was not arguing that the raw data
had to be normal.


One of your statements was, "Assuming bivariate normality means that
each of the variables should be normally distributed..."

I read this as saying that the raw data had to be normal. Your response
indicates that you did not mean that. Neither did I, as my text
indicated.

Thanks for the detailed response.

Cheers,
Shane


> -----Original Message-----
> From: Highland Statistics Ltd. [mailto:[EMAIL PROTECTED] 
> Sent: Friday, August 25, 2006 12:29 PM
> To: DeSolla,Shane [Burlington]; [EMAIL PROTECTED]
> Cc: [email protected]
> Subject: RE: normality
> 
> At 16:27 25/08/2006, DeSolla,Shane [Burlington] wrote:
> 
> >Sorry to take this off list, but I wasn't sure if I was grossly 
> >misunderstanding what you two are saying in the following email 
> >exchange. My apologies if I am not reading you correctly. See my 
> >comments below:
> >
> > > -----Original Message-----
> > > From: Ecological Society of America: grants, jobs, news
> > > [mailto:[EMAIL PROTECTED]] On Behalf Of Highland Statistics Ltd.
> > > Sent: Friday, August 25, 2006 9:30 AM
> > > To: [email protected]
> > > Subject: Re: PCA question
> > >
> > > On Thu, 24 Aug 2006 11:04:42 -0300, James J. Roper
> > > <[EMAIL PROTECTED]> wrote:
> > >
> > > >Steve,
> > > >
> > >
> > > Dear Jim,
> >
> ><< SNIPPED >>
> >
> > > >The variance is only a good estimate of the "true" variance if the
> > > >distribution is normal or transformable to normality, and so,
> > > >normality is required. Correlations, to be meaningful, also require
> > > >normality, as the statistics program is not using a covariance
> > > >matrix based on ranks (Spearman).
> >
> >What has to be normal? The raw data?
> 
> No, not the raw data; that is a misconception.
> You have to assume that if you repeated the sampling under the same
> environmental conditions, you would measure very similar values.
> Suppose you have the money/time/energy to do this: go into the field
> 100 times under the same environmental conditions, and sample (have
> fun). If you then make a scatterplot of your Y versus your X, you
> would hope to see, at each X value, the Y values spread in a
> bell-shaped pattern showing the range of all possible realisations.
> If it really is a bell-shaped pattern, you can assume normality. If
> the spread at each X value is also the same, you can assume
> homogeneity. Very often this is not the case, so you take a hammer
> and knock on the data until the spread at each X value is the same
> (and that is called a transformation). More elegant options are
> available (e.g. different variances per stratum); see for example
> chapter 5 in Pinheiro and Bates for further options.
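
The repeated-sampling picture above can be sketched in Python (standard library only). Everything specific here is an assumption made for illustration and does not come from the thread: the model Y = 2 * x * exp(noise), the three X values, and the noise level. The sketch draws 100 repeated samples at each X and compares the spread of Y before and after a log transformation.

```python
import math
import random
import statistics


def spread_by_x(x_values, n_repeats=100, seed=1):
    """Return {x: (sd of raw Y, sd of log Y)} for repeated samples at each x."""
    rng = random.Random(seed)
    result = {}
    for x in x_values:
        # Hypothetical model: multiplicative noise, so the raw spread grows with x.
        ys = [2.0 * x * math.exp(rng.gauss(0.0, 0.3)) for _ in range(n_repeats)]
        logs = [math.log(y) for y in ys]
        result[x] = (statistics.stdev(ys), statistics.stdev(logs))
    return result


spreads = spread_by_x([1, 5, 25])
for x, (sd_raw, sd_log) in spreads.items():
    print(f"x={x:>2}  sd(Y)={sd_raw:.2f}  sd(log Y)={sd_log:.2f}")
```

Under this assumed model the raw spread should grow with X, while the spread of log Y should stay roughly constant, which is the "hammer" doing its job. With real data the right transformation, if any, has to be judged from the residuals.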
> 
> Very often, people don't have multiple observations per X value; very
> often only one (especially in field studies). So, technically, you
> can't check for normality or homogeneity at each X value. You can only
> pool all the residuals and hope that these are normally distributed.
> But it is not conclusive.
> 
> Now, confusion arises between normality of the raw data and normality
> of the residuals. Technically, you can show that normality of the raw
> data (Y), given the X, implies normality of the residuals. You also
> have to assume that the X are without error, or else it all goes
> wrong. There is some text in Faraway (2004) that shows how, why and
> where it goes wrong.
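
One well-known way it goes wrong when X is measured with error is that the fitted slope shrinks toward zero. The sketch below is a small simulation of that effect in Python (standard library only). The true slope, sample size and error sizes are assumptions chosen for illustration, and this is not claimed to be the example Faraway uses.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_slopes(n=5000, true_slope=2.0, x_error_sd=1.0, seed=2):
    """Return (slope using the true X, slope using X observed with error)."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0.0, 0.5) for x in true_x]
    # The analyst only sees X plus measurement error.
    observed_x = [x + rng.gauss(0.0, x_error_sd) for x in true_x]
    return ols_slope(true_x, ys), ols_slope(observed_x, ys)


slope_clean, slope_noisy = simulate_slopes()
print(f"slope with error-free X: {slope_clean:.2f}")
print(f"slope with noisy X:      {slope_noisy:.2f}")
```

With these settings the second slope should come out near half of the true value, because the attenuation factor var(X) / (var(X) + var(error)) is 1 / (1 + 1).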
> 
> 
> >Or the error? I am no statistician, but it sounds like you are talking
> >about the distribution of the data. If so, why should the raw data be
> >normally distributed?
> 
> As explained above...it is the residuals. I would not 
> recommend checking for normality of the raw data. I see 
> students panicking with bimodal histograms of raw data..only 
> to discover that the bimodality is caused by a sex 
> effect...and the residuals were perfectly normally distributed.
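
The sex-effect situation described above is easy to reproduce with simulated data. This is a minimal Python sketch (standard library only) in which the group means, the common spread and the sample sizes are all hypothetical. It uses excess kurtosis as a rough summary: clearly negative for a two-humped histogram, close to zero for a normal one.

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis: about 0 for normally distributed data."""
    m = statistics.fmean(values)
    var = statistics.fmean([(v - m) ** 2 for v in values])
    fourth = statistics.fmean([(v - m) ** 4 for v in values])
    return fourth / var ** 2 - 3.0


def simulate_sex_effect(n_per_sex=2000, seed=3):
    """Return (raw data, residuals after removing each sex's mean)."""
    rng = random.Random(seed)
    # Hypothetical measurements: the sexes differ in mean but not in spread.
    data = {
        "F": [rng.gauss(10.0, 1.0) for _ in range(n_per_sex)],
        "M": [rng.gauss(20.0, 1.0) for _ in range(n_per_sex)],
    }
    raw = data["F"] + data["M"]
    residuals = []
    for values in data.values():
        group_mean = statistics.fmean(values)
        residuals.extend(v - group_mean for v in values)
    return raw, residuals


raw, residuals = simulate_sex_effect()
print(f"excess kurtosis of raw data:  {excess_kurtosis(raw):.2f}")
print(f"excess kurtosis of residuals: {excess_kurtosis(residuals):.2f}")
```

The raw data are strongly bimodal, yet the residuals from a model that includes sex are what the normality assumption is about, and here they are normal by construction.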
> 
> 
> 
> 
> > > In some situations perhaps yes, but I can also imagine situations in
> > > which this does not hold. Suppose you are interested in the
> > > correlation between a species abundance and temperature. Assuming
> > > bivariate normality means that each of the variables should be
> > > normally distributed. So most of your temperature values should be
> > > clumped around a certain value. If all the fun happens in this
> > > specific temperature regime, then that is fine. But if you have long
> > > gradients, it is perhaps better to take an equal number of samples
> > > along the temperature gradient (this is also one of the assumptions
> > > in methods like canonical correspondence analysis and redundancy
> > > analysis; see Ter Braak 1986).
> >
> >Again, it sounds like you two are arguing that the data has to be 
> >normally distributed.
> 
> No, the residuals. Other things you should do are:
> 1. Plot residuals versus fitted values. Check whether the spread is
>    the same everywhere. If not, you are in trouble (heterogeneity).
>    Solution: add more covariates, improve your model, add
>    interactions, allow for different variances using GLS or mixed
>    modelling, etc. Consider a Poisson distribution, or something
>    stronger.
> 2. Plot residuals versus each explanatory variable. You don't want to
>    see any patterns. If you do see patterns, trouble. Consider adding
>    more covariates, a different model, or smoothing methods like GAM,
>    among many other options.
> 3. Investigate the model for influential observations.
> 4. Check for independence.
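
The first check in the list above (spread of residuals versus fitted values) can be given a crude numerical form. The Python sketch below (standard library only) fits a straight line by least squares, splits the residuals at the median fitted value, and compares their standard deviations. The simulated data, the helper names and the half-split summary are assumptions for illustration; a plot remains the usual tool.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def spread_ratio(xs, ys):
    """sd of residuals at high fitted values divided by sd at low fitted values."""
    intercept, slope = fit_line(xs, ys)
    # Pair each fitted value with its residual, then sort by fitted value.
    pairs = sorted(
        (intercept + slope * x, y - (intercept + slope * x))
        for x, y in zip(xs, ys)
    )
    half = len(pairs) // 2
    lower = [resid for _, resid in pairs[:half]]
    upper = [resid for _, resid in pairs[half:]]
    return statistics.stdev(upper) / statistics.stdev(lower)


def simulate(noise_sd, n=2000, seed=4):
    """Hypothetical data: y = 2 + 3x plus noise whose sd is noise_sd(x)."""
    rng = random.Random(seed)
    xs = [rng.uniform(1.0, 10.0) for _ in range(n)]
    ys = [2.0 + 3.0 * x + rng.gauss(0.0, noise_sd(x)) for x in xs]
    return xs, ys


homogeneous = spread_ratio(*simulate(lambda x: 2.0))
heterogeneous = spread_ratio(*simulate(lambda x: 0.5 * x))
print(f"constant noise:     ratio = {homogeneous:.2f}")
print(f"noise grows with x: ratio = {heterogeneous:.2f}")
```

A ratio near 1 is consistent with equal spread; a ratio well above 1 is the fan shape that signals heterogeneity. The same residuals can be sorted by each explanatory variable instead of by fitted value to look for patterns as in the second check.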
> 
> If any of these points is violated, then you are in trouble.
> I have yet to see a publication using ecological data in which linear
> regression is applied correctly. Anyone who has one, please send me
> the pdf and data so that I can use it in academic courses. I can point
> to various stats books and Nature papers where the results show
> residual patterns, violation of independence, or unequal spread.
>
>
> So much for a quick regression course.
> 
> Alain
> www.highstat.com
> 
> 
> 
> 
> >Here is a thought experiment (though you can try this if you like).
> >Take a uniform distribution (very non-normal, right?). Take a random
> >sample of, say, 12 observations. Calculate the mean.
> >Take another sample of 12, calculate the mean.
> >Repeat 1000 times. Plot the distribution of the randomly generated
> >means. I bet the distribution of the means will be approximately
> >normal, even though the raw data definitely isn't. Hence, the
> >assumption of ANOVAs, for example, is that the expected distribution
> >of the means is normal (or, if you like, the residuals), not the raw
> >data. Ditto for regressions. If the raw data is normally distributed,
> >that is a sufficient condition for the residuals to be normal, but it
> >is not a necessary condition. Hence, if you show the data is normal,
> >everything is fine, but if the raw data is not normal, that alone is
> >not sufficient for the assumptions of ANOVAs (correlations,
> >regressions, etc.) to be violated.
> >Do not the assumptions of multivariate normality follow similar
> >logic?
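
The thought experiment above takes a few lines of Python to try (standard library only). The sample size of 12 and the 1000 repeats come from the text; the seed and the summary statistics chosen here are assumptions for illustration.

```python
import random
import statistics


def sample_means(n_means=1000, sample_size=12, seed=5):
    """Means of repeated samples drawn from a uniform(0, 1) distribution."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.random() for _ in range(sample_size)])
        for _ in range(n_means)
    ]


def summarise(means):
    """Return (centre, spread, share of means within one sd of the centre)."""
    centre = statistics.fmean(means)
    spread = statistics.stdev(means)
    within = sum(abs(m - centre) <= spread for m in means) / len(means)
    return centre, spread, within


centre, spread, within = summarise(sample_means())
print(f"mean of the means: {centre:.3f}")
print(f"sd of the means:   {spread:.3f}")
print(f"share within 1 sd: {within:.2f}")
```

For a uniform(0, 1) parent the means should centre on 0.5 with a standard deviation near sqrt((1/12) / 12) = 1/12, and roughly 68% of them should fall within one standard deviation of the centre, as a normal distribution would give. A histogram of the list returned by `sample_means()` shows the bell shape directly.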
> >
> >Or, as I said, am I completely misreading what you two are saying?
> >
> >I have snipped the rest of the discussion, as you seemed to have 
> >(correctly, in my not-so-educated opinion) switched to discussing 
> >assumptions around the residuals, rather than the raw data. But it 
> >doesn't seem to fit what you have discussed earlier.
> >
> >Hope you don't mind this email - I am still trying to learn...
> >
> >Cheers,
> >Shane
> >
> >_____________________________________________
> >Shane de Solla
> >Wildlife Conservation Biologist
> >Canadian Wildlife Service
> >Canada Centre for Inland Waters
> >867 Lakeshore Road
> >Box 5050
> >Burlington, ON
> >L7R 4A6
> >Canada
> >
> >phone   905-336-4686
> >fax        905-336-6434
> >
> >Opinions expressed are those of the author and do not represent those
> >of his employer.
> >
> 
> 
> 
> Dr. Alain F. Zuur
> Highland Statistics Ltd.
> 6 Laverock road
> UK - AB41 6FN Newburgh
> 
> Tel: 0044 1358 788177
> Email: [EMAIL PROTECTED]
> URL: www.highstat.com
> URL: www.brodgar.com
> 
> Our statistics courses:
> 1. "Analysing biological and environmental data using 
> univariate methods".
> 2. "Analysing biological and environmental data using 
> multivariate methods"
> 3. "Analysing biological and environmental data using time 
> series analysis"
> 4. "Analysing biological and environmental data using mixed 
> modelling, GLMM and GAMM"
> 5. "An introduction to R"
> 
> Brodgar: Software for univariate and multivariate analysis
> and multivariate time series analysis
> Brodgar complies with R GNU GPL license
> 
> Statistical consultancy, courses, data analysis and software
> 
> 
> 
>
