Re: [ECOLOG-L] Question: Is grouping/binning appropriate in regression analysis?

Andrew Rominger Mon, 29 Mar 2010 10:39:32 -0700

Hi Francisco,

I don't know of any papers, but binning has the potential to seriously
change your estimate of the slope (and intercept, for that matter) of your
log-log data.  The fewer bins you use, the worse off you are.  When you have
very few bins you wind up with a scaling parameter (i.e. slope) tending
towards positive numbers.  When you think about it, the scaling parameter
with 2 data points should be 1--i.e. a straight line.


This all may depend on what the population scaling parameter is, and how your
independent variable is distributed.  I looked into a case where there was a
-0.25 scaling parameter, with 100 data points, body size log-normally
distributed.  I've pasted my R code below.  Even if you don't know R, you
can download it, open it, and just paste this code in.

So the conundrum is that to compare with other studies (if indeed that's
even possible) you'll need to bin your data.  But to do a sound analysis of
your data you should probably not bin.  A comparison with other studies may
not be reasonable unless you can be sure that everyone's data respond
the same way to binning.

Hope that helps, good luck,
Andy

x <- rlnorm(100,meanlog=0,sdlog=1)

# is this what your independent data look like?
plot(density(x,from=0),main="body mass data")

y <- 2*x^-0.25 + rnorm(100,mean=0,sd=0.25)

# is this what your log-log plot looks like?
plot(log(x),log(y),main="body mass, density relationship")

print("estimated coefficients")
lm(log(y)~log(x))$coeff

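# refit the regression with 2 to 100 equal-width bins on log(x)
slopes <- numeric(99)
for(i in 2:100) {
    bins <- cut(log(x),breaks=i)
    # bin position: midpoint of the observed log(x) values in each bin
    x.bin <- tapply(log(x),bins,function(v){(min(v)+max(v))/2})
    # bin response: sum of y within each bin (empty bins give NA, which lm drops)
    y.bin <- tapply(y,bins,sum)
    slopes[i-1] <- coef(lm(log(y.bin)~x.bin))[2]
}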

# a plot to see the effect of binning on the estimated slope
plot(2:100,slopes,xlab="Number of bins",ylab="slope")
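
One possible way to see where the distortion comes from, as a rough sketch
only: summing within bins folds the number of species per bin into the
response, since log(sum of y) is roughly log(count) + log(mean y).  The
snippet below compares summing against averaging within bins.  The helper
binned_slope, the seed, the bin counts tried, and the multiplicative error
term (used so that y stays positive) are all assumptions made for
illustration; they are not part of the simulation above.

```r
# Sketch: does the binned slope change if y is averaged instead of summed?
set.seed(42)  # arbitrary seed, only so the run is repeatable

n <- 100
x <- rlnorm(n, meanlog = 0, sdlog = 1)
# ASSUMPTION: multiplicative (log-normal) error, so y is always positive
y <- 2 * x^-0.25 * exp(rnorm(n, mean = 0, sd = 0.25))

# Hypothetical helper: slope of log(aggregated y) against bin midpoint
binned_slope <- function(x, y, n_bins, agg = sum) {
  lx <- log(x)
  bins <- cut(lx, breaks = n_bins)
  # midpoint of the observed log(x) values in each bin
  x_bin <- as.vector(tapply(lx, bins, function(v) (min(v) + max(v)) / 2))
  # aggregate the untransformed response within each bin
  y_bin <- as.vector(tapply(y, bins, agg))
  keep <- !is.na(x_bin)  # drop empty bins
  fit <- lm(log(y_bin[keep]) ~ x_bin[keep])
  unname(coef(fit)[2])
}

# Unbinned slope for reference (the simulated value is -0.25)
slope_raw <- unname(coef(lm(log(y) ~ log(x)))[2])

bin_counts <- c(2, 5, 10, 20)
result <- data.frame(
  bins       = bin_counts,
  slope_sum  = sapply(bin_counts, function(k) binned_slope(x, y, k, sum)),
  slope_mean = sapply(bin_counts, function(k) binned_slope(x, y, k, mean))
)

print(slope_raw)
print(result)
```

If the slope_sum column moves around with the number of bins much more than
slope_mean does, that would suggest the summed response is tracking how many
species fall in each bin rather than the size-density relationship itself.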




On Sun, Mar 28, 2010 at 12:17 PM, James J. Roper <[email protected]> wrote:

> The question really is, why form groups when you already have the two
> continuous numerical variables that you want?  That is, what is the benefit
> of grouping?  I can think of none.  I personally think this is a historical
> thing that started when computers were unavailable and it reduced the
> mathematics to a doable level.  Today, the stats work without grouping.
>
> Jim
>
> On Fri, Mar 26, 2010 at 09:30, Francisco de Castro <[email protected]
> >wrote:
>
> > Hi all,
> >
> > I have a question for the list regarding grouping (binning) of the
> > independent variable in a linear regression. This is routinely done
> > (at least in limnology) in studies involving so-called biomass
> > size-spectra. I'm aware of other (better) methods to fit non-linear
> > models. However, I need to compare my results with older literature
> > where this method is used widely, and I'd like to know first if the
> > method has a problem or if it is outright wrong.
> >
> > My independent variable is mean body size of the individuals of a
> > species (M) and the dependent is either biomass (B, g/m2) or
> > population density (D, indiv/m2) of the species. Body size is
> > lognormally distributed, and the number of species in the sample is
> > ~100. The model to fit is: D= aM^b. First, data are log-transformed in
> > order to apply linear least-squares regression. So the model becomes
> > log(D)= log(a)+ b log(M). The appropriateness of this transformation
> > and possible bias in the estimation of parameters have been discussed
> > before (Zar, Smith, others) so my question is not about that. After
> > log-transforming, sizes are grouped into even-spaced categories, and
> > the densities/biomasses for all sizes within a size group are summed
> > up. So, the independent variable becomes the center of each
> > log-size-bin, and the dependent becomes the sum of all log-densities
> > for each size-bin. Obviously, the number of data gets reduced from the
> > original N to the number of size groups/bins used. After grouping, the
> > log-log model is fitted by least-squares regression.
> >
> > So my questions are:
> > Is this binning of a log-transformed variable statistically
> > appropriate for this problem?
> > Shouldn't it be better to use directly the size and density for each
> > species without any grouping?
> >
> > Thanks in advance for any suggestion or literature.
> > Cheers
> >
> > Francisco de Castro
> > Potsdam University
> >
>
