sturlamolden wrote:
> robert wrote:
>
>>> t = r * sqrt( (n-2)/(1-r**2) )
>
>> yet too lazy/practical for digging these things from there. You obviously
>> got it - out of that, what would be a final estimate for an error range of r
>> (n big)?
>> That same "const. * (1-r**2)/sqrt(n)" which I found in that other document?
>
> I gave you the formula. Solve for r and you get the confidence interval.
> You will need to use the inverse cumulative Student t distribution.
>
> Another quick-and-dirty solution is to use bootstrapping.
>
> from numpy import mean, std, sum, sqrt, sort
> from numpy.random import randint
>
> def bootstrap_correlation(x,y):
>     idx = randint(len(x), size=(1000, len(x)))
>     bx = x[idx]   # resamples x with replacement
>     by = y[idx]   # resamples y with replacement
>     mx = mean(bx, 1)
>     my = mean(by, 1)
>     sx = std(bx, 1)
>     sy = std(by, 1)
>     r = sort(sum( (bx - mx.repeat(len(x),0).reshape(bx.shape)) *
>                   (by - my.repeat(len(y),0).reshape(by.shape)), 1) /
>              ((len(x)-1)*sx*sy))
>     # bootstrap confidence interval (NB! biased)
>     return (r[25], r[975])
>
>> My main concern is how to respect the fact that the (x,y) points may not
>> distribute well along the regression line.
>
> The bootstrap is "non-parametric" in the sense that it is distribution
> free.
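For concreteness, here is what "solve for r" might look like in code - a minimal
sketch only, assuming SciPy is available for the inverse Student t; the helper
names critical_r and approx_r_interval are made up for illustration. Inverting
t = r*sqrt((n-2)/(1-r**2)) at the critical t gives the smallest |r| that differs
significantly from zero at level alpha, and plugging the critical t into the
asymptotic standard error (1-r**2)/sqrt(n) gives a rough symmetric interval
around an observed r (only sensible for large n and moderate |r|).

from math import sqrt
from scipy import stats

def critical_r(n, alpha=0.05):
    # smallest |r| significant at level alpha, obtained by solving
    # t = r*sqrt((n-2)/(1-r**2)) for r at the critical t value
    tc = stats.t.ppf(1.0 - alpha/2.0, n - 2)
    return tc / sqrt(n - 2 + tc**2)

def approx_r_interval(r, n, alpha=0.05):
    # rough symmetric interval r +/- t_crit * (1-r**2)/sqrt(n);
    # large-n approximation, not clipped to the range [-1, 1]
    tc = stats.t.ppf(1.0 - alpha/2.0, n - 2)
    se = (1.0 - r**2) / sqrt(n)
    return r - tc*se, r + tc*se

For n = 50, critical_r(50) comes out around 0.28, the same order as twice the
~0.13 standard error that shows up in the session below.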
thanks for the bootstrap tester. It mainly confirms the
"r_stderr = (1-r**2)/sqrt(n)" formula. The asymmetry of r (-1..+1) is less of a
problem.

Yet my main problem - how to respect a clumpy distribution of the data points -
is still the same. In practice, think of a situation where data out of an
experiment has an unknown damping/filter (or whatever unknown data clumper) on
it, thus lots of redundancy in effect.

An extreme example is to just duplicate data:

>>> x ,y =[0.,0,0,0,1]*10 ,[0.,1,1,1,1]*10
>>> xx,yy=[0.,0,0,0,1]*100,[0.,1,1,1,1]*100
>>> correlation(x,y)
(0.25, 0.132582521472, 0.25, 0.75)
>>> correlation(xx,yy)
(0.25, 0.0419262745781, 0.25, 0.75)
>>> bootstrap_correlation(array(x),array(y))
(0.148447544378, 0.375391432338)
>>> bootstrap_correlation(array(xx),array(yy))
(0.215668822617, 0.285633303438)
>>>

Here the bootstrap test tells us as well that the confidence interval narrows
down by a factor of ~sqrt(10) - just the same as if there were 10-fold more
well-distributed "new" data. Thus this kind of error estimation has no
reasonable basis for data which is not very good.

The interesting task is probably this: to check for linear correlation but
"weight the clumping of the data" somehow for the error estimation. So far I
can only think of some kind of geometric density approach... Or is there a
commonly known, straightforward approach/formula for this problem? (One such
approach is sketched below.)

In that formula which I only weakly remember, I think there were other basic
sum terms like sum_xxy, sum_xyy, ... in it (terms which are not needed for the
formula for r itself).

Robert
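One approach that is commonly used for exactly this kind of redundancy (not
something from this thread, just a sketch of the idea) is a cluster bootstrap:
resample whole groups of mutually redundant points instead of individual
points, so the interval width reflects the number of effectively independent
clumps. The helper below is hypothetical; choosing the group labels (exact
duplicates, blocks of a filtered signal, repeated runs of the experiment, or
clusters from some density estimate) is the actual modelling decision and is
left to the caller.

import numpy as np

def cluster_bootstrap_correlation(x, y, groups, nboot=1000, seed=None):
    # Resample whole groups of redundant observations with replacement;
    # the interval width then reflects the number of independent clumps
    # rather than the raw number of points.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    members = [np.flatnonzero(groups == g) for g in np.unique(groups)]
    rs = np.empty(nboot)
    for b in range(nboot):
        picked = rng.integers(len(members), size=len(members))
        idx = np.concatenate([members[i] for i in picked])
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    rs.sort()
    # crude percentile interval, as in bootstrap_correlation above
    return rs[int(0.025 * nboot)], rs[int(0.975 * nboot)]

For the duplicated example above, giving all copies of the same original point
one group label (e.g. something like groups = np.tile(np.arange(50), 10) for
xx, yy) makes the interval for (xx, yy) come out roughly as wide as the plain
bootstrap interval for (x, y), instead of ~sqrt(10) narrower.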