Robert Kern wrote:
> robert wrote:
>> Is there a ready-made function in numpy/scipy to compute the correlation
>> y=mx+o of an X and Y fast:
>> m, m-err, o, o-err, r-coef, r-coef-err ?
>
> And of course, those three parameters are not particularly meaningful
> together.
> If your model is truly "y is a linear response given x with normal noise"
> then "y=m*x+o" is correct, and all of the information that you can get
> from the data will be found in the estimates of m and o and the covariance
> matrix of the estimates.
>
> On the other hand, if your model is that "(x, y) is distributed as a
> bivariate normal distribution" then "y=m*x+o" is not a particularly good
> representation of the model. You should instead estimate the mean vector
> and covariance matrix of (x, y). Your correlation coefficient will be the
> off-diagonal term after dividing out the marginal standard deviations.
>
> The difference between the two models is that the first places no
> restrictions on the distribution of x. The second does; both the x and y
> marginal distributions need to be normal. Under the first model, the
> correlation coefficient has no meaning.
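( To make the two estimates concrete in numpy terms, a rough sketch - the
function names are just illustrative and the standard errors are the plain
textbook least-squares ones, nothing more: )

import numpy as np

def fit_line(x, y):
    # model 1: y = m*x + o with normal noise; estimate m, o and their
    # standard errors by ordinary least squares (textbook formulas)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = ((x - xbar) ** 2).sum()
    Sxy = ((x - xbar) * (y - ybar)).sum()
    m = Sxy / Sxx
    o = ybar - m * xbar
    resid = y - (m * x + o)
    s2 = (resid ** 2).sum() / (n - 2)              # residual variance
    m_err = np.sqrt(s2 / Sxx)
    o_err = np.sqrt(s2 * (1.0 / n + xbar ** 2 / Sxx))
    return m, m_err, o, o_err

def bivariate_fit(x, y):
    # model 2: (x, y) bivariate normal; estimate the mean vector and the
    # covariance matrix, and read r off the off-diagonal term
    mu = np.array([np.mean(x), np.mean(y)])
    C = np.cov(x, y)
    r = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])
    return mu, C, r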
I think the difference is small in practice - when you are heading for
usable diagonals. Looking at the bivariate coef first, before going on to
any model, seems to be a more stable approach for the first step in data
mining (before you proceed to a model or to class-learning ...).

Basically the first need is to analyse lots of x,y data and check for
linear dependencies. No real model so far. I'd need a quality measure
(coef**2) and to know how much I can rely on it (coef-err). coef alone is
not enough: you get a perfect 1.0 with 2 (or 3 - see below) points. With
big coefs and lots of well-distributed data the coef is quite good by
itself - its error range err(N) goes down only roughly ~ 1/sqrt(N).

One would expect the error range to drop simply with the number of points.
Yet it depends in a more complex way on the value of the coef itself and on
the distribution of the data. More interesting are real-world cases: for
example I see a low correlation on lots of points - maybe coef=0.05. Is it
real - or not? Thus low coefs naturally require a coef-err to be useful in
practice.

Now think of adding 'boring data':

>>> X=[1.,2,3,4]
>>> Y=[1.,2,3,5]
>>> sd.correlation((X,Y))       # my old func
(1.3, -0.5, 0.982707629824)     # m, o, coef
>>> numpy.corrcoef((X,Y))
array([[ 1.        ,  0.98270763],
       [ 0.98270763,  1.        ]])
>>> XX=[1.,1,1,1,1,2,3,4]
>>> YY=[1.,1,1,1,1,2,3,5]
>>> sd.correlation((XX,YY))
(1.23684210526, -0.289473684211, 0.988433774639)
>>>

I'd expect: the little increase of r is ok, but this 'boring data' should
not make the error estimate go down simply ~ 1/sqrt(N) ...

I remember once seeing a formula for an error range of the corrcoef, but
cannot find it anymore.

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Trivia
says: In MATLAB, corr(X) calculates Pearson's correlation coefficient along
with a p-value. Does anybody know how this probability value is
computed/motivated? Such a thing would be very helpful for numpy/scipy too.

http://links.jstor.org/sici?sici=0162-1459(192906)24%3A166%3C170%3AFFPEOC%3E2.0.CO%3B2-Y
tells: probable error of r = 0.6745*(1-r**2)/sqrt(N)

A simple function of r and N - quite what I expected above, roughly, for
the N-only dependence. But it is thus not sensitive to the considerations
above about 'boring' data: with the above example it would report a
decrease of this probable coef-err from 0.0115628571429 to
0.00548453410954 ! And the absolute size of this error measure seems too
low for just 4 points of data!

The other formula which I remember seeing once was much more sophisticated
and used things like sum_xxy etc...

Robert

PS: my old func is simply hands-on, based on

  n, sum_x, sum_y, sum_xy, sum_xx, sum_yy = \
      len(vx), vx.sum(), vy.sum(), (vx*vy).sum(), (vx*vx).sum(), (vy*vy).sum()

(a rough sketch of such a function is appended below). I guess it is
already fast for large data?

Note: numpy.corrcoef fails on 2 points:

>>> numpy.corrcoef(([1,2],[1,2]))
array([[ -1.#IND, -1.#IND],
       [ -1.#IND, -1.#IND]])
>>> sd.correlation(([1,2],[1,2]))
(1, 0, 1.0)
>>>
>>> numpy.corrcoef(([1,2,3],[1,2,3]))
array([[ 1.,  1.],
       [ 1.,  1.]])
>>> sd.correlation(([1,2,3],[1,2,3]))
(1, 0, 1.0)

PPS: A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some
weeks back. I think many users currently suffer when trying to get started
with the incompatible most-recent libs of scipy and numpy.
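For completeness, a rough sketch of what such a hands-on correlation() can
look like, built only from n, sum_x, sum_y, sum_xy, sum_xx, sum_yy, together
with the 'probable error' formula quoted above. Names and details are only
illustrative - this is not the actual sd.correlation:

import numpy as np

def correlation(vx, vy):
    # slope m, offset o and Pearson r, computed only from the running sums
    # n, sum_x, sum_y, sum_xy, sum_xx, sum_yy (as in the PS above)
    vx, vy = np.asarray(vx, dtype=float), np.asarray(vy, dtype=float)
    n = len(vx)
    sum_x, sum_y = vx.sum(), vy.sum()
    sum_xy = (vx * vy).sum()
    sum_xx, sum_yy = (vx * vx).sum(), (vy * vy).sum()
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    o = (sum_y - m * sum_x) / n
    r = (n * sum_xy - sum_x * sum_y) / np.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return m, o, r

def probable_error(r, n):
    # 'probable error of r' from the JSTOR reference above:
    # 0.6745 * (1 - r**2) / sqrt(N)
    return 0.6745 * (1.0 - r ** 2) / np.sqrt(n)

With the X/Y and XX/YY data from above this reproduces the
(1.3, -0.5, 0.9827...) and (1.2368..., -0.2895..., 0.9884...) numbers, and
the probable error indeed shrinks from about 0.0116 to about 0.0055 when
the 'boring' points are added - which is exactly the objection: the formula
only sees r and N.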
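Regarding the MATLAB-style p-value: scipy.stats.pearsonr returns r together
with a two-sided p-value for the null hypothesis of no correlation. As far
as I understand it, this is the standard test based on the statistic
t = r*sqrt((N-2)/(1-r**2)) with a Student-t distribution on N-2 degrees of
freedom - a rough sketch of that reading, not a spec:

import numpy as np
from scipy import stats

X = [1., 2, 3, 4]
Y = [1., 2, 3, 5]

r, p = stats.pearsonr(X, Y)            # Pearson r and two-sided p-value

# roughly the same p-value by hand, via the t statistic with N-2 dof
n = len(X)
t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
p_by_hand = 2 * stats.t.sf(t, n - 2)

Note that this p-value answers "is there any correlation at all?", which is
not quite the same thing as an error bar on r itself; if an error range on
r is what is wanted, the usual textbook route is Fisher's z-transform,
z = arctanh(r), whose standard error is approximately 1/sqrt(N-3).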