robert wrote:
> Robert Kern wrote:
>> http://links.jstor.org/sici?sici=0162-1459(192906)24%3A166%3C170%3AFFPEOC%3E2.0.CO%3B2-Y
>
> which tells:
>
>     probable error of r = 0.6745*(1-r**2)/sqrt(N)
>
> A simple function of r and N - roughly what I expected above for the
> N-only dependence. But it is thus not sensitive to the above
> considerations about 'boring' data. With the above example it would
> yield a decrease of this probable coefficient error from
> 0.0115628571429 to 0.00548453410954 !
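For reference, the quoted probable-error formula is straightforward to
evaluate directly. A minimal sketch (the function name is my own, not
from the thread; the 0.6745 is the 75th-percentile point of the standard
normal, which turns a standard error into a 50% "probable" interval):

    from math import sqrt

    def probable_error_r(r, N):
        # probable error of a correlation coefficient r estimated
        # from N samples, per the 1929 formula quoted above
        return 0.6745 * (1.0 - r**2) / sqrt(N)

    # e.g. probable_error_r(0.0, 10000) -> 0.006745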
This 1929 formula for estimating the error of the correlation coefficient
seems to make some sense for r=0. I ran a Monte Carlo, correlating random
series:

>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200):
...     Z=numpy.random.random(10000)
...     l.append( sd.correlation((X,Z))[2] )   # collect coefficients
...
>>> mean(l)
0.000327657082234
>>> std(l)
0.0109120766158        # that's how the coefficient jitters
>>> std(l)/sqrt(len(l))
0.000771600337185
>>> len(l)
200

# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.0067440015079
# vs. M.C.                  0.0109120766158 ± 0.000771600337185

So the fancy factor of 0.6745 clearly undershoots for r=0. (To be fair, a
"probable error" is by definition 0.6745 times the standard error - half
of a normal distribution lies within ±0.6745 sigma - so it is expected to
come out below the M.C. standard deviation.)

Then for a higher (0.5) correlation:

>>> l=[]
>>> for i in range(200):
...     Z=numpy.random.random(10000)+array(range(10000))/10000.0
...     l.append( sd.correlation((X+array(range(10000))/10000.0,Z))[2] )
...
>>> mean(l)
0.498905642552
>>> std(l)
0.00546979583163
>>> std(l)/sqrt(len(l))
0.000386772972425

# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00512173224849
# vs. M.C.                  0.00546979583163 ± 0.000386772972425

=> here the 0.6745 factor and the (1-r**2) term seem to capture the main
effect! There is something in it.

Now adding boring data:

>>> boring=ones(10001)*0.5
>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200):
...     Z=concatenate((numpy.random.random(10000)+array(range(10000))/10000.0,boring))
...     l.append( sd.correlation((concatenate((X+array(range(10000))/10000.0,boring)),Z))[2] )
...
>>> mean(l)
0.712753628489         # r
>>> std(l)
0.00316163649888       # r_err
>>> std(l)/sqrt(len(l))
0.0002235614608

# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00234459971461   # N=20000
# vs. M.C. scatter          0.00316163649888 ± 0.0002235614608

=> the boring data has an effect on the coefficient error which the
formula 0.6745*(1-r**2)/sqrt(N) significantly fails to reflect.

=> So I'll use this formula to get a downside error estimate for the
correlation coefficient:

 ------------------------------------------
| r_err_down ~= 1.0 * (1-r**2)/sqrt(N)    |
 ------------------------------------------

(until I find a better one respecting the actual distribution of the data)

It would be interesting to see what MATLAB & Octave say ...

-robert
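P.S. regarding a "better one respecting the actual distribution of data":
one option that does see repeated boring values is a bootstrap over the
(x, y) pairs. A minimal sketch, assuming plain numpy - numpy.corrcoef
stands in for the thread's sd.correlation, and the resample count 200
(mirroring the M.C. runs above) is an arbitrary choice:

    import numpy

    def bootstrap_r_err(x, y, n_resamples=200):
        # resample (x, y) pairs with replacement and measure how the
        # correlation coefficient scatters; duplicated 'boring' points
        # influence the result, unlike in the closed formula
        x = numpy.asarray(x)
        y = numpy.asarray(y)
        n = len(x)
        rs = []
        for _ in range(n_resamples):
            idx = numpy.random.randint(0, n, n)   # indices with replacement
            rs.append(numpy.corrcoef(x[idx], y[idx])[0, 1])
        return numpy.std(rs)

    # usage sketch:
    #   x = numpy.random.random(10000) + numpy.arange(10000)/10000.0
    #   y = numpy.random.random(10000) + numpy.arange(10000)/10000.0
    #   print(bootstrap_r_err(x, y))   # compare against (1-r**2)/sqrt(N)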