sturlamolden wrote:
> robert wrote:
>
>> here the bootstrap test will as well tell us that the confidence interval
>> narrows down by a factor of ~sqrt(10) - just the same as if there were
>> 10-fold more well-distributed "new" data. Thus this kind of error
>> estimation has no reasonable basis for data which is not very good.
>
> The confidence interval narrows when the amount of independent data
> increases. If you don't understand why, then you lack a basic
> understanding of statistics. In particular, it is a fundamental
> assumption in most statistical models that the data samples are
> "IDENTICALLY AND INDEPENDENTLY DISTRIBUTED", often abbreviated "i.i.d.",
> and it certainly is assumed in this case. If you tell the computer (or
> model) that you have i.i.d. data, it will assume it is i.i.d. data,
> even when it's not. The fundamental law of computer science also
> applies to statistics: shit in = shit out. If you nevertheless provide
> data that are not i.i.d., as you just did, you will simply obtain
> invalid results.
>
> The confidence interval concerns uncertainty about the value of a
> population parameter, not about the spread of your data sample. If you
> collect more INDEPENDENT data, you know more about the population from
> which the data was sampled. The confidence interval has the property
> that it will contain the unknown "true correlation" 95% of the times
> it is generated. Thus if you take two samples WITH INDEPENDENT DATA
> from the same population, one small and one large, the large sample
> will generate a narrower confidence interval. Computer-intensive
> methods like bootstrapping, and asymptotic approximations derived
> analytically, will behave similarly in this respect. However, if you
> are dumb enough to just provide duplications of your data, the
> computer is dumb enough to accept that they were obtained
> statistically independently. In statistical jargon this is called
> "pseudo-sampling", and it is one of the most common fallacies among
> uneducated practitioners.
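
To make that effect concrete in code (a quick NumPy sketch; the toy
data and names are mine, not from the thread):

import numpy as np

rng = np.random.RandomState(0)
n = 50
x = rng.randn(n)
y = x + rng.randn(n)                # a genuinely correlated sample

def r_and_err(x, y):
    r = np.corrcoef(x, y)[0, 1]
    return r, (1 - r**2) / np.sqrt(len(x))   # the bare i.i.d. formula

print(r_and_err(x, y))                             # original sample
print(r_and_err(np.tile(x, 10), np.tile(y, 10)))   # same data, duplicated 10x

r comes out identical, but the reported "error" shrinks by ~sqrt(10),
although the duplicated sample carries no new information.
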
That duplication is just an extreme example to show my need: when I get
the data, there can be an inherent filter/damping or some other kind of
clumping in the data which I don't know of beforehand. My model is
basically linear (it's a preparation step for ranking valuable input
data for a classification task, thus for data reduction); only the
degree of clumping in the data is unknown. Thus the formula for r is
ok, but the bare i.i.d. formula for the error, (1-r**2)/sqrt(n) (or a
bootstrap test, which behaves the same), is blind to that.

> Statistical software doesn't prevent the practitioner from shooting
> himself in the foot; it actually makes it a lot easier. Anyone can
> paste data from Excel into SPSS and hit "ANOVA" in the menu. Whether
> the output makes any sense is a whole other story. One can duplicate
> each sample three or four times, and SPSS would be ignorant of that
> fact. It cannot guess that you are providing it with crappy data, and
> prevent you from screwing up your analysis. The same goes for NumPy
> code. The statistical formulas you type in Python have certain
> assumptions, and when they are violated the output is of no value.
> The more severe the violation, the less valuable the output.
>
>> The interesting task is probably this: to check for linear
>> correlation but "weight clumping of data" somehow for the error
>> estimation.
>
> If you have a pathological data sample, then you need to specify your
> knowledge in greater detail. Can you e.g. formulate a reasonable
> stochastic model for your data, fit the model parameters using the
> data, and then derive the correlation analytically?

No, it's too complex. Or it's just that: additional clumping/fractality
in the data. Linear correlation is supposed, but the x,y data
distribution may have "less than 2 dimensions". There is no better
model. Think of this example: a drunken (x,y) 2D walker is supposed to
walk along a diagonal, but he makes frequent and unpredictable
pauses/stretches of slow motion. You get his x,y coordinates once per
second. His speed and timing pattern do not matter at all - you just
want to know how well he keeps to his track. (My application data is
even worse/black-box; there is not even such a "model".)

> I am beginning to think your problem is ill defined because you lack
> a basic understanding of maths and statistics. For example, it seems
> you were confusing numerical error (rounding and truncation error)
> with statistical sampling error, you don't understand why standard
> errors decrease with sample size, you are testing with pathological
> data, you don't understand the difference between independent data
> and data duplications, etc. You really need to pick up a statistics
> textbook and do some reading, that's my advice.

I think I understand all this very well. It's not on this level. The
problem also has nothing to do with rounding, sampling errors etc. Of
course the error ~1/sqrt(n) is the basic assumption - not something I
don't know, but the very thing I "complain" about :-) (Thus I even
guessed the "dumb" formula for r_err well before I saw it somewhere.
This is all not the real question.) Yet I need a way to _NOT_ just fall
back on that ~1/sqrt(n) for the error when there is unknown clumping in
the data. It has to be something smarter - say, an automatic non-i.i.d.
computation of a reasonable confidence interval/error for the
correlation - in the absence of a final/total model. That's not even an
exceptional application.
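
One direction I can see (a standard device for dependent data, not the
one-shot formula I'm after) is a moving-block bootstrap: resample
contiguous blocks of (x,y) pairs, so that whatever local clumping
exists survives inside each block instead of being destroyed by i.i.d.
resampling. A minimal sketch - the block length is the crux and is
just guessed here:

import numpy as np

def block_bootstrap_r_err(x, y, block_len=20, n_boot=1000, seed=0):
    # Sketch: moving-block bootstrap std. error of r for numpy arrays
    # x, y.  block_len must be chosen roughly at the scale of the
    # clumping - that choice is exactly the hard, model-like part.
    rng = np.random.RandomState(seed)
    n = len(x)
    n_blocks = int(np.ceil(n / float(block_len)))
    starts_max = n - block_len
    rs = np.empty(n_boot)
    for i in range(n_boot):
        starts = rng.randint(0, starts_max + 1, n_blocks)
        idx = np.concatenate([np.arange(s, s + block_len)
                              for s in starts])[:n]
        rs[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    return rs.std()

With a block length at or above the clumping scale, duplicated or
clumped stretches no longer masquerade as independent draws - but
picking block_len without a model is again the real problem.
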
In most measured data created by iso-timestep sampling (thus not
"pathological" so far?), the space of the two interesting variables may
still be walked "non-iso". Think of any time-series data where most of
the data is "boring"/redundant, because the flux of the experiment is
such that interesting things happen only occasionally. In the absence
of a full model for the "whole history", one could try to preprocess
the x,y data by attaching a density weight to each point in order to
make it "non-pathological" before feeding it into the formulas for r
and r_err. Yet this is expensive. Or one could think of computing a
rough fractal dimension and decorating the error like

    fracconst * (1-r**2)/sqrt(n)

The (fast) formula I'm looking for - possibly it doesn't exist - should
do all this in a rush. (A cheap stand-in is sketched in the P.S.
below.)

Robert

-- 
http://mail.python.org/mailman/listinfo/python-list
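
P.S.: One cheap stand-in for such a decoration factor - a standard
time-series trick (effective sample size from the summed
autocorrelation), not a formula from this thread - is to deflate n
before applying the bare formula:

import numpy as np

def r_err_effective_n(x, y, max_lag=50):
    # Sketch: bare error formula with n replaced by
    # n_eff = n / (1 + 2 * sum_k rho_k), the usual "effective sample
    # size" of an autocorrelated series.  rho_k is estimated from the
    # products of the standardized x and y series; max_lag is an
    # arbitrary cutoff - an assumption, like everything here.
    n = len(x)
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    z = xs * ys                      # note: r is just z.mean()
    r = z.mean()
    zc = z - r
    rho_sum = 0.0
    for k in range(1, min(max_lag, n - 1)):
        rho = np.dot(zc[:-k], zc[k:]) / np.dot(zc, zc)
        if rho <= 0:                 # common heuristic: stop at the
            break                    # first non-positive lag
        rho_sum += rho
    n_eff = n / (1.0 + 2.0 * rho_sum)
    return (1 - r**2) / np.sqrt(n_eff)

For i.i.d. data n_eff stays ~n and the bare formula is recovered; for
clumped data n_eff collapses toward the number of effectively
independent points. Whether this counts as the "fast formula" I'm
after, I don't know.
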