> On Mon, Feb 07, 2005 at 05:16:56PM -0500, [EMAIL PROTECTED] wrote:
>> > On Mon, Feb 07, 2005 at 13:28:04 -0500,
>> >
>> > What you are saying here is that if you want more accurate statistics,
>> > you need to sample more rows. That is true. However, the size of the
>> > sample is essentially only dependent on the accuracy you need and not
>> > the size of the population, for large populations.
>> >
>> That's nonsense.
>
> Huh, have you studied any statistics?
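(For what the quoted claim is worth, a quick simulation sketch can illustrate it: for a well-behaved column, the error of a sample-based estimate depends on the sample size, not the population size. The population parameters below are arbitrary numbers I picked for illustration, not anything from analyze.c.)

```python
import random
import statistics

random.seed(42)

def sample_mean_error(pop_size, sample_size, trials=200):
    """Average absolute error of a sample mean over many trials."""
    # A well-behaved (Gaussian) population; mean 100, stddev 15 are arbitrary.
    population = [random.gauss(100, 15) for _ in range(pop_size)]
    true_mean = statistics.fmean(population)
    errors = [
        abs(statistics.fmean(random.sample(population, sample_size)) - true_mean)
        for _ in range(trials)
    ]
    return statistics.fmean(errors)

# Same sample size (500), populations differing by 100x:
# the average estimation error comes out comparable for both.
small = sample_mean_error(10_000, 500)
large = sample_mean_error(1_000_000, 500)
print(f"avg |error|, population 10k: {small:.3f}")
print(f"avg |error|, population 1M:  {large:.3f}")
```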
To what aspects of "statistics" are you referring? I was not a math major, no, but I did have my obligatory classes, along with algorithms and so on. I've only worked in the industry for over 20 years, doing statistical analysis of data on multiple projects, ranging from medical instruments to compression, encryption, and web-based recommendation systems. I assume "Huh, have you studied any statistics?" was a call for qualifications. And yes, a real math major would be helpful in this discussion, because clearly there is a disconnect.

The basic problem with a fixed sample size is that it assumes a normal distribution. If data variation is evenly distributed across a set, then a sample of sufficient size would be valid for almost any data set. That isn't what I'm disputing. If the data variation is NOT uniformly distributed across the data set, the sample size has to be larger, because there is "more" data to characterize.

I think I can explain with a visual. I started my career as an electrical engineer and took an experimental class called "computer science." Sure, it was a long time ago, but bear with me. When you look at a sine wave on an oscilloscope, you can see it clear as day. When you look at music on the scope, you know there are many waves there, but it is difficult to make heads or tails of it. (Use xmms or winamp to see for yourself.) The waves change in frequency, amplitude, and duration over a very large scale. That's why you use a spectrum analyzer to go from the time domain to the frequency domain: in the frequency domain, you can see the trends better.

This is the problem we are having. Currently, the analyze.c code assumes a very regular data set, in which case almost any sample size would work fine. What we see when we use it to analyze complex data sets is that it can't characterize the data well. The solution is either to completely change the statistics model to handle complex and unpredictably changing trends, or to increase the sample size.
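To make the skew point concrete, here is a rough Python sketch (the table size, sample size, and value distributions are made-up numbers for illustration, and the "distinct values seen in sample" measure is a deliberately naive stand-in for what a real estimator does). A fixed-size sample covers a uniform column almost completely, but misses most of the long tail of a skewed column:

```python
import random

random.seed(7)
N = 200_000      # pretend table size (assumption for illustration)
SAMPLE = 3_000   # fixed sample size, regardless of data distribution

# Uniform column: 1000 values, each about equally likely.
uniform = [random.randrange(1000) for _ in range(N)]

# Skewed column: a few very common values plus a long tail of rare ones.
skewed = [
    random.randrange(10) if random.random() < 0.95
    else random.randrange(10, 50_000)
    for _ in range(N)
]

coverage = {}
for name, col in [("uniform", uniform), ("skewed", skewed)]:
    true_d = len(set(col))
    sample_d = len(set(random.sample(col, SAMPLE)))
    coverage[name] = sample_d / true_d
    print(f"{name}: true distinct = {true_d}, "
          f"seen in sample = {sample_d}, coverage = {coverage[name]:.0%}")
```

The uniform column's distinct values are nearly all seen in the sample, while the skewed column's tail is almost entirely invisible at the same sample size, which is the "more data" problem described above.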