This depends on the data size. If both samples fit in memory, a single pass over the two sorted arrays, tracking the largest difference between the empirical CDFs, would suffice.
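For concreteness, here is a minimal sketch of that single-pass computation of the two-sample statistic D_n,m (the class and method names are illustrative, not from Commons Math):

```java
import java.util.Arrays;

public class KsStatistic {

    /**
     * Two-sample K-S statistic D_n,m = sup_t |F_n(t) - G_m(t)|,
     * computed in one merge-style pass over sorted copies of the samples.
     */
    public static double dStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0, j = 0;
        double d = 0.0;
        while (i < sx.length && j < sy.length) {
            double vx = sx[i], vy = sy[j];
            // Advance whichever sample has the smaller next value;
            // on a tie advance both, so the CDFs are compared after both jumps.
            if (vx <= vy) i++;
            if (vy <= vx) j++;
            double diff = Math.abs((double) i / sx.length - (double) j / sy.length);
            if (diff > d) d = diff;
        }
        return d;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0, 4.0};
        double[] b = {2.5, 3.5, 4.5, 5.5};
        System.out.println(dStatistic(a, b)); // prints 0.5
    }
}
```

Once one array is exhausted the remaining values can only shrink the gap, so the loop can stop there.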
If the data doesn't fit, you probably need a storeless quantile estimator like QuantileBin1D from the Colt libraries. Then pick a resolution and do the single-pass search.

Cheers,
-Ajo

On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <phil.ste...@gmail.com> wrote:
> I am working on MATH-437 (turning the K-S distribution into a proper
> K-S test implementation) and have to decide how to implement 2-sample
> tests. Asymptotically, the 2-sample D_n,m test statistic (see [1]) has
> a K-S distribution, so for large samples just using the cdf we already
> have is appropriate. For small samples (actually for any size sample),
> the test statistic distribution is discrete and can be computed
> exactly. A brute-force way to do that is to enumerate all of the n-m
> partitions of {0, ..., n+m-1} and compute all the possible D_n,m
> values. R seems to use a more clever way to do this. Does anyone have
> a reference for an efficient way to compute the exact distribution, or
> background on where R got its implementation?
>
> Absent a "clever" approach, I see three alternatives and would
> appreciate some feedback:
>
> 0) just use the asymptotic distribution even for small samples
> 1) fully enumerate all n-m partitions and compute the D_n,m as above
> 2) use a Monte Carlo approach - instead of full enumeration of the
> D_n,m, randomly generate a large number of splits and compute the
> p-value for the observed D_n,m as the proportion of random n-m splits
> that generate a D value at least as large as what is observed.
>
> Thanks in advance for any feedback / pointers.
>
> Phil
>
> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
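The Monte Carlo alternative in the quoted message can be sketched as a permutation test: pool the two samples, repeatedly shuffle and re-split them n-m, and count splits whose D statistic is at least the observed one. This is an illustrative sketch, not the MATH-437 implementation; the `d` helper here is just a compact single-pass D_n,m computation:

```java
import java.util.Arrays;
import java.util.Random;

public class KsMonteCarlo {

    /** Two-sample D statistic via one merge pass over sorted copies. */
    static double d(double[] x, double[] y) {
        double[] sx = x.clone(), sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0, j = 0;
        double max = 0.0;
        while (i < sx.length && j < sy.length) {
            double vx = sx[i], vy = sy[j];
            if (vx <= vy) i++;          // advance the smaller value; ties
            if (vy <= vx) j++;          // advance both samples at once
            double diff = Math.abs((double) i / sx.length - (double) j / sy.length);
            if (diff > max) max = diff;
        }
        return max;
    }

    /**
     * Monte Carlo p-value: the fraction of random n-m splits of the
     * pooled sample whose D statistic is >= the observed D_n,m.
     */
    static double monteCarloPValue(double[] x, double[] y, int trials, long seed) {
        double observed = d(x, y);
        double[] pooled = new double[x.length + y.length];
        System.arraycopy(x, 0, pooled, 0, x.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        Random rng = new Random(seed);
        int atLeast = 0;
        for (int t = 0; t < trials; t++) {
            // Fisher-Yates shuffle, then treat the first n entries as sample 1.
            for (int k = pooled.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                double tmp = pooled[k];
                pooled[k] = pooled[r];
                pooled[r] = tmp;
            }
            double[] a = Arrays.copyOfRange(pooled, 0, x.length);
            double[] b = Arrays.copyOfRange(pooled, x.length, pooled.length);
            if (d(a, b) >= observed) atLeast++;
        }
        return (double) atLeast / trials;
    }
}
```

Since D is a discrete statistic, the ">=" comparison matters for small samples; with identical samples the observed D is 0, so every split counts and the p-value is 1.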