This will be very useful. Sampling from discrete ECDF's is also closely related to generating samples from a multinomial distribution. I did a bit of work on the latter problem. The result of that work is in
org.apache.mahout.math.random.Multinomial The major difference that you will have is that you have an ordered domain that you are sampling from while Multinomial is simply sampling from a set. It would be relatively easy to use Multinomial<Interval> to do what you want where Interval represents a half open interval (and allows +Inf as a right bound and -Inf as a left bound), but I think that you might gain something from the ability to split an interval that would make Multinomial irrelevant to your needs. With Multinomial<Interval>, one strategy would be to delete an interval and insert both halves which may be a bit more expensive than desired. Doing lots of deletions is also bad in Multinomial because it leaves an entry in place with zero probability (because I was lazy). Trying to mutate the Interval you are splitting so that it forms the left half of the new interval doesn't actually help because modifying the probability of an element just does a deletion and insertion. Better to use the first strategy. It would be very easy to add a garbage collection step that eliminates zero probability entries. As I mentioned, I was just lazy. Anyway, steal concept or code as you feel appropriate. On Mon, Jan 7, 2013 at 8:03 AM, Phil Steitz <phil.ste...@gmail.com> wrote: > The EmpiricalDistribution class in the random package is designed to > support large samples. It does not store all of data points in > memory, but instead bins the data and uses smoothing kernels within > the bins. I have recently had the need for a discrete empirical > distribution - i.e., an implementation that stores the full dataset > in memory and creates the empirical distribution exactly from it. > If there are no objections, I would like to open a JIRA and commit > o.a.c.m.distribution.DiscreteEmpiricalDistribution implementing > this. I will document the approach and design decisions in the JIRA > if others are OK with this addition. > > Phil > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > >