On 7/13/15 12:16 PM, Otmar Ertl wrote: > On Mon, Jul 13, 2015 at 3:51 PM, Gilles <gil...@harfang.homelinux.org> > wrote: >> On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote: >>> On 7/12/15 9:45 PM, Otmar Ertl wrote: >>>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles <gil...@harfang.homelinux.org> >>>> wrote: >>>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote: >>>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote: >>>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote: >>>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote: >>>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote: >>>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote: >>>>>>>>>>> The code implemented in MATH-1242 to improve performance of KS >>>>>>>>>>> monteCarloP in-lines efficient generation of random boolean >>>>>>>>>>> arrays. >>>>>>>>>>> Unfortunately, I think the implementation is not quite random >>>>>>>>>>> (see >>>>>>>>>>> comments on the ticket). To verify it, we need to be able to test >>>>>>>>>>> the random boolean array generation directly. To do that, we have >>>>>>>>>>> to either expose the method (at least as protected) in the KS >>>>>>>>>>> class >>>>>>>>>>> or add it somewhere else. I propose the latter but am not sure >>>>>>>>>>> where to put it. For speed, we need to avoid array copies, so the >>>>>>>>>>> API will have to be something like randomize(boolean[], nTrue). >>>>>>>>>>> It >>>>>>>>>>> could go in the swelling MathArrays class, or > RandomDataGenerator. >>>>>>>>>>> The latter probably makes more sense, but the API does not fit > too >>>>>>>>>>> well. Any ideas? >>>>>>>>>> If it is just for testing purposes, you can also access the method >>>>>>>>>> in >>>>>>>>>> question via reflection, see an example here: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> > http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes >>>>>>>>> Do you think it *should* be a private method of the K-S class? >>>>>>>> Right now, I do not see much uses outside the class, but if we > decide >>>>>>>> to >>>>>>>> make it public then I would prefer a special util class in the > random >>>>>>>> package to avoid cluttering the MathArrays class. >>>>>>> >>>>>>> OK, for now we can make it private and use the reflection method >>>>>>> above to test it. >>>>>> >>>>>> ok, but I guess it is also fine to make it package private as sebb >>>>>> suggested. We did something similar recently for some of the improved >>>>>> sampling methods provided by Otmar. >>>>>> >>>>>>>> Regarding the RandomDataGenerator: I think this class should be >>>>>>>> deprecated and replaced by a Sampler interface as proposed by > Gilles. >>>>>>> >>>>>>> Please consider keeping this class. Consider this a user request. >>>>>>> I have quite a few applications that use this class for two reasons: >>>>>> >>>>>> ok, the reason why I thought the class should be deprecated is because >>>>>> it was not kept up-to-date with all the new discrete and continuous >>>>>> distributions that we added in the last 2-3 years. If you think it is >>>>>> useful, then we can keep it of course. >>>>>> >>>>>>> 1. One object instance tied to one PRNG that generates data from >>>>>>> multiple different distributions. This is convenient. Sure, I >>>>>>> could refactor all of these apps to instantiate new objects for each >>>>>>> type of generated data and hopefully still be able to peg them to >>>>>>> one PRNG; but that is needless work that also complicates the code. >>>>>>> >>>>>>> 2. There are quite a few methods in this class that have nothing to >>>>>>> do with sampling (nextPermutation, nextHexString, nextSecureXxx, >>>>>>> etc) but which conveniently share the RandomGenerator. I guess the >>>>>>> utility methods get moved out somewhere else. Again, I end up >>>>>>> having to refactor all of my code that uses them and when I want >>>>>>> simulations to be based on a single PRNG, I have to find a way to >>>>>>> pass the RandomGenerator around to them. >>>>>>> >>>>>>> I don't yet see the need to refactor the sampling support in the >>>>>>> distributions package; but as my own apps are not impacted by this, >>>>>>> if everyone else sees the user impact of the refactoring as >>>>>>> outweighed by the benefit, I won't stand in the way. Please lets >>>>>>> just keep the RandomDataGenerator convenience class in the random >>>>>>> package in any case. I will take care of whatever adjustments are >>>>>>> needed to adapt to whatever we settle on for sampling in the >>>>>>> distributions package. >>>>>> >>>>>> Well, it is not really necessary to do everything together and > refactor >>>>>> the distributions. >>>>>> >>>>>> Probably it is better to start the other way round, and describe what > I >>>>>> want to add, and see how other things fit in: >>>>>> >>>>>> * I want a generic Sampler interface, i.e. something like this: >>>>>> ** nextSample() >>>>>> ** nextSamples(int size) >>>>>> ** nextSamples(double[] samples) >>>>> >>>>> +1 >>>>> > Do we really need all those 3 methods? If the functionality provided by the > two latter methods is essential for convenience reasons, we could also > offer utility functions that are able to fill an array with random numbers > from a given sampler, e.g. MathArrays.fill(array, sampler) or > array=MathArrays.generate(sampler, size).
In some cases, there may be distribution-specific algorithms for generating sequences of values that are more efficient than just calling sample() repeatedly. So while providing a default impl in an abstract sampler that just calls sample() repeatedly would make sense; it might also make sense to allow the distribution to override with something more efficient. An example (not exposed now) is sampling from a Gaussian distribution. You can see that under the covers, the BitStreamGenerator generates these in pairs. > >>>>>> there could be a DiscreteSampler and ContinuousSampler interface to >>>>>> handle the cases for int / double values. >>>>> >>>>> Perhaps the name should be IntegerSampler and DoubleSampler, to >>>>> accomodate >>>>> future needs (LongSampler, BooleanSampler (?)). >>>> I would prefer consistent names for distributions and corresponding >>>> samplers. The support of distributions must be of the same data type >>>> as the return values of the corresponding sampler. Therefore, I would >>>> call the samplers for RealDistribution and IntegerDistribution >>>> RealSampler and IntegerSampler, respectively. >> >> Ideally yes. But there is always the lingering question of what "Real" >> means: the mathematical abstraction or the numerical representation? >> The same "real" distributions could be implemented with "float"s. >> If we ever need implementing a discrete distribution that is able to >> use the "long" range, the "Integer" in "IntegerSampler" will clearly >> refer to the numerical type, not the mathematical abstraction of an >> "integer" number... >> >> Gilles >> >> > The underlying data type of RealDistribution implementations is implicitly > defined to be "double". It would not make sense that methods such as > getSupportLowerBound() return a different type than the associated sample > method. Therefore, in the context of RealDistribution "real" means "double" > at the moment. However, I am open for renaming RealDistribution as > DoubleDistribution with a sampler named DoubleSampler. If a float-based > implementation is really ever needed, interfaces named FloatDistribution > and FloatSampler could be introduced. Let's not change the names just to change them. There is discussion in the archives that led us to call what is traditionally referred to as continuous, "Real" - mostly because we support some not absolutely continuous distribution functions among the non-discrete distributions. We call discrete "Integer" because we force all discrete domains to map into the integers. In the 10+ years we have offered both, we have never had a request to either support floats as representations of reals or any discrete distribution that requires the full Long value set. Phil >>> +1 >>> >>> Phil >>>> >>>>>> The distributions could either be changed to return such a sampler as >>>>>> Gilles proposed (with the advantage that no random instance is tied to >>>>>> the distribution itself), or implement the interface directly (with > the >>>>>> advantage that we would not need to refactor too much). >>>>> >>>>> Unless I'm missing something, the refactoring would be fairly the same: >>>>> The latter case needs implementing 3 methods (2 new ones, one with a >>>>> name >>>>> change). >>>>> The former needs implementing the factory method proposed in MATH-1158, >>>>> plus the same methods as above (wrapped in the object returned by the >>>>> factory method). >>>>> >>>>> >>>>> Gilles >>>>> >>>>> >>>>> >>>>>>>> One can then create a sampler for any distribution or from other >>>>>>>> sources, e.g. when needing a fast and efficient sampler without >>>>>>>> replacement (see MATH-1239). >>>>>>> >>>>>>> +1 for sequential sampling. I don't follow exactly why that >>>>>>> requires refactoring the distributions; but if it helps in a way I >>>>>>> don't yet understand, that will help convince me that refactoring >>>>>>> sampling in the distributions package is worth the user pain. >>>>>> >>>>>> as I said above, I wanted to combine two things in one step, maybe it >>>>>> is >>>>>> better to go step by step. >>>>>> >>>>>> Thomas >>>>> >>>>> >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>>> >>>> Otmar >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>> >>>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>> For additional commands, e-mail: dev-h...@commons.apache.org >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >> For additional commands, e-mail: dev-h...@commons.apache.org >> > Otmar > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org