On Mon, Jul 13, 2015 at 9:59 PM, Phil Steitz <phil.ste...@gmail.com> wrote: > On 7/13/15 12:55 PM, Phil Steitz wrote: >> On 7/13/15 12:16 PM, Otmar Ertl wrote: >>> On Mon, Jul 13, 2015 at 3:51 PM, Gilles <gil...@harfang.homelinux.org> >>> wrote: >>>> On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote: >>>>> On 7/12/15 9:45 PM, Otmar Ertl wrote: >>>>>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles <gil...@harfang.homelinux.org> >>>>>> wrote: >>>>>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote: >>>>>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote: >>>>>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote: >>>>>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote: >>>>>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote: >>>>>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote: >>>>>>>>>>>>> The code implemented in MATH-1242 to improve performance of KS >>>>>>>>>>>>> monteCarloP in-lines efficient generation of random boolean >>>>>>>>>>>>> arrays. >>>>>>>>>>>>> Unfortunately, I think the implementation is not quite random >>>>>>>>>>>>> (see >>>>>>>>>>>>> comments on the ticket). To verify it, we need to be able to test >>>>>>>>>>>>> the random boolean array generation directly. To do that, we have >>>>>>>>>>>>> to either expose the method (at least as protected) in the KS >>>>>>>>>>>>> class >>>>>>>>>>>>> or add it somewhere else. I propose the latter but am not sure >>>>>>>>>>>>> where to put it. For speed, we need to avoid array copies, so the >>>>>>>>>>>>> API will have to be something like randomize(boolean[], nTrue). >>>>>>>>>>>>> It >>>>>>>>>>>>> could go in the swelling MathArrays class, or >>> RandomDataGenerator. >>>>>>>>>>>>> The latter probably makes more sense, but the API does not fit >>> too >>>>>>>>>>>>> well. Any ideas? >>>>>>>>>>>> If it is just for testing purposes, you can also access the method >>>>>>>>>>>> in >>>>>>>>>>>> question via reflection, see an example here: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>> http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes >>>>>>>>>>> Do you think it *should* be a private method of the K-S class? >>>>>>>>>> Right now, I do not see much uses outside the class, but if we >>> decide >>>>>>>>>> to >>>>>>>>>> make it public then I would prefer a special util class in the >>> random >>>>>>>>>> package to avoid cluttering the MathArrays class. >>>>>>>>> OK, for now we can make it private and use the reflection method >>>>>>>>> above to test it. >>>>>>>> ok, but I guess it is also fine to make it package private as sebb >>>>>>>> suggested. We did something similar recently for some of the improved >>>>>>>> sampling methods provided by Otmar. >>>>>>>> >>>>>>>>>> Regarding the RandomDataGenerator: I think this class should be >>>>>>>>>> deprecated and replaced by a Sampler interface as proposed by >>> Gilles. >>>>>>>>> Please consider keeping this class. Consider this a user request. >>>>>>>>> I have quite a few applications that use this class for two reasons: >>>>>>>> ok, the reason why I thought the class should be deprecated is because >>>>>>>> it was not kept up-to-date with all the new discrete and continuous >>>>>>>> distributions that we added in the last 2-3 years. If you think it is >>>>>>>> useful, then we can keep it of course. >>>>>>>> >>>>>>>>> 1. One object instance tied to one PRNG that generates data from >>>>>>>>> multiple different distributions. This is convenient. Sure, I >>>>>>>>> could refactor all of these apps to instantiate new objects for each >>>>>>>>> type of generated data and hopefully still be able to peg them to >>>>>>>>> one PRNG; but that is needless work that also complicates the code. >>>>>>>>> >>>>>>>>> 2. There are quite a few methods in this class that have nothing to >>>>>>>>> do with sampling (nextPermutation, nextHexString, nextSecureXxx, >>>>>>>>> etc) but which conveniently share the RandomGenerator. I guess the >>>>>>>>> utility methods get moved out somewhere else. Again, I end up >>>>>>>>> having to refactor all of my code that uses them and when I want >>>>>>>>> simulations to be based on a single PRNG, I have to find a way to >>>>>>>>> pass the RandomGenerator around to them. >>>>>>>>> >>>>>>>>> I don't yet see the need to refactor the sampling support in the >>>>>>>>> distributions package; but as my own apps are not impacted by this, >>>>>>>>> if everyone else sees the user impact of the refactoring as >>>>>>>>> outweighed by the benefit, I won't stand in the way. Please lets >>>>>>>>> just keep the RandomDataGenerator convenience class in the random >>>>>>>>> package in any case. I will take care of whatever adjustments are >>>>>>>>> needed to adapt to whatever we settle on for sampling in the >>>>>>>>> distributions package. >>>>>>>> Well, it is not really necessary to do everything together and >>> refactor >>>>>>>> the distributions. >>>>>>>> >>>>>>>> Probably it is better to start the other way round, and describe what >>> I >>>>>>>> want to add, and see how other things fit in: >>>>>>>> >>>>>>>> * I want a generic Sampler interface, i.e. something like this: >>>>>>>> ** nextSample() >>>>>>>> ** nextSamples(int size) >>>>>>>> ** nextSamples(double[] samples) >>>>>>> +1 >>>>>>> >>> Do we really need all those 3 methods? If the functionality provided by the >>> two latter methods is essential for convenience reasons, we could also >>> offer utility functions that are able to fill an array with random numbers >>> from a given sampler, e.g. MathArrays.fill(array, sampler) or >>> array=MathArrays.generate(sampler, size). >> In some cases, there may be distribution-specific algorithms for >> generating sequences of values that are more efficient than just >> calling sample() repeatedly. So while providing a default impl in >> an abstract sampler that just calls sample() repeatedly would make >> sense; it might also make sense to allow the distribution to >> override with something more efficient. An example (not exposed >> now) is sampling from a Gaussian distribution. You can see that >> under the covers, the BitStreamGenerator generates these in pairs.
The simultaneous generation of multiple random values, as done by some algorithms (I only know the Box-Muller method), does not justify the extra interface methods. With negligible overhead the extra values can be cached and reused for subsequent calls. I prefer to keep interfaces as simple as possible. Otmar >>>>>>>> there could be a DiscreteSampler and ContinuousSampler interface to >>>>>>>> handle the cases for int / double values. >>>>>>> Perhaps the name should be IntegerSampler and DoubleSampler, to >>>>>>> accomodate >>>>>>> future needs (LongSampler, BooleanSampler (?)). >>>>>> I would prefer consistent names for distributions and corresponding >>>>>> samplers. The support of distributions must be of the same data type >>>>>> as the return values of the corresponding sampler. Therefore, I would >>>>>> call the samplers for RealDistribution and IntegerDistribution >>>>>> RealSampler and IntegerSampler, respectively. >>>> Ideally yes. But there is always the lingering question of what "Real" >>>> means: the mathematical abstraction or the numerical representation? >>>> The same "real" distributions could be implemented with "float"s. >>>> If we ever need implementing a discrete distribution that is able to >>>> use the "long" range, the "Integer" in "IntegerSampler" will clearly >>>> refer to the numerical type, not the mathematical abstraction of an >>>> "integer" number... >>>> >>>> Gilles >>>> >>>> >>> The underlying data type of RealDistribution implementations is implicitly >>> defined to be "double". It would not make sense that methods such as >>> getSupportLowerBound() return a different type than the associated sample >>> method. Therefore, in the context of RealDistribution "real" means "double" >>> at the moment. However, I am open for renaming RealDistribution as >>> DoubleDistribution with a sampler named DoubleSampler. If a float-based >>> implementation is really ever needed, interfaces named FloatDistribution >>> and FloatSampler could be introduced. >> Let's not change the names just to change them. There is discussion >> in the archives that led us to call what is traditionally referred >> to as continuous, "Real" - mostly because we support some not >> absolutely continuous distribution functions among the non-discrete >> distributions. We call discrete "Integer" because we force all >> discrete domains to map into the integers. In the 10+ years we have >> offered both, we have never had a request to either support floats >> as representations of reals or any discrete distribution that >> requires the full Long value set. > More precisely the full Long value set as actual parameters to > distribution functions. We already support distributions with > infinite value sets. > > Phil >> >> Phil >>>>> +1 >>>>> >>>>> Phil >>>>>>>> The distributions could either be changed to return such a sampler as >>>>>>>> Gilles proposed (with the advantage that no random instance is tied to >>>>>>>> the distribution itself), or implement the interface directly (with >>> the >>>>>>>> advantage that we would not need to refactor too much). >>>>>>> Unless I'm missing something, the refactoring would be fairly the same: >>>>>>> The latter case needs implementing 3 methods (2 new ones, one with a >>>>>>> name >>>>>>> change). >>>>>>> The former needs implementing the factory method proposed in MATH-1158, >>>>>>> plus the same methods as above (wrapped in the object returned by the >>>>>>> factory method). >>>>>>> >>>>>>> >>>>>>> Gilles >>>>>>> >>>>>>> >>>>>>> >>>>>>>>>> One can then create a sampler for any distribution or from other >>>>>>>>>> sources, e.g. when needing a fast and efficient sampler without >>>>>>>>>> replacement (see MATH-1239). >>>>>>>>> +1 for sequential sampling. I don't follow exactly why that >>>>>>>>> requires refactoring the distributions; but if it helps in a way I >>>>>>>>> don't yet understand, that will help convince me that refactoring >>>>>>>>> sampling in the distributions package is worth the user pain. >>>>>>>> as I said above, I wanted to combine two things in one step, maybe it >>>>>>>> is >>>>>>>> better to go step by step. >>>>>>>> >>>>>>>> Thomas >>>>>>> >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>>>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>>>>> >>>>>> Otmar >>>>>> >>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>>>> >>>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>> >>> Otmar >>> >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org