On 7/13/15 12:55 PM, Phil Steitz wrote: > On 7/13/15 12:16 PM, Otmar Ertl wrote: >> On Mon, Jul 13, 2015 at 3:51 PM, Gilles <gil...@harfang.homelinux.org> >> wrote: >>> On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote: >>>> On 7/12/15 9:45 PM, Otmar Ertl wrote: >>>>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles <gil...@harfang.homelinux.org> >>>>> wrote: >>>>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote: >>>>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote: >>>>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote: >>>>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote: >>>>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote: >>>>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote: >>>>>>>>>>>> The code implemented in MATH-1242 to improve performance of KS >>>>>>>>>>>> monteCarloP in-lines efficient generation of random boolean >>>>>>>>>>>> arrays. >>>>>>>>>>>> Unfortunately, I think the implementation is not quite random >>>>>>>>>>>> (see >>>>>>>>>>>> comments on the ticket). To verify it, we need to be able to test >>>>>>>>>>>> the random boolean array generation directly. To do that, we have >>>>>>>>>>>> to either expose the method (at least as protected) in the KS >>>>>>>>>>>> class >>>>>>>>>>>> or add it somewhere else. I propose the latter but am not sure >>>>>>>>>>>> where to put it. For speed, we need to avoid array copies, so the >>>>>>>>>>>> API will have to be something like randomize(boolean[], nTrue). >>>>>>>>>>>> It >>>>>>>>>>>> could go in the swelling MathArrays class, or >> RandomDataGenerator. >>>>>>>>>>>> The latter probably makes more sense, but the API does not fit >> too >>>>>>>>>>>> well. Any ideas? >>>>>>>>>>> If it is just for testing purposes, you can also access the method >>>>>>>>>>> in >>>>>>>>>>> question via reflection, see an example here: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >> http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes >>>>>>>>>> Do you think it *should* be a private method of the K-S class? >>>>>>>>> Right now, I do not see much uses outside the class, but if we >> decide >>>>>>>>> to >>>>>>>>> make it public then I would prefer a special util class in the >> random >>>>>>>>> package to avoid cluttering the MathArrays class. >>>>>>>> OK, for now we can make it private and use the reflection method >>>>>>>> above to test it. >>>>>>> ok, but I guess it is also fine to make it package private as sebb >>>>>>> suggested. We did something similar recently for some of the improved >>>>>>> sampling methods provided by Otmar. >>>>>>> >>>>>>>>> Regarding the RandomDataGenerator: I think this class should be >>>>>>>>> deprecated and replaced by a Sampler interface as proposed by >> Gilles. >>>>>>>> Please consider keeping this class. Consider this a user request. >>>>>>>> I have quite a few applications that use this class for two reasons: >>>>>>> ok, the reason why I thought the class should be deprecated is because >>>>>>> it was not kept up-to-date with all the new discrete and continuous >>>>>>> distributions that we added in the last 2-3 years. If you think it is >>>>>>> useful, then we can keep it of course. >>>>>>> >>>>>>>> 1. One object instance tied to one PRNG that generates data from >>>>>>>> multiple different distributions. This is convenient. Sure, I >>>>>>>> could refactor all of these apps to instantiate new objects for each >>>>>>>> type of generated data and hopefully still be able to peg them to >>>>>>>> one PRNG; but that is needless work that also complicates the code. >>>>>>>> >>>>>>>> 2. There are quite a few methods in this class that have nothing to >>>>>>>> do with sampling (nextPermutation, nextHexString, nextSecureXxx, >>>>>>>> etc) but which conveniently share the RandomGenerator. I guess the >>>>>>>> utility methods get moved out somewhere else. Again, I end up >>>>>>>> having to refactor all of my code that uses them and when I want >>>>>>>> simulations to be based on a single PRNG, I have to find a way to >>>>>>>> pass the RandomGenerator around to them. >>>>>>>> >>>>>>>> I don't yet see the need to refactor the sampling support in the >>>>>>>> distributions package; but as my own apps are not impacted by this, >>>>>>>> if everyone else sees the user impact of the refactoring as >>>>>>>> outweighed by the benefit, I won't stand in the way. Please lets >>>>>>>> just keep the RandomDataGenerator convenience class in the random >>>>>>>> package in any case. I will take care of whatever adjustments are >>>>>>>> needed to adapt to whatever we settle on for sampling in the >>>>>>>> distributions package. >>>>>>> Well, it is not really necessary to do everything together and >> refactor >>>>>>> the distributions. >>>>>>> >>>>>>> Probably it is better to start the other way round, and describe what >> I >>>>>>> want to add, and see how other things fit in: >>>>>>> >>>>>>> * I want a generic Sampler interface, i.e. something like this: >>>>>>> ** nextSample() >>>>>>> ** nextSamples(int size) >>>>>>> ** nextSamples(double[] samples) >>>>>> +1 >>>>>> >> Do we really need all those 3 methods? If the functionality provided by the >> two latter methods is essential for convenience reasons, we could also >> offer utility functions that are able to fill an array with random numbers >> from a given sampler, e.g. MathArrays.fill(array, sampler) or >> array=MathArrays.generate(sampler, size). > In some cases, there may be distribution-specific algorithms for > generating sequences of values that are more efficient than just > calling sample() repeatedly. So while providing a default impl in > an abstract sampler that just calls sample() repeatedly would make > sense; it might also make sense to allow the distribution to > override with something more efficient. An example (not exposed > now) is sampling from a Gaussian distribution. You can see that > under the covers, the BitStreamGenerator generates these in pairs. >>>>>>> there could be a DiscreteSampler and ContinuousSampler interface to >>>>>>> handle the cases for int / double values. >>>>>> Perhaps the name should be IntegerSampler and DoubleSampler, to >>>>>> accomodate >>>>>> future needs (LongSampler, BooleanSampler (?)). >>>>> I would prefer consistent names for distributions and corresponding >>>>> samplers. The support of distributions must be of the same data type >>>>> as the return values of the corresponding sampler. Therefore, I would >>>>> call the samplers for RealDistribution and IntegerDistribution >>>>> RealSampler and IntegerSampler, respectively. >>> Ideally yes. But there is always the lingering question of what "Real" >>> means: the mathematical abstraction or the numerical representation? >>> The same "real" distributions could be implemented with "float"s. >>> If we ever need implementing a discrete distribution that is able to >>> use the "long" range, the "Integer" in "IntegerSampler" will clearly >>> refer to the numerical type, not the mathematical abstraction of an >>> "integer" number... >>> >>> Gilles >>> >>> >> The underlying data type of RealDistribution implementations is implicitly >> defined to be "double". It would not make sense that methods such as >> getSupportLowerBound() return a different type than the associated sample >> method. Therefore, in the context of RealDistribution "real" means "double" >> at the moment. However, I am open for renaming RealDistribution as >> DoubleDistribution with a sampler named DoubleSampler. If a float-based >> implementation is really ever needed, interfaces named FloatDistribution >> and FloatSampler could be introduced. > Let's not change the names just to change them. There is discussion > in the archives that led us to call what is traditionally referred > to as continuous, "Real" - mostly because we support some not > absolutely continuous distribution functions among the non-discrete > distributions. We call discrete "Integer" because we force all > discrete domains to map into the integers. In the 10+ years we have > offered both, we have never had a request to either support floats > as representations of reals or any discrete distribution that > requires the full Long value set. More precisely the full Long value set as actual parameters to distribution functions. We already support distributions with infinite value sets.
Phil > > Phil >>>> +1 >>>> >>>> Phil >>>>>>> The distributions could either be changed to return such a sampler as >>>>>>> Gilles proposed (with the advantage that no random instance is tied to >>>>>>> the distribution itself), or implement the interface directly (with >> the >>>>>>> advantage that we would not need to refactor too much). >>>>>> Unless I'm missing something, the refactoring would be fairly the same: >>>>>> The latter case needs implementing 3 methods (2 new ones, one with a >>>>>> name >>>>>> change). >>>>>> The former needs implementing the factory method proposed in MATH-1158, >>>>>> plus the same methods as above (wrapped in the object returned by the >>>>>> factory method). >>>>>> >>>>>> >>>>>> Gilles >>>>>> >>>>>> >>>>>> >>>>>>>>> One can then create a sampler for any distribution or from other >>>>>>>>> sources, e.g. when needing a fast and efficient sampler without >>>>>>>>> replacement (see MATH-1239). >>>>>>>> +1 for sequential sampling. I don't follow exactly why that >>>>>>>> requires refactoring the distributions; but if it helps in a way I >>>>>>>> don't yet understand, that will help convince me that refactoring >>>>>>>> sampling in the distributions package is worth the user pain. >>>>>>> as I said above, I wanted to combine two things in one step, maybe it >>>>>>> is >>>>>>> better to go step by step. >>>>>>> >>>>>>> Thomas >>>>>> >>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>>>> >>>>> Otmar >>>>> >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>>> For additional commands, e-mail: dev-h...@commons.apache.org >>>>> >>>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>>> For additional commands, e-mail: dev-h...@commons.apache.org >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org >>> For additional commands, e-mail: dev-h...@commons.apache.org >>> >> Otmar >> > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org