Re: [math] random boolean arrays

Phil Steitz Mon, 13 Jul 2015 13:01:41 -0700

On 7/13/15 12:55 PM, Phil Steitz wrote:
> On 7/13/15 12:16 PM, Otmar Ertl wrote:
>> On Mon, Jul 13, 2015 at 3:51 PM, Gilles <gil...@harfang.homelinux.org>
>> wrote:
>>> On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote:
>>>> On 7/12/15 9:45 PM, Otmar Ertl wrote:
>>>>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles <gil...@harfang.homelinux.org>
>>>>> wrote:
>>>>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote:
>>>>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote:
>>>>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote:
>>>>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote:
>>>>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote:
>>>>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote:
>>>>>>>>>>>> The code implemented in MATH-1242 to improve performance of KS
>>>>>>>>>>>> monteCarloP in-lines efficient generation of random boolean
>>>>>>>>>>>> arrays.
>>>>>>>>>>>> Unfortunately, I think the implementation is not quite random
>>>>>>>>>>>> (see
>>>>>>>>>>>> comments on the ticket). To verify it, we need to be able to test
>>>>>>>>>>>> the random boolean array generation directly. To do that, we have
>>>>>>>>>>>> to either expose the method (at least as protected) in the KS
>>>>>>>>>>>> class
>>>>>>>>>>>> or add it somewhere else. I propose the latter but am not sure
>>>>>>>>>>>> where to put it. For speed, we need to avoid array copies, so the
>>>>>>>>>>>> API will have to be something like randomize(boolean[], nTrue).
>>>>>>>>>>>> It
>>>>>>>>>>>> could go in the swelling MathArrays class, or
>> RandomDataGenerator.
>>>>>>>>>>>> The latter probably makes more sense, but the API does not fit
>> too
>>>>>>>>>>>> well. Any ideas?
>>>>>>>>>>> If it is just for testing purposes, you can also access the method
>>>>>>>>>>> in
>>>>>>>>>>> question via reflection, see an example here:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>> http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes
>>>>>>>>>> Do you think it *should* be a private method of the K-S class?
>>>>>>>>> Right now, I do not see much uses outside the class, but if we
>> decide
>>>>>>>>> to
>>>>>>>>> make it public then I would prefer a special util class in the
>> random
>>>>>>>>> package to avoid cluttering the MathArrays class.
>>>>>>>> OK, for now we can make it private and use the reflection method
>>>>>>>> above to test it.
>>>>>>> ok, but I guess it is also fine to make it package private as sebb
>>>>>>> suggested. We did something similar recently for some of the improved
>>>>>>> sampling methods provided by Otmar.
>>>>>>>
>>>>>>>>> Regarding the RandomDataGenerator: I think this class should be
>>>>>>>>> deprecated and replaced by a Sampler interface as proposed by
>> Gilles.
>>>>>>>> Please consider keeping this class. Consider this a user request.
>>>>>>>> I have quite a few applications that use this class for two reasons:
>>>>>>> ok, the reason why I thought the class should be deprecated is because
>>>>>>> it was not kept up-to-date with all the new discrete and continuous
>>>>>>> distributions that we added in the last 2-3 years. If you think it is
>>>>>>> useful, then we can keep it of course.
>>>>>>>
>>>>>>>> 1. One object instance tied to one PRNG that generates data from
>>>>>>>> multiple different distributions. This is convenient. Sure, I
>>>>>>>> could refactor all of these apps to instantiate new objects for each
>>>>>>>> type of generated data and hopefully still be able to peg them to
>>>>>>>> one PRNG; but that is needless work that also complicates the code.
>>>>>>>>
>>>>>>>> 2. There are quite a few methods in this class that have nothing to
>>>>>>>> do with sampling (nextPermutation, nextHexString, nextSecureXxx,
>>>>>>>> etc) but which conveniently share the RandomGenerator. I guess the
>>>>>>>> utility methods get moved out somewhere else. Again, I end up
>>>>>>>> having to refactor all of my code that uses them and when I want
>>>>>>>> simulations to be based on a single PRNG, I have to find a way to
>>>>>>>> pass the RandomGenerator around to them.
>>>>>>>>
>>>>>>>> I don't yet see the need to refactor the sampling support in the
>>>>>>>> distributions package; but as my own apps are not impacted by this,
>>>>>>>> if everyone else sees the user impact of the refactoring as
>>>>>>>> outweighed by the benefit, I won't stand in the way. Please lets
>>>>>>>> just keep the RandomDataGenerator convenience class in the random
>>>>>>>> package in any case. I will take care of whatever adjustments are
>>>>>>>> needed to adapt to whatever we settle on for sampling in the
>>>>>>>> distributions package.
>>>>>>> Well, it is not really necessary to do everything together and
>> refactor
>>>>>>> the distributions.
>>>>>>>
>>>>>>> Probably it is better to start the other way round, and describe what
>> I
>>>>>>> want to add, and see how other things fit in:
>>>>>>>
>>>>>>> * I want a generic Sampler interface, i.e. something like this:
>>>>>>> ** nextSample()
>>>>>>> ** nextSamples(int size)
>>>>>>> ** nextSamples(double[] samples)
>>>>>> +1
>>>>>>
>> Do we really need all those 3 methods? If the functionality provided by the
>> two latter methods is essential for convenience reasons, we could also
>> offer utility functions that are able to fill an array with random numbers
>> from a given sampler, e.g. MathArrays.fill(array, sampler) or
>> array=MathArrays.generate(sampler, size).
> In some cases, there may be distribution-specific algorithms for
> generating sequences of values that are more efficient than just
> calling sample() repeatedly.  So while providing a default impl in
> an abstract sampler that just calls sample() repeatedly would make
> sense; it might also make sense to allow the distribution to
> override with something more efficient.   An example (not exposed
> now) is sampling from a Gaussian distribution.  You can see that
> under the covers, the BitStreamGenerator generates these in pairs.
>>>>>>> there could be a DiscreteSampler and ContinuousSampler interface to
>>>>>>> handle the cases for int / double values.
>>>>>> Perhaps the name should be IntegerSampler and DoubleSampler, to
>>>>>> accomodate
>>>>>> future needs (LongSampler, BooleanSampler (?)).
>>>>> I would prefer consistent names for distributions and corresponding
>>>>> samplers. The support of distributions must be of the same data type
>>>>> as the return values of the corresponding sampler. Therefore, I would
>>>>> call the samplers for RealDistribution and IntegerDistribution
>>>>> RealSampler and IntegerSampler, respectively.
>>> Ideally yes. But there is always the lingering question of what "Real"
>>> means: the mathematical abstraction or the numerical representation?
>>> The same "real" distributions could be implemented with "float"s.
>>> If we ever need implementing a discrete distribution that is able to
>>> use the "long" range, the "Integer" in "IntegerSampler" will clearly
>>> refer to the numerical type, not the mathematical abstraction of an
>>> "integer" number...
>>>
>>> Gilles
>>>
>>>
>> The underlying data type of RealDistribution implementations is implicitly
>> defined to be "double". It would not make sense that methods such as
>> getSupportLowerBound() return a different type than the associated sample
>> method. Therefore, in the context of RealDistribution "real" means "double"
>> at the moment. However, I am open for renaming RealDistribution as
>> DoubleDistribution with a sampler named DoubleSampler. If a float-based
>> implementation is really ever needed, interfaces named FloatDistribution
>> and FloatSampler could be introduced.
> Let's not change the names just to change them.  There is discussion
> in the archives that led us to call what is traditionally referred
> to as continuous, "Real" - mostly because we support some not
> absolutely continuous distribution functions among the non-discrete
> distributions.  We call discrete "Integer" because we force all
> discrete domains to map into the integers.  In the 10+ years we have
> offered both, we have never had a request to either support floats
> as representations of reals or any discrete distribution that
> requires the full Long value set.
More precisely the full Long value set as actual parameters to
distribution functions.  We already support distributions with
infinite value sets.


Phil
>
> Phil
>>>> +1
>>>>
>>>> Phil
>>>>>>> The distributions could either be changed to return such a sampler as
>>>>>>> Gilles proposed (with the advantage that no random instance is tied to
>>>>>>> the distribution itself), or implement the interface directly (with
>> the
>>>>>>> advantage that we would not need to refactor too much).
>>>>>> Unless I'm missing something, the refactoring would be fairly the same:
>>>>>> The latter case needs implementing 3 methods (2 new ones, one with a
>>>>>> name
>>>>>> change).
>>>>>> The former needs implementing the factory method proposed in MATH-1158,
>>>>>> plus the same methods as above (wrapped in the object returned by the
>>>>>> factory method).
>>>>>>
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>> One can then create a sampler for any distribution or from other
>>>>>>>>> sources, e.g. when needing a fast and efficient sampler without
>>>>>>>>> replacement (see MATH-1239).
>>>>>>>> +1 for sequential sampling. I don't follow exactly why that
>>>>>>>> requires refactoring the distributions; but if it helps in a way I
>>>>>>>> don't yet understand, that will help convince me that refactoring
>>>>>>>> sampling in the distributions package is worth the user pain.
>>>>>>> as I said above, I wanted to combine two things in one step, maybe it
>>>>>>> is
>>>>>>> better to go step by step.
>>>>>>>
>>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>>
>>>>> Otmar
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>
>> Otmar
>>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] random boolean arrays

Reply via email to