On 07/12/2015 04:58 PM, Phil Steitz wrote:
> On 7/12/15 2:50 AM, Thomas Neidhart wrote:
>> On 07/11/2015 09:43 PM, Phil Steitz wrote:
>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote:
>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote:
>>>>> The code implemented in MATH-1242 to improve performance of KS
>>>>> monteCarloP in-lines efficient generation of random boolean arrays.
>>>>>   Unfortunately, I think the implementation is not quite random (see
>>>>> comments on the ticket).  To verify it, we need to be able to test
>>>>> the random boolean array generation directly.  To do that, we have
>>>>> to either expose the method (at least as protected) in the KS class
>>>>> or add it somewhere else.  I propose the latter but am not sure
>>>>> where to put it.  For speed, we need to avoid array copies, so the
>>>>> API will have to be something like randomize(boolean[], nTrue).  It
>>>>> could go in the swelling MathArrays class, or RandomDataGenerator. 
>>>>> The latter probably makes more sense, but the API does not fit too
>>>>> well.  Any ideas?
>>>> If it is just for testing purposes, you can also access the method in
>>>> question via reflection, see an example here:
>>>> http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes
>>> Do you think it *should* be a private method of the K-S class?
>> Right now, I do not see much uses outside the class, but if we decide to
>> make it public then I would prefer a special util class in the random
>> package to avoid cluttering the MathArrays class.
> 
> OK, for now we can make it private and use the reflection method
> above to test it.

ok, but I guess it is also fine to make it package private as sebb
suggested. We did something similar recently for some of the improved
sampling methods provided by Otmar.

>> Regarding the RandomDataGenerator: I think this class should be
>> deprecated and replaced by a Sampler interface as proposed by Gilles.
> 
> Please consider keeping this class.  Consider this a user request. 
> I have quite a few applications that use this class for two reasons:

ok, the reason why I thought the class should be deprecated is because
it was not kept up-to-date with all the new discrete and continuous
distributions that we added in the last 2-3 years. If you think it is
useful, then we can keep it of course.

> 1.  One object instance tied to one PRNG that generates data from
> multiple different distributions.   This is convenient.   Sure, I
> could refactor all of these apps to instantiate new objects for each
> type of generated data and hopefully still be able to peg them to
> one PRNG; but that is needless work that also complicates the code.
> 
> 2.  There are quite a few methods in this class that have nothing to
> do with sampling (nextPermutation, nextHexString, nextSecureXxx,
> etc) but which conveniently share the RandomGenerator.  I guess the
> utility methods get moved out somewhere else.  Again, I end up
> having to refactor all of my code that uses them and when I want
> simulations to be based on a single PRNG, I have to find a way to
> pass the RandomGenerator around to them.
> 
> I don't yet see the need to refactor the sampling support in the
> distributions package; but as my own apps are not impacted by this,
> if everyone else sees the user impact of the refactoring as
> outweighed by the benefit, I won't stand in the way.   Please lets
> just keep the RandomDataGenerator convenience class in the random
> package in any case.  I will take care of whatever adjustments are
> needed to adapt to whatever we settle on for sampling in the
> distributions package.

Well, it is not really necessary to do everything together and refactor
the distributions.

Probably it is better to start the other way round, and describe what I
want to add, and see how other things fit in:

 * I want a generic Sampler interface, i.e. something like this:
 ** nextSample()
 ** nextSamples(int size)
 ** nextSamples(double[] samples)

there could be a DiscreteSampler and ContinuousSampler interface to
handle the cases for int / double values.

The distributions could either be changed to return such a sampler as
Gilles proposed (with the advantage that no random instance is tied to
the distribution itself), or implement the interface directly (with the
advantage that we would not need to refactor too much).

>> One can then create a sampler for any distribution or from other
>> sources, e.g. when needing a fast and efficient sampler without
>> replacement (see MATH-1239).
> 
> +1 for sequential sampling.  I don't follow exactly why that
> requires refactoring the distributions; but if it helps in a way I
> don't yet understand, that will help convince me that refactoring
> sampling in the distributions package is worth the user pain.  

as I said above, I wanted to combine two things in one step, maybe it is
better to go step by step.

Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to