Re: [HACKERS] pgbench gaussian/exponential docs improvements

Fabien COELHO Sun, 25 Oct 2015 23:30:12 -0700

I was not only thinking of mathematical figures, I was also thinking of
graphics; some formats may be a zip containing XML stuff, for instance.


But we don't need it here, so why should we care about it too much?

I was just digressing from the main subject :-) Having some graphics in the doc would help here and there, though.

I do understand that. I'm trying to explain that "threshold" is in fact completely disconnected from min and max, as the transformation scales the data to [-1,1] like this:
   2.0 * (i - min - mu + 0.5) / (max - min + 1)
and only then the 'threshold' coefficient is applied. And if I read the Box-Muller transformation correctly, it generates data with a standard normal distribution from [-threshold, threshold] and then transforms them to the right mean etc.

Yep, the threshold parameter is designed to be somewhat independent of the actual [min, max] range.
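
To illustrate why the threshold does not depend on the scale of the data, here is a minimal Python sketch of the sampling scheme described above: draw a standard normal deviate with Box-Muller, reject it if it falls outside [-threshold, threshold), then rescale to the integer range. This is an illustration under those assumptions, not pgbench's actual source; the names gaussian_rand, lo and hi are hypothetical.

```python
import math
import random


def gaussian_rand(rng, lo, hi, threshold):
    """Draw an integer in [lo, hi] from a bell curve truncated at +/-threshold.

    Hypothetical helper for illustration; not taken from pgbench.
    """
    while True:
        # Box-Muller: two uniform draws give one standard normal deviate.
        u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is defined
        u2 = rng.random()
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        # Rejection step: keep only deviates inside the truncation window.
        if -threshold <= z < threshold:
            break
    # Rescale [-threshold, threshold) to [0, 1), then to the integer range.
    frac = (z + threshold) / (2.0 * threshold)
    # Clamp to guard against floating-point rounding at the upper edge.
    return min(hi, lo + int((hi - lo + 1) * frac))
```

Note that lo and hi only appear in the last step, so the same threshold produces the same shape whether the range is 1..100 or 1..1000000.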

But maybe that's what the first sentence is trying to say? I mean this:

   For a Gaussian distribution, the interval is mapped onto a standard
   normal distribution (the classical bell-shaped Gaussian curve)
   truncated at -threshold on the left and +threshold on the right.


Yep, that looks like it.

I'm asking about this because it wasn't immediately clear to me whether I need to tweak this for data sets with different scales, but apparently not.


Indeed, this is the idea of how the parameter is used.

After reading the docs again I think that's also clear from the last sentence, which relates threshold to 67% and 95%.


Yep.
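
The 67% and 95% figures can be checked numerically. The sketch below assumes that the central fraction q of the interval corresponds to deviates within q * threshold of the mean, so that the middle 1.0/threshold of the interval is one standard deviation either side. That reading and the helper names phi and mass_in_middle are assumptions made for illustration.

```python
import math


def phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mass_in_middle(fraction, threshold):
    """Share of draws landing in the central `fraction` of the interval.

    Assumes a normal distribution truncated at +/-threshold and mapped
    linearly onto the interval (hypothetical helper, not from pgbench).
    """
    inside = 2.0 * phi(fraction * threshold) - 1.0
    total = 2.0 * phi(threshold) - 1.0  # mass kept after truncation
    return inside / total


threshold = 4.0
print(mass_in_middle(1.0 / threshold, threshold))  # middle quarter
print(mass_in_middle(2.0 / threshold, threshold))  # middle half
```

With threshold 4.0 the two values come out close to 0.68 and 0.95, matching the one-sigma and two-sigma rule for a normal distribution.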

Anyway, the references to "standard normal distribution" are a bit sloppy - "standard" usually means normal distribution with exactly mu=0 and sigma=1. So it's a bit strange to say
   standard normal distribution, with mean mu defined as (max+min)/2.0
because that's not a standard normal distribution at all. I propose to fix this by removing the "standard".


Hmmm, probably fine if it is both more precise and shorter!

[...]
 CDF2(x) = PHI(2.0 * threshold * ...) / (2.0 * PHI(threshold) - 1.0)

and then the probability of "i" is

 P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)
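
As a sanity check on this definition, the following Python sketch evaluates P(X=i) and verifies that the probabilities over [min, max] sum to one. The argument of PHI is elided in the formula above; the code assumes it is (x - mu) / (max - min + 1) with mu = (max + min) / 2.0, and the names phi, cdf2 and prob are hypothetical.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI above)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf2(x, lo, hi, threshold):
    """Shifted and scaled CDF.

    ASSUMPTION: the elided factor is (x - mu) / (hi - lo + 1),
    with mu = (hi + lo) / 2.0.
    """
    mu = (hi + lo) / 2.0
    scaled = 2.0 * threshold * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(threshold) - 1.0)


def prob(i, lo, hi, threshold):
    """Probability of drawing the integer i in [lo, hi]."""
    return cdf2(i + 0.5, lo, hi, threshold) - cdf2(i - 0.5, lo, hi, threshold)


lo, hi, threshold = 1, 10, 2.5
total = sum(prob(i, lo, hi, threshold) for i in range(lo, hi + 1))
print(total)  # should be 1 up to rounding error
```

Under that assumption the sum telescopes to CDF2(max+0.5) - CDF2(min-0.5), where the PHI arguments are exactly +threshold and -threshold, which is why the normalisation by 2.0 * PHI(threshold) - 1.0 makes the total one.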

I agree that defining the shifted/scaled CDF and using it afterwards looks cleaner.

Which is what I meant by simplifying the equation. Not that it'd make it easier to imagine the shape, though ...

Sure. This is the part about providing the "precise" information: what the actual probability of drawing i is, depending on the parameters.

Maybe. Another thing is that "middle quarter" and "middle half" seem a bit strange - if you split data into 1/4s there's no middle one (sure, I understand what the sentence is meant to say).


Improvements are welcome!

Ok. I think that the fact that it relies on the Box-Muller transform is
relevant, because there are other methods to generate a gaussian
distribution, and I would say that there is no reason to have to go to
the source code to check that. But I would not provide further details.
So I'm fine with the current status.

There are alternative methods for almost every non-trivial piece of code, and we generally don't mention that in user docs. Why should we mention it in this case? Why would the user care which particular PRNG was used to generate the numbers? Maybe there really is a reason for that, I don't know.

If it were about security: because a method has just been announced to be broken, and you want to know whether you depend on it.

As a scientist, I like it when fellow scientists who achieved useful things have their name cited :-).


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
