I was thinking not only of mathematical figures but also of graphics;
some formats may be a zip containing XML stuff, for instance.
But we don't need it here, so why should we care about it too much?
I was just digressing from the main subject :-) Having some graphics in
the doc would help here and there, though.
I do understand that. I'm trying to explain that "threshold" is in fact
completely disconnected from min and max, as the transformation scales the
data to [-1,1] like this
2.0 * (i - min - mu + 0.5) / (max - min + 1)
and only then the 'threshold' coefficient is applied. And if I read the
Box-Muller transformation correctly, it generates data with standard Normal
distribution from [-threshold, threshold] and then transforms them to the
right mean etc.
Yep, the threshold parameter is designed to be somewhat independent of the
actual [min, max] range.
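To illustrate the point, here is a minimal Python sketch of how I understand
the mechanism: draw a standard normal value with Box-Muller, reject it if it
falls outside [-threshold, threshold], and only then map it onto [lo, hi].
The function name, the rejection loop and the final mapping are my
assumptions for illustration, not a copy of the actual C code.

```python
import math
import random


def truncated_gaussian_int(lo, hi, threshold, rng=random):
    """Draw an integer in [lo, hi] from a normal curve truncated at +/- threshold.

    Hypothetical helper: the rejection loop and the mapping are assumptions.
    """
    while True:
        u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is always defined
        u2 = rng.random()
        # Box-Muller: one standard normal deviate (mu=0, sigma=1)
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        if -threshold <= z < threshold:
            break  # keep only values inside the truncation bounds
    # Rescale [-threshold, threshold) to [0, 1); lo and hi play no role yet
    frac = (z + threshold) / (2.0 * threshold)
    # Only now is the actual range applied; min() guards against rounding
    return min(hi, lo + int((hi - lo + 1) * frac))
```

With the same random stream, shifting the range by 1000 shifts every drawn
value by exactly 1000, and threshold never needs to know about lo or hi.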
But maybe that's what the first sentence is trying to say? I mean this:
For a Gaussian distribution, the interval is mapped onto a standard
normal distribution (the classical bell-shaped Gaussian curve)
truncated at -threshold on the left and +threshold on the right.
Yep, that looks like it.
I'm asking about this because it wasn't immediately clear to me whether I
need to tweak this for data sets with different scales, but apparently not.
Indeed, this is the idea of how the parameter is used.
After reading the docs again I think that's also clear from the last sentence
that relates threshold and 67% and 95%.
Yep.
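As a quick check of that relation, here is a small Python sketch computing
how much probability mass falls in a central fraction of the interval. The
helper names are hypothetical, and threshold = 4.0 is my assumption for
illustration, since this message does not say which threshold the 67% and
95% figures refer to.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (mu=0, sigma=1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def central_mass(fraction, threshold):
    """Mass in the central `fraction` of the interval, for a normal curve
    truncated at -threshold and +threshold (hypothetical helper)."""
    z = fraction * threshold  # half-width of the central part, in sigmas
    return (2.0 * phi(z) - 1.0) / (2.0 * phi(threshold) - 1.0)


# Assumed threshold of 4.0; note that min and max do not appear anywhere
print(central_mass(0.25, 4.0))  # central quarter of the interval
print(central_mass(0.50, 4.0))  # central half of the interval
```

Under that assumption the two values come out at roughly 0.68 and 0.95,
close to the figures in the docs, and they depend only on threshold.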
Anyway, the references to "standard normal distribution" are a bit sloppy -
"standard" usually means normal distribution with exactly mu=0 and sigma=1.
So it's a bit strange to say
standard normal distribution, with mean mu defined as (max+min)/2.0
because that's not a standard normal distribution at all. I propose to fix
this by removing the "standard".
Hmmm, probably fine if it is both more precise and shorter!
[...]
CDF2(x) = PHI(2.0 * threshold * ...) / (2.0 * PHI(threshold) - 1.0)
and then the probability of "i" is
P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)
I agree that defining the shifted/scaled CDF and using it afterwards looks
cleaner.
Which is what I meant by simplifying the equation. Not that it'd make it
easier to imagine the shape, though ...
Sure. This is the part about providing the "precise" information, what is
the actual probability of drawing i depending on the parameters.
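For what it is worth, the two-step definition is easy to evaluate
numerically. The Python sketch below is only an illustration: the argument
elided as "..." above is not spelled out in this message, so the code
assumes it is (x - mu) / (max - min + 1) with mu = (max + min) / 2.0, pieced
together from the scaling and the mean mentioned earlier in the thread. The
function names are hypothetical.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (mu=0, sigma=1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf2(x, lo, hi, threshold):
    """Shifted/scaled CDF; the scaled argument is an assumption (see above)."""
    mu = (hi + lo) / 2.0
    scaled = 2.0 * threshold * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(threshold) - 1.0)


def prob(i, lo, hi, threshold):
    """P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)."""
    return cdf2(i + 0.5, lo, hi, threshold) - cdf2(i - 0.5, lo, hi, threshold)


# Probabilities for every integer in an example range [1, 100]
probs = [prob(i, 1, 100, 4.0) for i in range(1, 101)]
print(sum(probs))           # should be 1 up to rounding
print(probs[0], probs[49])  # tail value vs. value near the middle
```

Under that assumption the probabilities over [min, max] sum to 1, because
the scaled argument runs exactly from -threshold to +threshold.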
Maybe. Another thing is that "middle quarter" and "middle half" seem a bit
strange - if you split data into 1/4s there's no middle one (sure, I
understand what the sentence is meant to say).
Improvements are welcome!
Ok. I think that the fact that it relies on the Box-Muller transform is
relevant, because there are other methods to generate a Gaussian
distribution, and I would say that there is no reason to have to go to
the source code to check that. But I would not provide further details.
So I'm fine with the current status.
There are alternative methods for almost every non-trivial piece of code, and
we generally don't mention that in user docs. Why should we mention it in
this case? Why would the user care which particular PRNG was used to generate
the numbers? Maybe there really is a reason for that, I don't know.
If it were a matter of security: one method has just been announced to be
broken, and you want to know whether you depend on it.
As a scientist, I like it when fellow scientists who achieved useful
things have their name cited :-).
--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)