On Sat, Nov 5, 2022 at 9:00 AM Tim Düsterhus <t...@bastelstu.be> wrote:

>
> Likewise if you generate a random float between 0 and 1000 with this
> method, some values will appear more often than others due to rounding
> and the changing density of floats for each power of two.
>
> With the γ-section algorithm by Prof. Goualard all these issues are
> eliminated and that's what is used for getFloat(). The getFloat() method
> supports all 4 possible boundary combinations to ensure that users have
> a safe solution for all possible use cases, so that they don't need to
> build an unsafe solution in userland.
>
>
This is a much more subtle problem than many people realize, and I'm sure
there are people who will vote on or use this feature without a deep
understanding of either the float/double type or the math behind this, so
I'm going to re-explain it using some analogies and examples.

**NOTE**: This is mainly for the historical record, for anyone who looks
back on this internals thread later; it is not in response to any individual
in the thread. I'm trying to lay out, in language that any passer-by could
follow, what the issue is and why it matters. Accordingly, I'll be using
base-10 numbers instead of base-2 numbers through most of this for clarity,
but the same ideas apply to the base-2 numbers that are actually used, just
with the boundaries falling at powers of 2 instead of powers of 10.

Suppose you have a number type that can store exactly 10 digits. Over the
interval [0, 1] you can represent any number from 0.000 000 000 to
1.000 000 000 without issue, in steps of 0.000 000 001. In fact, since you
can store exactly 10 digits, you can represent any number from
0.000 000 000 to 9.999 999 999 in those same steps.

But what happens if you want the interval [0, 10]? At 10 you can only store
10.000 000 00: one of the digits that used to represent the fractional part
now has to represent the whole part, because you have a fixed number of
digits to work with. In order to represent larger numbers, you lose the
ability to represent certain numbers that you used to be able to represent.
The "density" of representable numbers has gone down.
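
(If you want to see the real base-2 version of this from a PHP shell, here
is a quick sketch; it only uses the standard PHP_FLOAT_EPSILON constant and
is not tied to the proposal in any way.)

```php
<?php

// PHP_FLOAT_EPSILON is the gap between 1.0 and the next representable double.
var_dump(1.0 + PHP_FLOAT_EPSILON > 1.0);     // bool(true):  the gap at 1.0 is exactly representable
var_dump(2.0 + PHP_FLOAT_EPSILON > 2.0);     // bool(false): too small; rounds back down to 2.0
var_dump(2.0 + 2 * PHP_FLOAT_EPSILON > 2.0); // bool(true):  the gap at 2.0 is twice as wide

// Far enough up the range, even whole numbers start to be skipped:
var_dump(2.0 ** 53 + 1.0 === 2.0 ** 53);     // bool(true)
```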

So if you tell the program "I want a random number, with decimal places,
between 0 and 1000", it runs into the following dilemma: on [0, 10) (that
is, the interval that includes 0 but excludes 10), there are 1 billion
representable values between each whole number that could be chosen. On
[10, 100), however, there are only 100 million values between each whole
number, because you lost a digit after the decimal point. This means that
if you do a naive multiplication or selection of your number, each unit
interval between 0 and 10 is individually 10 times more likely to be chosen
than any individual unit interval between 10 and 100, because it has 10
times as many representable values mapped to it.
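
To make "naive multiplication" concrete, here is roughly the pattern most
userland code uses today. This is a sketch for illustration only, not
anyone's proposed implementation; the function name is mine.

```php
<?php

// A common userland pattern for "a random float between $min and $max".
// The final scaling step rounds the result onto a grid of doubles whose
// spacing changes at every power of two, and the source only has
// mt_getrandmax() + 1 possible values to begin with -- far fewer than
// the number of doubles in a range like [0, 1000].
function naiveRandomFloat(float $min, float $max): float
{
    return $min + ($max - $min) * (mt_rand() / mt_getrandmax());
}

var_dump(naiveRandomFloat(0.0, 1000.0));
```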

The result would actually be nearly as likely to fall in [0, 10) as in
[10, 100) (because there are nearly as many representable values in [0, 10)
as there are in [10, 100)), even though mathematically the second range is
nearly 10 times the size of the first. Each representable value would be
equally likely to be chosen, but because the *density* of values differs
across the range, this skews the probability of a number landing within any
given part of the output range.
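
Multiplying out the per-unit counts from the analogy above makes the
"nearly equally likely" part visible:

```php
<?php

// Representable values in the 10-digit analogy, assuming every
// representable value is equally likely to be picked:
$zeroToTen    = 10 * 1_000_000_000; // [0, 10):   1 billion values per unit
$tenToHundred = 90 *   100_000_000; // [10, 100): 100 million values per unit

printf("[0, 10):   %d values\n", $zeroToTen);    // 10,000,000,000
printf("[10, 100): %d values\n", $tenToHundred); //  9,000,000,000
```

So [0, 10) actually contains slightly *more* representable values than
[10, 100), despite covering one ninth as much of the number line.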

This means that if your code does something like `if ($rand->getFloat(0,
1000) < 500)`, you're NOT actually going to get a 50% chance of `true` and
a 50% chance of `false`. In fact, using the naive way of mapping floats
that Tim mentioned (with base-10 math instead of base-2), you'd get `true`
*over two thirds of the time*.
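
For reference, this is roughly how the proposed method would be used; the
\Random\Randomizer class comes from the new Random extension, and the
getFloat() signature and IntervalBoundary enum here are taken from the RFC
as I read it, so treat this as a sketch rather than the final word.

```php
<?php

use Random\IntervalBoundary;
use Random\Randomizer;

$rand = new Randomizer();

// Ask for the closed interval [0, 1000]. With the gamma-section method the
// result is uniform over the requested range, so this comparison really is
// (to within a hair) a 50/50 coin flip -- unlike the naive mapping above.
$value = $rand->getFloat(0, 1000, IntervalBoundary::ClosedClosed);

var_dump($value < 500);
```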

Accounting for this is not easy. What proportion of your total output range
lies in regions where the representable values are more (or less) dense? In
my example that might be easy to calculate, but the actual math for a
program like this is done in base-2, not base-10. To calculate it you need
to take all sorts of logarithms to figure out how many powers of 2 you're
spanning, and work out how that affects the probability of different
intervals. It is *extremely* easy to get this math wrong, even if you know
what you're doing. As Tim said, virtually none of the implementations that
exist *actually* do this correctly.
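
Just to give a flavour of the bookkeeping involved, here is a
back-of-the-envelope helper (something I'm making up purely for
illustration, not part of any proposal) that counts how many power-of-two
boundaries a requested range crosses:

```php
<?php

// Roughly how many power-of-two boundaries does a requested range cross?
// (Illustration only: a real implementation would have to be much more
// careful about zero, negative bounds, subnormals, and exact powers of two.)
function powerOfTwoBoundariesCrossed(float $min, float $max): int
{
    return (int) floor(log($max, 2)) - (int) floor(log($min, 2));
}

var_dump(powerOfTwoBoundariesCrossed(0.001, 1000.0)); // int(19)
```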

Also consider that the requested range may not line up nicely with a power
of 2. What if you want a float for the interval [5, 10]? Neither 5 nor 10
sits on a power of 2, yet the boundary at 8 (2^3) falls right in the middle
of the range, so the density of representable values changes partway
through. How does that affect your math?
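
To see that with real doubles: the little probe below is only a userland
approximation I'm improvising here (round-to-nearest-even means it can be
off by a factor of two for some inputs), but it is exact for these two
values.

```php
<?php

// Probe the gap between $x and the next larger double by finding the
// smallest power of two that still changes the value when added.
function approxGap(float $x): float
{
    $gap = 1.0;
    while ($x + $gap / 2 > $x) {
        $gap /= 2;
    }
    return $gap;
}

// The interval [5, 10] straddles 8 = 2^3, so the spacing of doubles
// doubles partway through the requested range:
printf("%.3g\n", approxGap(5.0)); // ~8.88e-16 (gap below 8)
printf("%.3g\n", approxGap(9.0)); // ~1.78e-15 (gap above 8)
```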

Like cryptography and security, this is an area where very specific and
deep expertise is necessary, and it is highly preferable to have everyone
use a single vetted solution rather than roll their own. That makes it a
very poor thing to leave to userland, and makes it *highly* desirable to
include in core.

It's very difficult to do correctly, but there is also only one actually
correct result, which makes it a perfect candidate for inclusion in core.

Jordan
