Re: [PHP-DEV] RFC [Discussion]: Randomizer Additions

Tim Düsterhus Tue, 18 Oct 2022 10:22:33 -0700

Hi

On 10/16/22 22:24, Dan Ackroyd wrote:

Shall an option be added to getFloat() that changes the logic to

select from [$min, $max] (i.e. allowing the maximum to be returned)? And
how should that look like? Boolean parameter? Enum?


An enum would probably be nice, and possibly be for all four cases of
min_(inclusive|exclusive)_max_(inclusive|exclusive) unless there is a
technical reason to not include all of them.

No technical reason. The paper for the γ-section algorithm by Prof. Goualard includes implementations for all four combinations.

The [closed, open) and [closed, closed] variants together are the most useful combination, though. The former can be cleanly split into subintervals, whereas the latter is symmetric, which is useful if your use case is as well.

With rejection sampling the [closed, closed] variant can also be turned into any of the other three without introducing any bias.
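
A minimal sketch of that rejection step, assuming some sampler for the closed interval already exists. The names sampleClosedClosed() and sampleClosedOpen() are hypothetical, and the stand-in sampler below is a simple scaling of mt_rand(), not the γ-section algorithm:

```php
<?php
declare(strict_types=1);

/**
 * Stand-in sampler for [$min, $max]. Hypothetical, NOT the γ-section algorithm.
 * mt_rand() / mt_getrandmax() can return exactly 1.0, so $max is reachable.
 */
function sampleClosedClosed(float $min, float $max): float
{
    return $min + ($max - $min) * (mt_rand() / mt_getrandmax());
}

/**
 * Derive [$min, $max) by rejecting the single excluded value and drawing again.
 * The relative probabilities of all remaining values are left untouched.
 */
function sampleClosedOpen(callable $closedClosed, float $min, float $max): float
{
    do {
        $x = $closedClosed($min, $max);
    } while ($x === $max);

    return $x;
}

$value = sampleClosedOpen(sampleClosedClosed(...), 0.0, 1.0);
```

Rejecting $min as well, or only $min, gives the two remaining variants in the same way.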

Generating a random string containing specific characters...thus requires 
multiple lines of code for what effectively is a very simple operation.


Yeah, though those lines of code add distinction and emphasis for what is
meant by character.

In particular, users might be surprised when they give this string
"abc😋👨‍👩‍👦"* and get a non-ascii result.

That's why the method includes 'bytes' in its name. This term is also used in ->shuffleBytes() which was renamed in https://wiki.php.net/rfc/random_extension_improvement due to this exact problem.

In fact ->shuffleBytes() can be considered the companion method to the proposed ->getBytesFromAlphabet():

->shuffleBytes() allows you to simulate a multivariate hypergeometric distribution ("sampling without replacement") by shuffling the input string and then using 'substr' on the result to select the first 'n' bytes.

->getBytesFromAlphabet() directly maps to a multinomial distribution ("sampling with replacement").
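
To illustrate the difference, here is a small sketch assuming PHP 8.2's Random\Randomizer. The second half is a userland stand-in written with ->getInt(), since ->getBytesFromAlphabet() is only proposed; the variable names are illustrative:

```php
<?php
declare(strict_types=1);

use Random\Randomizer;

$randomizer = new Randomizer(); // defaults to the 'Secure' engine
$alphabet = '0123456789';
$n = 4;

// Without replacement: shuffle, then keep the first $n bytes.
// Each byte of $alphabet appears at most once in the result.
$without = substr($randomizer->shuffleBytes($alphabet), 0, $n);

// With replacement: pick an independent offset for every output byte.
// Userland stand-in for the proposed method; bytes may repeat.
$with = '';
for ($i = 0; $i < $n; $i++) {
    $with .= $alphabet[$randomizer->getInt(0, strlen($alphabet) - 1)];
}
```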

You're going to need to be really precise on the naming and I'm not at
all sure there is a single version that would be useful enough to
belong in core.

Personally I believe the proposed ->getBytesFromAlphabet() to be useful enough to belong in core. There are quite a few use cases that are naturally restricted to ASCII: Basically everything that can be considered an identifier.


Arbitrary numeric strings alone likely provide plenty of use cases:

- Backup codes for multi-factor authentication (restricting the input to digits allows you to leverage a numeric keyboard, reducing the chance for input errors).

- Random phone numbers for testing purposes.

- Random credit card numbers (you just need to calculate the checksum yourself).
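
For the last item, the checksum used by payment card numbers is the Luhn algorithm. A rough sketch of what "calculating it yourself" could look like, using random_int() for the digits; both function names are hypothetical helpers and not part of the RFC:

```php
<?php
declare(strict_types=1);

/** Compute the Luhn check digit for a string of decimal digits. */
function luhnCheckDigit(string $payload): int
{
    $sum = 0;
    $double = true; // the rightmost payload digit is doubled first
    for ($i = strlen($payload) - 1; $i >= 0; $i--) {
        $digit = (int) $payload[$i];
        if ($double) {
            $digit *= 2;
            if ($digit > 9) {
                $digit -= 9; // same as adding the two digits of the product
            }
        }
        $sum += $digit;
        $double = !$double;
    }

    return (10 - $sum % 10) % 10;
}

/** Build a random, Luhn-valid test number of $length digits (hypothetical helper). */
function randomTestCardNumber(int $length = 16): string
{
    $payload = '';
    for ($i = 0; $i < $length - 1; $i++) {
        $payload .= (string) random_int(0, 9);
    }

    return $payload . luhnCheckDigit($payload);
}
```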

While hexadecimal strings can easily be generated by applying bin2hex to the output of ->getBytes(), that is also unintuitive, because the result length is twice the number of bytes.
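
A short illustration of the length mismatch, using random_bytes() in place of ->getBytes() so that it runs without a Randomizer instance; the variable names are illustrative:

```php
<?php
declare(strict_types=1);

$bytes = random_bytes(8); // 8 bytes of randomness
$hex = bin2hex($bytes);   // two hex characters per byte

// strlen($hex) is 16, not 8. To get $n hex characters you have to
// request intdiv($n + 1, 2) bytes and trim the result.
$n = 5;
$fiveHexChars = substr(bin2hex(random_bytes(intdiv($n + 1, 2))), 0, $n);
```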

whereas a 64 Bit engine could generate randomness for 8 characters at once.


I'm really not sure that many programs are going to be speed limited
by random number generation.

The syscall cost to retrieve bytes from the 'Secure' engine (which is the default engine) can be expensive. Especially on older operating system versions and depending on how many more Meltdown/Spectre-style vulnerabilities they find.

A userland implementation that generates 1000 random numeric strings with 100 characters each using the 'Secure' engine requires 146ms on my computer. The native implementation without optimization requires 101ms and the optimized native implementation 26ms.

For Xoshiro256** (the fastest engine) the numbers are 89ms, 7ms and 3ms respectively.


Benchmark attached.

For those that are, writing their own generator to consume all 64 bits
of randomness for each call sounds reasonably sensible, unless a
useful general api can be thought of.

This cannot be reasonably done in userland, because you pay an increased cost to turn the bytes into numbers and then to perform the necessary bit fiddling to debias the numbers.
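
To give an idea of the work involved, here is a hedged userland sketch that fetches 8 bytes per call and debiases them by masking plus rejection. This is neither the attached benchmark nor the native implementation; the function name and the batching strategy are assumptions made for illustration:

```php
<?php
declare(strict_types=1);

/**
 * Userland sketch: pick $length bytes from $alphabet (at most 256 bytes long)
 * using masking plus rejection sampling to avoid modulo bias.
 */
function userlandBytesFromAlphabet(string $alphabet, int $length): string
{
    $max = strlen($alphabet) - 1;
    if ($max < 0 || $max > 255) {
        throw new ValueError('Alphabet must contain 1 to 256 bytes');
    }

    // Smallest all-ones bit mask that covers $max.
    $mask = 0;
    while ($mask < $max) {
        $mask = ($mask << 1) | 1;
    }

    $result = '';
    while (strlen($result) < $length) {
        // One call fetches a batch of bytes; each byte still has to be
        // converted to an integer and checked individually.
        foreach (unpack('C*', random_bytes(8)) as $byte) {
            $offset = $byte & $mask;
            if ($offset > $max) {
                continue; // reject out-of-range values instead of wrapping
            }
            $result .= $alphabet[$offset];
            if (strlen($result) === $length) {
                break;
            }
        }
    }

    return $result;
}
```

The unpack() call and the per-byte loop are exactly the kind of overhead that a native implementation avoids.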

For the float side of the RFC, as there are technical limitations on
which platforms it would be usable on, there needs to be a way of
determining whether the nextFloat and getFloat methods are going to
work. The way this is done on Imagick is to put appropriate defines in
the stub file and in the C code implementations so that the methods
aren't available on the class for the platforms where it isn't going
to function correctly.

I made a PR for that to Tim's repo, though I don't know of an
environment where it can be tested.

I've seen the PR and we briefly discussed this in chat. I would defer this to the code review, because this amounts to an implementation detail. Especially since all reasonable server platforms use IEEE 754.


Best regards
Tim Düsterhus

<<attachment: alphabet-benchmark.php>>

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
