Hi

On 10/16/22 22:24, Dan Ackroyd wrote:
Shall an option be added to getFloat() that changes the logic to
select from [$min, $max] (i.e. allowing the maximum to be returned)? And
what should that look like? Boolean parameter? Enum?

An enum would probably be nice, and possibly be for all four cases of
min_(inclusive|exclusive)_max_(inclusive|exclusive) unless there is a
technical reason to not include all of them.

No technical reason. The paper for the γ-section algorithm by Prof. Goualard includes implementations for all four combinations.

The [closed, open) and [closed, closed] variants together are the most useful combination, though. The former can be cleanly split into subintervals, whereas the latter is symmetric which is useful if your use case is as well.

With rejection sampling the [closed, closed] variant can also be turned into any of the other three without introducing any bias.
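
A minimal sketch of that rejection step, to make the idea concrete. The function names are made up for illustration, and sampleClosedClosed() is only a naive stand-in for a [closed, closed] sampler (a uniform grid on [0.0, 1.0]); it is not the γ-section algorithm from the paper. Redrawing whenever an excluded endpoint comes up leaves all remaining values equally likely.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a [closed, closed] sampler on [0.0, 1.0].
// NOT the γ-section algorithm; it only feeds the rejection loop below.
function sampleClosedClosed(): float
{
    return random_int(0, 2 ** 53) / (float) (2 ** 53);
}

// Turns a [closed, closed] sampler into any of the other three variants.
// $closedClosed is assumed to return values within [$min, $max].
function sampleWithRejection(
    callable $closedClosed,
    float $min,
    float $max,
    bool $includeMin,
    bool $includeMax
): float {
    do {
        $value = $closedClosed();
        // Redraw if an excluded endpoint was hit.
        $rejected = (!$includeMin && $value === $min)
            || (!$includeMax && $value === $max);
    } while ($rejected);

    return $value;
}

// [0.0, 1.0): the maximum is never returned.
$halfOpen = sampleWithRejection('sampleClosedClosed', 0.0, 1.0, true, false);
```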

Generating a random string containing specific characters...thus requires 
multiple lines of code for what effectively is a very simple operation.

Yeah, though those lines of code add distinction and emphasis for what is
meant by character.

In particular, users might be surprised when they give this string
"abc😋👨‍👩‍👦"* and get a non-ascii result.

That's why the method includes 'bytes' in its name. This term is also used in ->shuffleBytes() which was renamed in https://wiki.php.net/rfc/random_extension_improvement due to this exact problem.
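
A small illustration of why the byte-based naming matters, using only functions that exist today. The example string is my own; the emoji is written as an escape so the snippet does not depend on the file encoding.

```php
<?php
declare(strict_types=1);

// "abc" followed by U+1F60B, which takes 4 bytes in UTF-8.
$alphabet = "abc\u{1F60B}";

// strlen() counts bytes, not characters: 3 + 4 = 7.
$byteCount = strlen($alphabet);

// Picking a single byte out of the emoji yields a lone continuation byte,
// which is not valid UTF-8 on its own.
$lastByte = $alphabet[6];
$isValidUtf8 = preg_match('//u', $lastByte) === 1;
```

Any byte-wise selection from such a string can therefore return fragments of a multi-byte character, which is exactly what 'bytes' in the method name is meant to signal.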

In fact ->shuffleBytes() can be considered the companion method to the proposed ->getBytesFromAlphabet():

->shuffleBytes() allows simulating a multivariate hypergeometric distribution ("sampling without replacement") by shuffling the input string and then using 'substr' on the result to select the first 'n' bytes.

->getBytesFromAlphabet() directly maps to a multinomial distribution ("sampling with replacement").
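
To make the two sampling modes concrete, here is a rough sketch using only the Randomizer methods available in PHP 8.2 (shuffleBytes() and getInt()). The helper names are hypothetical, and sampleWithReplacement() is just a userland approximation of what the proposed method would do, not its implementation.

```php
<?php
declare(strict_types=1);

use Random\Randomizer;

// Sampling without replacement: shuffle, then take the first $n bytes.
function sampleWithoutReplacement(Randomizer $r, string $bytes, int $n): string
{
    return substr($r->shuffleBytes($bytes), 0, $n);
}

// Sampling with replacement: pick an independent index for every output byte.
// Assumes a non-empty $alphabet; getInt() throws for an empty range.
function sampleWithReplacement(Randomizer $r, string $alphabet, int $n): string
{
    $last = strlen($alphabet) - 1;
    $result = '';
    for ($i = 0; $i < $n; $i++) {
        $result .= $alphabet[$r->getInt(0, $last)];
    }

    return $result;
}
```

With the input 'abcdef', the first helper never repeats a byte, whereas the second one may return something like 'aab'.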

You're going to need to be really precise on the naming and I'm not at
all sure there is a single version that would be useful enough to
belong in core.

Personally I believe the proposed ->getBytesFromAlphabet() to be useful enough to belong in core. There are quite a few use cases that are naturally restricted to ASCII: Basically everything that can be considered an identifier.

Arbitrary numeric strings alone likely provide plenty of use cases:

- Backup codes for multi-factor authentication (restricting the input to digits allows you to leverage a numeric keyboard, reducing the chance for input errors).
- Random phone numbers for testing purposes.
- Random credit card numbers (you just need to calculate the checksum yourself).
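
For the last item, a sketch of what "calculating the checksum yourself" could look like, assuming the usual Luhn check digit. The function names are my own, and the digits are drawn with random_int() since the proposed method does not exist yet. The result is only Luhn-valid test data, not a real card number.

```php
<?php
declare(strict_types=1);

// Computes the Luhn check digit for a string of decimal digits.
function luhnCheckDigit(string $payload): int
{
    $sum = 0;
    $double = true; // the rightmost payload digit is doubled first
    for ($i = strlen($payload) - 1; $i >= 0; $i--) {
        $digit = (int) $payload[$i];
        if ($double) {
            $digit *= 2;
            if ($digit > 9) {
                $digit -= 9;
            }
        }
        $sum += $digit;
        $double = !$double;
    }

    return (10 - $sum % 10) % 10;
}

// Generates $length digits in total: random payload plus check digit.
function randomLuhnNumber(int $length): string
{
    $payload = '';
    for ($i = 0; $i < $length - 1; $i++) {
        $payload .= (string) random_int(0, 9);
    }

    return $payload . luhnCheckDigit($payload);
}
```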

While hexadecimal strings can easily be generated by applying bin2hex to the output of ->getBytes(), that is also unintuitive, because the result length is twice the number of bytes.
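
A short sketch of that length mismatch, using random_bytes() in place of ->getBytes() so it runs on older versions too. randomHex() is a hypothetical helper showing the extra arithmetic needed to get a specific number of hex characters.

```php
<?php
declare(strict_types=1);

// 8 random bytes become 16 hexadecimal characters.
$hex = bin2hex(random_bytes(8));

// Hypothetical helper: request enough bytes, then cut to the wanted length.
function randomHex(int $length): string
{
    if ($length < 1) {
        throw new InvalidArgumentException('Length must be at least 1.');
    }
    // Round up so that odd lengths are covered as well.
    $bytes = random_bytes(intdiv($length + 1, 2));

    return substr(bin2hex($bytes), 0, $length);
}
```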

whereas a 64 Bit engine could generate randomness for 8 characters at once.

I'm really not sure that many programs are going to be speed limited
by random number generation.

The syscall to retrieve bytes for the 'Secure' engine (which is the default engine) can be expensive, especially on older operating system versions and depending on how many more Meltdown/Spectre-style vulnerabilities are found.

A userland implementation that generates 1000 random numeric strings with 100 characters each using the 'Secure' engine requires 146ms on my computer. The native implementation without optimization requires 101ms and the optimized native implementation 26ms.

For Xoshiro256** (the fastest engine) the numbers are 89ms, 7ms and 3ms respectively.

Benchmark attached.

For those that are, writing their own generator to consume all 64 bits
of randomness for each call sounds reasonably sensible, unless a
useful general api can be thought of.

This cannot be reasonably done in userland, because you pay an increased cost to turn the bytes into numbers and then to perform the necessary bit fiddling to debias the numbers.
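
To illustrate the overhead, here is a rough userland sketch of the steps involved. It is not the code from the attached benchmark and not the native implementation; the function name is made up, and the debiasing shown is plain bitmask-and-reject on single bytes. Every output byte needs an ord(), a mask, a comparison and a string append in PHP code, which is where the time goes.

```php
<?php
declare(strict_types=1);

// Hypothetical userland helper: $length bytes chosen uniformly from $alphabet.
function bytesFromAlphabetUserland(string $alphabet, int $length): string
{
    $size = strlen($alphabet);
    if ($size < 1 || $size > 256) {
        throw new InvalidArgumentException('Alphabet must have 1 to 256 bytes.');
    }

    // Smallest all-ones bitmask that covers the indexes 0 .. $size - 1.
    $mask = 0;
    while ($mask < $size - 1) {
        $mask = ($mask << 1) | 1;
    }

    $result = '';
    while (strlen($result) < $length) {
        // Fetch a batch of random bytes to limit the number of syscalls.
        $random = random_bytes($length);
        for ($i = 0; $i < $length && strlen($result) < $length; $i++) {
            $index = ord($random[$i]) & $mask;
            // Reject indexes outside the alphabet to avoid modulo bias.
            if ($index < $size) {
                $result .= $alphabet[$index];
            }
        }
    }

    return $result;
}
```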

For the float side of the RFC, as there are technical limitations on
which platforms it would be usable on, there needs to be a way of
determining whether the nextFloat and getFloat methods are going to
work. The way this is done on Imagick is to put appropriate defines in
the stub file and in the C code implementations so that the methods
aren't available on the class for the platforms where it isn't going
to function correctly.

I made a PR for that to Tim's repo, though I don't know of an
environment where it can be tested.

I've seen the PR and we briefly discussed this in chat. I would defer this to the code review, because this amounts to an implementation detail, especially since all reasonable server platforms use IEEE 754.
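
For reference, if the methods were left out of the class on unsupported platforms as suggested, userland code could probe for them roughly like this. This is only a sketch: the helper name is made up, and whether 'getFloat' and 'nextFloat' exist depends on the outcome of this RFC and on the PHP build.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: true if Random\Randomizer provides the given method.
// Returns false as well when the class itself does not exist.
function randomizerHasMethod(string $method): bool
{
    return method_exists('Random\Randomizer', $method);
}

$floatSupport = randomizerHasMethod('getFloat') && randomizerHasMethod('nextFloat');
```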

Best regards
Tim Düsterhus

<<attachment: alphabet-benchmark.php>>

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php