Hi
On 2/18/22 07:31, Go Kudo wrote:
> I have been looking into output buffering, but don't know the right way to
> do it. The buffering works fine if all RNG generation widths are static,
> but it gets complicated if they are dynamic.
I believe the primary issue here is that the engines are expected to
return a uint64_t instead of a buffer with raw bytes. This requires
you to perform many conversions between the uint64 and the raw buffer:
When calling Randomizer::getBytes() for a custom engine the following
needs to happen:
- The Engine returns a byte string.
- This byte string is then internally converted into a uint64_t.
- Then, inside Randomizer::getBytes(), this uint64_t needs to be
converted back to a byte string.
To avoid those conversions without sacrificing too much performance it
might be possible to return a struct that contains a single 4 or 8-byte
array:
    struct four_bytes {
        unsigned char val[4];
    };

    struct four_bytes r;
    r.val[0] = (result >> 0) & 0xff;
    r.val[1] = (result >> 8) & 0xff;
    r.val[2] = (result >> 16) & 0xff;
    r.val[3] = (result >> 24) & 0xff;
    return r;
.val can be treated as a bytestring, but it does not require dynamic
allocation. By doing that the internal engines (e.g. Xoshiro) would be
consistent with the userland engines.
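
To make the idea a bit more concrete, here is a rough sketch of how the 8-byte variant might look. The names engine_bytes and engine_bytes_from_u64 and the extra len field are made up for illustration and are not part of the RFC or the current implementation. The sketch also assumes little-endian byte order, matching the shifts above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size result type: raw bytes plus how many are valid. */
struct engine_bytes {
    unsigned char val[8];
    size_t len; /* e.g. 4 for a 32-bit engine, 8 for a 64-bit engine */
};

/* Store the low `len` bytes of `result` in little-endian order. */
static struct engine_bytes engine_bytes_from_u64(uint64_t result, size_t len)
{
    struct engine_bytes r = { {0}, len };

    assert(len <= sizeof(r.val));
    for (size_t i = 0; i < len; i++) {
        r.val[i] = (unsigned char)((result >> (8 * i)) & 0xff);
    }
    return r;
}
```

The struct is returned by value, so no dynamic allocation is involved, and a consumer such as getBytes() could copy r.val[0..len) directly into its output.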
> It is possible to solve this problem by allowing generate() itself to
> specify the size it wants, but this would significantly slow down
> performance.
I don't think it's a good idea to add a size parameter to generate().
> I've looked at the sample code, but do you really need support for
> Randomizer? Engine::generate() can output dynamic binaries up to 64 bits.
> You can use Engine directly, instead of Randomizer::getBytes().
> What exactly is the situation where buffering by Randomizer is needed?
*I* don't need anything. I'm just trying to think of use-cases and
edge-cases. Basically: What would a user attempt to do and what would
their expectations be?
I'm not saying that this buffering *must* be implemented, but this is
something we need to think about. Because changing the behavior later is
pretty much impossible, as users might rely on a specific behavior for
their seeded sequences. The behavior might also need to be part of the
documentation.
Basically what we need to think about is what guarantees we give. As an
example:
1. Calling Engine::generate() with the same seed results in the same
sequence (This guarantee we give, and it is useful).
2. Calling Randomizer::getInt() with the same seeded engine results in
the same numbers for the same parameters (I think this also is useful).
3. Calling Randomizer::getBytes() with the same seeded engine results in
the same byte sequence (This is something we are currently discussing).
4. Calling Randomizer::getBytes() simply concatenates the raw bytes
retrieved by the Engine (This ties into (3)).
5. Calling Randomizer::shuffleArray() with the same seeded engine
results in the same result for the same array (This one is more
debatable, because then we must maintain the exact same shuffleArray()
implementation forever).
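
To illustrate why (3) and (4) depend on the buffering question, here is a small self-contained sketch. Everything in it is hypothetical: toy_engine is a plain xorshift32 standing in for a 32-bit engine, and toy_randomizer is not the RFC's Randomizer. It only shows that reading 2 + 2 bytes matches reading 4 bytes at once if leftover engine bytes are kept between calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy 32-bit engine (xorshift32); a stand-in, not one of the RFC's engines. */
struct toy_engine {
    uint32_t state; /* must be non-zero */
};

static uint32_t toy_generate(struct toy_engine *e)
{
    uint32_t x = e->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    e->state = x;
    return x;
}

/* Hypothetical randomizer that can keep unused engine bytes between calls. */
struct toy_randomizer {
    struct toy_engine engine;
    unsigned char buf[4];
    size_t buf_pos; /* index of the next unused byte; sizeof(buf) means empty */
};

static void toy_randomizer_init(struct toy_randomizer *r, uint32_t seed)
{
    assert(seed != 0);
    r->engine.state = seed;
    r->buf_pos = sizeof(r->buf);
}

/* Buffered: the output is the plain concatenation of the engine's bytes. */
static void get_bytes_buffered(struct toy_randomizer *r, unsigned char *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (r->buf_pos == sizeof(r->buf)) {
            uint32_t v = toy_generate(&r->engine);
            for (size_t j = 0; j < sizeof(r->buf); j++) {
                r->buf[j] = (unsigned char)((v >> (8 * j)) & 0xff);
            }
            r->buf_pos = 0;
        }
        out[i] = r->buf[r->buf_pos++];
    }
}

/* Unbuffered: leftover bytes of the last engine output are thrown away. */
static void get_bytes_unbuffered(struct toy_randomizer *r, unsigned char *out, size_t n)
{
    r->buf_pos = sizeof(r->buf);
    get_bytes_buffered(r, out, n);
    r->buf_pos = sizeof(r->buf);
}

typedef void (*get_bytes_fn)(struct toy_randomizer *, unsigned char *, size_t);

/* Returns 1 if reading 4 bytes at once equals reading 2 + 2 bytes. */
static int split_reads_match(get_bytes_fn get_bytes, uint32_t seed)
{
    struct toy_randomizer a, b;
    unsigned char whole[4], split[4];

    toy_randomizer_init(&a, seed);
    toy_randomizer_init(&b, seed);
    get_bytes(&a, whole, 4);
    get_bytes(&b, split, 2);
    get_bytes(&b, split + 2, 2);
    return memcmp(whole, split, sizeof(whole)) == 0;
}
```

With seed 1, split_reads_match(get_bytes_buffered, 1) returns 1 and split_reads_match(get_bytes_unbuffered, 1) returns 0. Both variants are still deterministic for a given seed and call pattern, so (3) can hold either way, but only the buffered one gives (4).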
All these guarantees should be properly documented within the RFC. The
RFC template (https://wiki.php.net/rfc/template) says:
> Remember that the RFC contents should be easily reusable in the PHP
> Documentation.
So by thinking about this now and putting it in the RFC, the
explanations can easily be copied into the documentation if the RFC
passes the vote.
One should not need to look into the implementation to understand how
the Engines and the Randomizer are supposed to work.
> I am also worried that buffering will cut off random numbers at
> arbitrary sizes. It may cause bias in the generated results.
If there's bias in specific bits or bytes of the generated number then
getBytes(32) will already be biased even without buffering, as the raw
bytes are what's of interest here. It does not matter if they are at the
1st or 4th position (for a 32-bit engine).
Best regards
Tim Düsterhus
--
PHP Internals - PHP Runtime Development Mailing List