Fabien COELHO wrote on 2021-07-06 23:49:
> Hello Yura,
>
>>> However, I'm not enthusiastic about combining two methods depending
>>> on the range; the function looks complex enough without that, so I
>>> would suggest not taking this option. Also, the decision process adds
>>> to the average cost, which is undesirable.
>>
>> Given that 99.99% of cases will take the likely branch, the branch
>> predictor should eliminate the decision cost.
>
> Hmmm. ISTM that a branch predictor should predict that unknown < small
> should probably be false, so a hint should be given that it is really
> true.
Why? A branch predictor is history based: if the branch was usually
taken here before, it will be predicted taken this time as well.
`unknown < small` is usually true here, therefore the branch predictor
will assume it is true. I put `likely` there for the compiler: the
compiler then lays out the `likely` path as the fall-through one.

And as Dean Rasheed correctly mentioned, the mask method has a really
bad pattern for the branch predictor if the range is not just below or
equal to a power of two (for a range just above 2^k the mask has to
cover k+1 bits, so almost half of the draws are rejected and the loop
branch is essentially a coin flip).
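To illustrate, here is a rough standalone demo (not part of the patch;
xorshift64 is only a stand-in generator) of how often the mask method
rejects when the range is just above a power of two:

#include <stdint.h>
#include <stdio.h>

/* Marsaglia's xorshift64, only a stand-in generator for this demo */
static uint64_t demo_state = 88172645463325252ULL;

static uint64_t
xorshift64(void)
{
	demo_state ^= demo_state << 13;
	demo_state ^= demo_state >> 7;
	demo_state ^= demo_state << 17;
	return demo_state;
}

int
main(void)
{
	uint64_t	range = (UINT64_C(1) << 20) + 1;	/* just above 2^20 */
	uint64_t	mask = (UINT64_C(1) << 21) - 1;		/* smallest covering mask */
	uint64_t	rejected = 0;
	uint64_t	trials = 1000000;

	for (uint64_t i = 0; i < trials; i++)
		if ((xorshift64() & mask) >= range)
			rejected++;

	/* prints roughly 50%: the rejection branch is essentially a coin flip */
	printf("rejected %.1f%% of draws\n",
		   100.0 * (double) rejected / (double) trials);
	return 0;
}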
> On average the bitmask is the better unbiased method, if the online
> figures are to be trusted. Also, as already said, I do not really want
> to add code complexity, especially to get lower average performance,
> and especially with code like "threshold = - range % range", where
> both variables are unsigned, I have a headache just looking at it :-)
If you mean https://www.pcg-random.org/posts/bounded-rands.html, then:

1. The first graphs there are not made with Lemire's exact code. The
   latest code from
   https://lemire.me/blog/2016/06/30/fast-random-shuffling/
   (which I derived mine from) performs the modulo operation only if
   `(leftover < range)`. Even with `rand_range(1000000)` that happens
   only about once in four thousand calls (see the sketch after this
   list).
2. You can see the "Small-Constant Benchmark" on that page: Debiased
   Int is 1.5 times faster. And even in the "Small-Shuffle" benchmark
   their unoptimized version is on par with the mask method.
3. If you go to the "Making Improvements / Faster Threshold-Based
   Discarding" section, you'll see the code my version matches. It is
   twice as fast as the mask method in the Small-Shuffle benchmark, and
   just a bit slower in Large-Shuffle.
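For reference, a minimal sketch of that debiased multiply method (the
names and the rng64 callback are mine, not from the patch; it relies on
the GCC/Clang 128-bit integer extension and assumes range > 0):

#include <stdint.h>

typedef unsigned __int128 uint128_t;	/* GCC/Clang extension */

/*
 * Debiased multiply (after Lemire's fast-random-shuffling code): the
 * high 64 bits of val * range fall into [0, range); rejection via the
 * "-range % range" threshold (== 2^64 mod range, thanks to unsigned
 * wraparound) is computed only on the rare "leftover < range" path.
 */
static uint64_t
bounded_rand_mult(uint64_t (*rng64) (void), uint64_t range)
{
	uint64_t	val = rng64();
	uint128_t	prod = (uint128_t) val * range;
	uint64_t	leftover = (uint64_t) prod;	/* low 64 bits */

	if (leftover < range)
	{
		uint64_t	threshold = -range % range;

		while (leftover < threshold)
		{
			val = rng64();
			prod = (uint128_t) val * range;
			leftover = (uint64_t) prod;
		}
	}
	return (uint64_t) (prod >> 64);		/* unbiased value in [0, range) */
}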
>> And __builtin_clzl is not a free lunch either: it has a latency of
>> 3-4 cycles on modern processors.
>
> Well, % is not cheap either.

Well, it could probably run in parallel with some part of xoroshiro,
but that depends on how the compiler optimizes this function.
>>> I would certainly select the unbiased multiply method if we want a
>>> u32 range variant.
>>
>> There could be two functions.
>
> Yep, but do we need them? Who is likely to want 32-bit pseudo-random
> ints in a range? pgbench needs 64 bits.
>
> So I'm still inclined to just keep the faster-on-average bitmask
> method, despite that it may be slower for some ranges. The average
> cost for the worst case in PRNG calls is, if I'm not mistaken:
>
>   1 * 0.5 + 2 * 0.25 + 3 * 0.125 + ... ~ 2
>
> which does not look too bad.
You didn't count the cost of branch misprediction:

https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array
https://lemire.me/blog/2019/10/15/mispredicted-branches-can-multiply-your-running-times/

Therefore the calculation should be at least:

  1 * 0.5 + 0.5 * (3 + 0.5 * (3 + ...)) = 6

By the way, we have a 64-bit random value. If we use 44 bits of it for
range <= (1<<20), then the bias will be less than 1/(2**24): the bias
of the fixed-point multiply is at most range/2^44, and 2^20/2^44 =
1/(2**24). Could we just ignore it (given that it is not a
cryptographically strong random)?
uint64
pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
	uint64		val = xoroshiro128ss(state);
	uint64		m;

	/* handle all power-of-two cases (and range == 0) */
	if ((range & (range - 1)) == 0)
		return range != 0 ? val & (range - 1) : 0;

	if (likely(range < (1 << 20)))
	{
		/*
		 * While the multiply method is biased, the bias is smaller than
		 * 1/(1<<24) for such small ranges.  Let's ignore it.
		 */
		return ((val >> 20) * range) >> 44;
	}

	/* Apple's bitmask method with rejection */
	m = mask_u64(range - 1);
	val &= m;
	while (val >= range)
		val = xoroshiro128ss(state) & m;
	return val;
}
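mask_u64() above is not spelled out; a minimal sketch of what I have in
mind, using the GCC/Clang builtin __builtin_clzll (undefined for a zero
argument, but here the caller only passes range - 1 for a
non-power-of-two range, so it is never zero):

/* smallest all-ones mask covering u, i.e. 2^(floor(log2(u)) + 1) - 1 */
static inline uint64
mask_u64(uint64 u)
{
	return (~(uint64) 0) >> __builtin_clzll(u);
}

A portable build would need a fallback for compilers without the
builtin; PostgreSQL's pg_leftmost_one_pos64() already wraps this kind
of thing.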
Anyway, excuse me for heating up this discussion over such a
non-essential issue. I'll try to control myself and not take it
further.
regards,
Sokolov Yura.