Issue with the PRNG used by Postgres

2024-04-08 Thread Parag Paul
Hi all,
 We have an interesting problem where PG went to PANIC due to a stuck
spinlock. After careful analysis and hours of trying to reproduce this
(it showed up in production only after almost two weeks of a stress run),
I did some statistical analysis on the RNG that PG uses to generate the
backoff for the spinlocks.

 My expectation is that this RNG will be fair to all backends.
After studying the underlying generator that PG uses, "xoroshiro", which
is from the LFSR family, it appears that it has failed a number of
statistical tests, as discussed in various blog posts.
The one that caught my attention was:
https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
It describes the "zeroland" problem and the big belt of low-Hamming-weight
states: when the state reaches a low Hamming weight (mostly 0 bits with
few 1 bits), it takes a while to get back to normal entropy.

Here is the code from s_lock.c that shows how we add the backoff delay
before TAS() is tried again.
There is also, in all likelihood, a bug here: the author wanted to round
off the value coming out of the expression by adding 0.5, but integer
casting in C/C++ drops the decimal part ((int)12.9 == 12).
So if the RNG keeps producing values close to 0, the computed increment
keeps casting to 0, the delay never increases, and we end up spinning
1000 times within a single second. That is one of the reasons we could
reproduce this by forcing low RNG values in our test cases.
So, the steps I would recommend:
1. First, use ceiling for the round-off.
2. cur_delay should increase by max(1000, RNG-generated value).
3. Change the hashing function.
4. Detect the near-zero state, and then try to reseed.

/* increase delay by a random fraction between 1X and 2X */
status->cur_delay += (int) (status->cur_delay *
                            pg_prng_double(&pg_global_prng_state) +
                            0.5);


Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul
Hi Andres,
This is a little more complex than that. The spinlocks are taken in the
LWLock (mutex) code when the lock is not available right away.
The spinlock is taken to attach the current backend to the wait list of the
LWLock, which means this cannot be controlled from the caller's side.
When the repro does happen, it affects any mutex or LWLock code path, since
the low-Hamming-weight state can remove fairness from the system.

Also, I believe the rounding-off error still remains within the RNG. I will
send a patch today.

Thanks for the response.
-Parag

On Tue, Apr 9, 2024 at 2:05 PM Andres Freund  wrote:

> Hi,
>
> On 2024-04-08 22:52:09 -0700, Parag Paul wrote:
> >  We have an interesting problem, where PG went to PANIC due to stuck
> > spinlock case.
> > On careful analysis and hours of trying to reproduce this(something that
> > showed up in production after almost 2 weeks of stress run), I did some
> > statistical analysis on the RNG generator that PG uses to create the
> > backoff for the spin locks.
>
> ISTM that the fix here is to not use a spinlock for whatever the
> contention is
> on, rather than improve the RNG.
>
> Greetings,
>
> Andres Freund
>


Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul
Thank you Robert.
I am in the process of patching this.
-Parag

On Wed, Apr 10, 2024 at 7:43 AM Robert Haas  wrote:

> On Tue, Apr 9, 2024 at 5:05 PM Andres Freund  wrote:
> > ISTM that the fix here is to not use a spinlock for whatever the
> contention is
> > on, rather than improve the RNG.
>
> I'm not convinced that we should try to improve the RNG, but surely we
> need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> 0.5. IIUC, the current logic is making us multiply the spin delay by a
> value between 0 and 1 when what was intended was that it should be
> multiplied by a value between 0.5 and 1.5.
>
> If I'm reading this correctly, this was introduced here:
>
> commit 59bb147353ba274e0836d06f429176d4be47452c
> Author: Bruce Momjian 
> Date:   Fri Feb 3 12:45:47 2006 +
>
> Update random() usage so ranges are inclusive/exclusive as required.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul
hi Tom,
 First of all, thanks for your response. I did not misread it. The 0.5 is
added to the result of the multiplication, which then goes through C
integer casting; the cast does not round off, it just drops the decimal
portion.


status->cur_delay += (int) (status->cur_delay *
                            pg_prng_double(&pg_global_prng_state) +
                            0.5);


So, if the RNG generated 0.0004 and cur_delay = 1000:
1000 + (int)(1000 * 0.0004 + 0.5) = 1000 + (int)0.9 = 1000
<--  back to the same value
and there is no change after that while the RNG stays in the
low-Hamming-weight state. If an avalanche happens soon, it will correct
itself, but in the meantime we have a stuck-spinlock PANIC that could
bring down a production server.
-Parag

On Wed, Apr 10, 2024 at 8:09 AM Tom Lane  wrote:

> Robert Haas  writes:
> > I'm not convinced that we should try to improve the RNG, but surely we
> > need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> > 0.5. IIUC, the current logic is making us multiply the spin delay by a
> > value between 0 and 1 when what was intended was that it should be
> > multiplied by a value between 0.5 and 1.5.
>
> No, I think you are misreading it, because the assignment is += not =.
> The present coding is
>
> /* increase delay by a random fraction between 1X and 2X */
> status->cur_delay += (int) (status->cur_delay *
> pg_prng_double(&pg_global_prng_state)
> + 0.5);
>
> which looks fine to me.  The +0.5 is so that the conversion to integer
> rounds rather than truncating.
>
> In any case, I concur with Andres: if this behavior is anywhere near
> critical then the right fix is to not be using spinlocks.
>
> regards, tom lane
>


Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul
Hi Robert,
We are using xoroshiro128 and have not moved to the newer state of the art.
We did see a lot of low values, as shown in my last message.
-Parag

On Wed, Apr 10, 2024 at 9:37 AM Robert Haas  wrote:

> On Wed, Apr 10, 2024 at 12:02 PM Tom Lane  wrote:
> > As I said to Parag, I see exactly no reason to believe that that's a
> > problem, unless it happens *a lot*, like hundreds of times in a row.
> > If it does, that's an RNG problem not s_lock's fault.  Now, I'm not
> > going to say that xoroshiro can't potentially do that, because I've
> > not studied it.  But if it is likely to do that, I'd think we'd
> > have noticed the nonrandomness in other ways.
>
> The blog post to which Parag linked includes this histogram as an
> example of a low-Hamming-weight situation:
>
> Reps | Value
> -+
>   84 | 0x00
>   10 | 0x2d
>6 | 0xa0
>6 | 0x68
>6 | 0x40
>6 | 0x02
>5 | 0x0b
>4 | 0x05
>4 | 0x01
>3 | 0xe1
>3 | 0x5a
>3 | 0x41
>3 | 0x20
>3 | 0x16
>
> That's certainly a lot more zeroes than you'd like to see in 192 bytes
> of output, but it's not hundreds in a row either.
>
> Also, the blog post says "On the one hand, from a practical
> perspective, having vastly, vastly more close repeats than it should
> isn't likely to be an issue that users will ever detect in practice.
> Xoshiro256's large state space means it is too big to fail any
> statistical tests that look for close repeats." So your theory that we
> have a bug elsewhere in the code seems like it might be the real
> answer.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul
Yes, the probability of this happening is astronomically low, but in
production on 128-core servers with 7000 max_connections and petabyte-scale
data, this did repro twice in the last month. We had to move to a local
approach to manage our rate-limiting counters.
This is not easily reproducible. I feel that we should at least shield
ourselves with the following change, so that we increase the delay by at
least 1000us every time. That gives a linear backoff, which is better than
no backoff.
  status->cur_delay += Max(1000, (int) (status->cur_delay *
                                        pg_prng_double(&pg_global_prng_state) +
                                        0.5));

On Wed, Apr 10, 2024 at 9:43 AM Robert Haas  wrote:

> On Wed, Apr 10, 2024 at 12:40 PM Parag Paul  wrote:
> > The reason why this could be a problem is a flaw in the RNG with the
> enlarged Hamming belt.
> > I attached an image here, with the RNG outputs from 2 backends. I ran
> our code for weeks, and collected the
> > values generated by the RNG over many backends. The one in Green (say
> backend id 600), stopped flapping values and
> > only produced low (near 0 ) values for half an hour, whereas the
> Blue(say backend 700), kept generating good values and had
> > a range between [0-1)
> > During this period, the backed 600 suffered and ended up with spinlock
> stuck condition.
>
> This is a very vague description of a test procedure. If you provide a
> reproducible series of steps that causes a stuck spinlock, I imagine
> everyone will be on board with doing something about it. But this
> graph is not going to convince anyone of anything.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>