Issue with the PRNG used by Postgres
Hi all,

We have an interesting problem where PG went to PANIC due to a stuck-spinlock case. After careful analysis and hours of trying to reproduce this (something that showed up in production only after almost two weeks of stress run), I did some statistical analysis on the RNG that PG uses to compute the backoff for spinlocks. My expectation is that this RNG should be fair to all backends.

The underlying generator PG uses, "xoroshiro", is from the LFSR family, and it appears to have failed a number of statistical tests, as discussed in various blog posts. The one that caught my attention was:

https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html

It describes the "zeroland" problem and the long stretch of poor output when the state has low Hamming weight: once the internal state reaches a low-Hamming-weight state (mostly 0 bits, very few 1 bits), it takes a while for the output to return to normal entropy.

Here is the code from s_lock.c that adds the backoff delay before TAS is retried. There is also, in all likelihood, a bug here: the author wanted to round the computed value by adding 0.5, but integer casting in C/C++ simply drops the fractional part (e.g. (int) 12.9 == 12). If the RNG keeps producing values close to 0, the delay never increases (the increment keeps truncating to 0), and we end up spinning through the 1000 delays within a single second. That is one of the reasons we could reproduce this by forcing low RNG values in our test cases.

Steps that I would recommend:
1. Use a ceiling for the round-off.
2. Increase cur_delay by max(1000, the RNG-generated increment) (sketch below).
3. Change the underlying generator.
4. Detect the near-zero state, and then try to reseed.

    /* increase delay by a random fraction between 1X and 2X */
    status->cur_delay += (int) (status->cur_delay *
                                pg_prng_double(&pg_global_prng_state) + 0.5);
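Roughly, suggestions 1 and 2 would look something like the following. This is an untested sketch for discussion rather than a patch; Max() is the existing macro from c.h, and ceil() would need math.h:

    /* sketch: round up, and never let the increment fall below 1000us */
    status->cur_delay += Max(1000,
                             (int) ceil(status->cur_delay *
                                        pg_prng_double(&pg_global_prng_state)));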
Re: Issue with the PRNG used by Postgres
Hi Andres,

This is a little more complex than that. The spinlock is taken in the LWLock (mutex) code when the lock is not available right away: it protects attaching the current backend to the LWLock's wait list. That means this cannot simply be avoided at the call site. When the repro happens, it affects any mutex or LWLock code path, since a low-Hamming-weight state can remove fairness from the system. Also, I believe the rounding-off error in the delay calculation still remains. I will send a patch today.

Thanks for the response.
-Parag

On Tue, Apr 9, 2024 at 2:05 PM Andres Freund wrote:
> Hi,
>
> On 2024-04-08 22:52:09 -0700, Parag Paul wrote:
> > We have an interesting problem, where PG went to PANIC due to stuck
> > spinlock case.
> > On careful analysis and hours of trying to reproduce this(something that
> > showed up in production after almost 2 weeks of stress run), I did some
> > statistical analysis on the RNG generator that PG uses to create the
> > backoff for the spin locks.
>
> ISTM that the fix here is to not use a spinlock for whatever the contention is
> on, rather than improve the RNG.
>
> Greetings,
>
> Andres Freund
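To illustrate the pattern I mean, here is a simplified, self-contained sketch of the idea, not the actual lwlock.c code (the names and structures are stand-ins): the wait list is protected by a test-and-set style lock, so every waiter has to go through the same PRNG-driven backoff just to get on the list.

    #include <stdatomic.h>

    /* hypothetical stand-ins for illustration only */
    typedef struct Waiter
    {
        struct Waiter *next;
    } Waiter;

    typedef struct WaitListLock
    {
        atomic_flag mutex;      /* spinlock guarding the wait list */
        Waiter     *head;       /* head of the wait queue */
    } WaitListLock;

    /* stand-in for the PRNG-driven backoff in s_lock.c */
    static void
    spin_delay_sketch(int *cur_delay)
    {
        /* in Postgres this would sleep for cur_delay and then grow it
         * by a PRNG-chosen fraction; omitted here */
        (void) cur_delay;
    }

    /*
     * Attach ourselves to the wait list.  Every waiter must win this
     * spinlock first, so if the backoff degenerates, every LWLock code
     * path suffers, not just one caller.
     */
    static void
    queue_self_sketch(WaitListLock *lock, Waiter *me)
    {
        int     cur_delay = 0;

        while (atomic_flag_test_and_set(&lock->mutex))  /* TAS */
            spin_delay_sketch(&cur_delay);              /* backoff */

        me->next = lock->head;      /* push onto wait list */
        lock->head = me;

        atomic_flag_clear(&lock->mutex);    /* release spinlock */
    }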
Re: Issue with the PRNG used by Postgres
Thank you Robert. I am in the process of patching this.
-Parag

On Wed, Apr 10, 2024 at 7:43 AM Robert Haas wrote:
> On Tue, Apr 9, 2024 at 5:05 PM Andres Freund wrote:
> > ISTM that the fix here is to not use a spinlock for whatever the contention is
> > on, rather than improve the RNG.
>
> I'm not convinced that we should try to improve the RNG, but surely we
> need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> 0.5. IIUC, the current logic is making us multiply the spin delay by a
> value between 0 and 1 when what was intended was that it should be
> multiplied by a value between 0.5 and 1.5.
>
> If I'm reading this correctly, this was introduced here:
>
> commit 59bb147353ba274e0836d06f429176d4be47452c
> Author: Bruce Momjian
> Date:   Fri Feb 3 12:45:47 2006 +
>
>     Update random() usage so ranges are inclusive/exclusive as required.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
Re: Issue with the PRNG used by Postgres
Hi Tom,

First of all, thanks for your response. I did not misread it. The 0.5 is added to the result of the multiplication, which is then cast to an integer; the cast does not round, it just drops the fractional portion.

    status->cur_delay += (int) (status->cur_delay *
                                pg_prng_double(&pg_global_prng_state) + 0.5);

So, if the RNG generates 0.0001 and cur_delay = 1000, the increment is (int) (1000 * 0.0001 + 0.5) = (int) (0.1 + 0.5) = (int) 0.6 = 0 <-- cur_delay stays at 1000, and there is no further change while the RNG remains in the low-Hamming-weight state. If the avalanche happens soon, it will correct itself, but in the meantime we have a stuck-spinlock PANIC that could bring down a production server.

-Parag

On Wed, Apr 10, 2024 at 8:09 AM Tom Lane wrote:
> Robert Haas writes:
> > I'm not convinced that we should try to improve the RNG, but surely we
> > need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> > 0.5. IIUC, the current logic is making us multiply the spin delay by a
> > value between 0 and 1 when what was intended was that it should be
> > multiplied by a value between 0.5 and 1.5.
>
> No, I think you are misreading it, because the assignment is += not =.
> The present coding is
>
>     /* increase delay by a random fraction between 1X and 2X */
>     status->cur_delay += (int) (status->cur_delay *
>                                 pg_prng_double(&pg_global_prng_state)
>                                 + 0.5);
>
> which looks fine to me. The +0.5 is so that the conversion to integer
> rounds rather than truncating.
>
> In any case, I concur with Andres: if this behavior is anywhere near
> critical then the right fix is to not be using spinlocks.
>
> regards, tom lane
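To make the arithmetic concrete, here is a stand-alone sketch, with the pg_prng_double() call replaced by fixed sample values since the point is only the casting behaviour:

    #include <stdio.h>

    int
    main(void)
    {
        /* stand-ins for pg_prng_double() output; not from a real PRNG run */
        double  samples[] = {0.75, 0.01, 0.001, 0.0004, 0.0001};
        int     i;

        for (i = 0; i < 5; i++)
        {
            int cur_delay = 1000;   /* example starting delay, in us */
            int increment = (int) (cur_delay * samples[i] + 0.5);

            printf("rng=%.4f  increment=%d  new cur_delay=%d\n",
                   samples[i], increment, cur_delay + increment);
        }
        return 0;
    }

For the larger draws the +0.5 does round the way Tom describes (0.01 gives an increment of 10, 0.001 gives 1), but any draw below 0.0005 at cur_delay = 1000 still truncates to a zero increment and leaves the delay unchanged.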
Re: Issue with the PRNG used by Postgres
Hi Robert,

We are using xoroshiro128 and have not moved to the newer state of the art. We did see a lot of low values, as shown in my last message.
-Parag

On Wed, Apr 10, 2024 at 9:37 AM Robert Haas wrote:
> On Wed, Apr 10, 2024 at 12:02 PM Tom Lane wrote:
> > As I said to Parag, I see exactly no reason to believe that that's a
> > problem, unless it happens *a lot*, like hundreds of times in a row.
> > If it does, that's an RNG problem not s_lock's fault. Now, I'm not
> > going to say that xoroshiro can't potentially do that, because I've
> > not studied it. But if it is likely to do that, I'd think we'd
> > have noticed the nonrandomness in other ways.
>
> The blog post to which Parag linked includes this histogram as an
> example of a low-Hamming-weight situation:
>
> Reps | Value
> -----+------
>   84 | 0x00
>   10 | 0x2d
>    6 | 0xa0
>    6 | 0x68
>    6 | 0x40
>    6 | 0x02
>    5 | 0x0b
>    4 | 0x05
>    4 | 0x01
>    3 | 0xe1
>    3 | 0x5a
>    3 | 0x41
>    3 | 0x20
>    3 | 0x16
>
> That's certainly a lot more zeroes than you'd like to see in 192 bytes
> of output, but it's not hundreds in a row either.
>
> Also, the blog post says "On the one hand, from a practical
> perspective, having vastly, vastly more close repeats than it should
> isn't likely to be an issue that users will ever detect in practice.
> Xoshiro256's large state space means it is too big to fail any
> statistical tests that look for close repeats." So your theory that we
> have a bug elsewhere in the code seems like it might be the real
> answer.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
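For reference, the step function of xoroshiro128** as published by Blackman and Vigna looks roughly like this (I am quoting the reference algorithm, not pg_prng.c verbatim). The all-zero state is a fixed point, and states with very few 1 bits take a while to mix back into normal-looking output:

    #include <stdint.h>

    static uint64_t xs_state[2];    /* must never be seeded to all zeroes */

    static inline uint64_t
    rotl64(uint64_t x, int k)
    {
        return (x << k) | (x >> (64 - k));
    }

    /* xoroshiro128** 1.0 step, per the Blackman/Vigna reference code */
    static uint64_t
    xoroshiro128ss_next(void)
    {
        uint64_t    s0 = xs_state[0];
        uint64_t    s1 = xs_state[1];
        uint64_t    result = rotl64(s0 * 5, 7) * 9;

        s1 ^= s0;
        xs_state[0] = rotl64(s0, 24) ^ s1 ^ (s1 << 16);
        xs_state[1] = rotl64(s1, 37);

        return result;
    }

A nearly-all-zero state is still a legal state, and that neighbourhood is exactly the "zeroland" the blog post describes.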
Re: Issue with the PRNG used by Postgres
Yes, the probability of this happening is astronomically small, but in production on 128-core servers with 7000 max_connections and petabyte-scale data, this did reproduce twice in the last month. We had to move to a local approach to manage our rate-limiting counters. It is not easily reproducible. I feel we should at least shield ourselves with the following change, so that the delay grows by at least 1000us each iteration. That degrades to a linear backoff in the worst case, but it is better than no backoff at all.

    status->cur_delay += max(1000, (int) (status->cur_delay *
                                          pg_prng_double(&pg_global_prng_state) + 0.5));

On Wed, Apr 10, 2024 at 9:43 AM Robert Haas wrote:
> On Wed, Apr 10, 2024 at 12:40 PM Parag Paul wrote:
> > The reason why this could be a problem is a flaw in the RNG with the
> > enlarged Hamming belt.
> > I attached an image here, with the RNG outputs from 2 backends. I ran
> > our code for weeks, and collected the values generated by the RNG over
> > many backends. The one in green (say backend id 600) stopped flapping
> > values and only produced low (near 0) values for half an hour, whereas
> > the blue one (say backend 700) kept generating good values across the
> > full [0, 1) range.
> > During this period, backend 600 suffered and ended up with the stuck
> > spinlock condition.
>
> This is a very vague description of a test procedure. If you provide a
> reproducible series of steps that causes a stuck spinlock, I imagine
> everyone will be on board with doing something about it. But this
> graph is not going to convince anyone of anything.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
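As a quick self-contained illustration of the difference (the degenerate RNG is faked here with a constant tiny value; real draws would of course vary):

    #include <stdio.h>

    #define MIN_INCREMENT 1000      /* proposed floor, in microseconds */

    int
    main(void)
    {
        double  rng = 0.0001;       /* stand-in for a stuck, near-zero draw */
        int     current = 1000;     /* today's behaviour */
        int     proposed = 1000;    /* with the 1000us floor */
        int     i;

        for (i = 1; i <= 5; i++)
        {
            int inc_cur = (int) (current * rng + 0.5);
            int inc_prop = (int) (proposed * rng + 0.5);

            if (inc_prop < MIN_INCREMENT)
                inc_prop = MIN_INCREMENT;

            current += inc_cur;
            proposed += inc_prop;
            printf("iteration %d: current=%dus  proposed=%dus\n",
                   i, current, proposed);
        }
        return 0;
    }

With a stuck near-zero draw, the current coding never moves off 1000us, while the floored version still backs off linearly (2000, 3000, 4000, ... us).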