Hi,

On 2022-05-20 01:25:10 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2022-05-20 00:22:14 -0400, Tom Lane wrote:
> >> There's some fallout in the expected-file, of course, but this
> >> does seem to fix it (20 consecutive successful runs now at
> >> 100/2).  Don't see why though ...
>
> > I think what might be happening is that the transactional stats updates get
> > reported by s2 *before* the non-transactional stats updates come in from
> > s1. I.e. the pgstat_report_stat() at the end of s2_commit_prepared_a does a
> > report, because the machine is slow enough for it to be "time to report
> > stats again". Then s1 reports its non-transactional stats.
>
> Sounds plausible.  And I left the test loop running, and it's now past
> 100 consecutive successes, so I think this change definitely "fixes" it.

FWIW, the problem can be reliably reproduced by sticking a
pgstat_force_next_flush() into pgstat_twophase_postcommit(). This is the only
failure when doing so.

> > It looks like our stats maintenance around truncation isn't quite
> > "concurrency safe". That code hasn't meaningfully changed, but it'd not
> > be surprising if it's not 100% precise...
>
> Yeah.  Probably not something to try to improve post-beta, especially
> since it's not completely clear how transactional and non-transactional
> cases *should* interact.

Yea. It's also not normally particularly crucial to be accurate down to that
degree.

> Maybe non-transactional updates should be
> pushed immediately?  But I'm not sure if that's fully correct, and
> it definitely sounds expensive.

I think that'd be far too expensive - the majority of stats are
non-transactional...

I think what we could do is to model truncates as subtracting the number of
live/dead rows the truncating backend knows about, rather than setting them
to 0. But that of course could incur other inaccuracies.

> I'd be good with tweaking this test case as you suggest, and maybe
> revisiting the topic later.

Pushed the change to the test. Christoph, just to make sure, can you confirm
that this fixes the test instability for you?

> Kyotaro-san worried about whether any other places in stats.spec
> have the same issue.  I've not seen any evidence of that in my
> tests, but perhaps some other machine with different timing
> could find it.

I tried to find some by putting in forced flushes in a bunch of places before,
and now some more, without finding further cases.

Greetings,

Andres Freund
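
PS: To make the truncate idea above a bit more concrete, here is a toy model
in Python. It is only a sketch under assumptions: SharedStats, Report,
apply_reset, apply_subtract and replay are made-up names, none of this is the
actual pgstat code, and the scenario (s1 reports three inserted rows, s2
reports a truncate that knew about exactly those three rows) is a deliberate
simplification rather than what stats.spec does.

```python
from dataclasses import dataclass


@dataclass
class SharedStats:
    """Toy stand-in for a table's shared stats entry (hypothetical)."""
    live: int = 0
    dead: int = 0


@dataclass
class Report:
    """One flushed batch of pending stats from a backend (hypothetical)."""
    live_delta: int = 0
    dead_delta: int = 0
    truncated: bool = False
    # Rows the truncating backend believed existed when it truncated.
    known_live: int = 0
    known_dead: int = 0


def apply_reset(shared: SharedStats, r: Report) -> None:
    """Model A: a truncate sets the counters to 0, then deltas are added."""
    if r.truncated:
        shared.live = 0
        shared.dead = 0
    shared.live += r.live_delta
    shared.dead += r.dead_delta


def apply_subtract(shared: SharedStats, r: Report) -> None:
    """Model B: a truncate subtracts only what the truncating backend knew."""
    if r.truncated:
        shared.live -= r.known_live
        shared.dead -= r.known_dead
    shared.live += r.live_delta
    shared.dead += r.dead_delta


def replay(apply, reports):
    """Apply reports in the given arrival order, return (live, dead)."""
    shared = SharedStats()
    for r in reports:
        apply(shared, r)
    return (shared.live, shared.dead)


# Assumed scenario: s1 inserted 3 rows, s2 truncated knowing about 3 rows.
s1_report = Report(live_delta=3)
s2_report = Report(truncated=True, known_live=3)

reset_in_order = replay(apply_reset, [s1_report, s2_report])
reset_reordered = replay(apply_reset, [s2_report, s1_report])
subtract_in_order = replay(apply_subtract, [s1_report, s2_report])
subtract_reordered = replay(apply_subtract, [s2_report, s1_report])
```

With resetting, the final counters depend on which report arrives first. With
subtracting they do not in this scenario, although the shared counter can go
negative in between, and the result drifts whenever the truncating backend's
view of the row counts is off, which is the kind of inaccuracy mentioned above.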