I wrote: > I'm a bit mystified by this. This test was introduced in Andres' > commit 10a082bf7 of 2023-02-11, and it seems to have been stable > since then. I trawled the buildfarm logs going back three months > and found no similar failures. So why's it failing now? The > most plausible theory seems to be that Michael's recent commits > adding pg_stat_reset_xxx features destabilized the test somehow ... > but I sure don't see how/why.
After a bit more looking around, I have part of a theory. Commit 23c8c0c8f of 2023-11-12 added this, a little ways before the problematic test: -- Test that reset_shared with no argument resets all the stats types -- supported (providing NULL as argument has the same effect). SELECT pg_stat_reset_shared(); The test that is failing is of course -- Test IO stats reset SELECT pg_stat_have_stats('io', 0, 0); SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset FROM pg_stat_io \gset SELECT pg_stat_reset_shared('io'); SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset FROM pg_stat_io \gset SELECT :io_stats_post_reset < :io_stats_pre_reset; So the observed failure could be explained if, between the "pg_stat_reset_shared('io')" call and the subsequent scan of pg_stat_io, concurrent sessions had done more I/O operations than happened since that new pg_stat_reset_shared() call. Previously, the "pre_reset" counts would be large enough to make that a pretty ridiculous theory, but after 23c8c0c8f maybe it's not. To test this idea, I made the test print out the actual values of the counts, like this: @@ -1585,10 +1585,10 @@ SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset FROM pg_stat_io \gset -SELECT :io_stats_post_reset < :io_stats_pre_reset; - ?column? ----------- - t +SELECT :io_stats_post_reset, :io_stats_pre_reset; + ?column? | ?column? +----------+---------- + 10452 | 190087 (1 row) Of course, this makes it fail every time, but the idea is to get a sense of the magnitude of the counts; and what I'm seeing is that the "pre reset" counts are typically 10x more than the "post reset" ones, even after 23c8c0c8f. If I remove the suspicious pg_stat_reset_shared() call, there's about 3 orders of magnitude difference; but still you'd think a 10x safety margin would be enough. So this theory doesn't seem to quite work as-is. Perhaps there's some additional contributing factor I didn't think to control. Nonetheless, it seems like a really bad idea that this test of I/O stats reset happens after the newly-added test. It is clearly now dependent on timing and the amount of concurrent activity whether it will pass or not. We should probably re-order the tests to do the old test first; or else abandon this test methodology and just test I/O reset the same way we test the other cases (checking only for timestamp advance). Or maybe we don't really need the pg_stat_reset_shared() test? regards, tom lane