On Mon, Nov 27, 2023 at 02:01:51PM -0500, Tom Lane wrote:
> The problem as I see it is that this test:
>
> SELECT :io_stats_post_reset < :io_stats_pre_reset;
>
> requires an assumption that less I/O has happened since the commanded
> reset action than happened before it (extending back to the previous
> reset, or cluster start).  Since concurrent processes might be doing
> I/O, this has a race condition.  If we are slow enough about obtaining
> :io_stats_post_reset, the test *will* fail eventually.  But the shorter
> the distance back to the previous reset, the bigger the odds of
> observable trouble; thus Michael's concern that adding more reset
> tests in future would increase the risk of failure.
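
To make the race concrete, here is a minimal Python sketch of the
comparison, not taken from the regression test itself.  It assumes
background I/O accumulates at a constant rate after the reset; the
function names and the 100k ops/s rate are hypothetical, chosen only
for illustration.

```python
def reset_check_passes(pre_reset_total, background_io_rate, delay_seconds):
    """Model of `SELECT :io_stats_post_reset < :io_stats_pre_reset`.

    pre_reset_total: I/O counted between the previous reset and this one.
    background_io_rate: assumed constant I/O ops per second from
        concurrent processes (checkpointer, autovacuum, ...).
    delay_seconds: time between the reset and reading the new counters.
    """
    # After the reset the counters restart from zero, so whatever is
    # observed afterwards is the I/O accumulated during the delay.
    post_reset_total = int(background_io_rate * delay_seconds)
    return post_reset_total < pre_reset_total


def max_safe_delay(pre_reset_total, background_io_rate):
    """Delay (seconds) at which the comparison starts to fail."""
    return pre_reset_total / background_io_rate


if __name__ == "__main__":
    rate = 100_000  # hypothetical background I/O ops per second
    for pre in (7_000_000, 50_000):
        print(pre, max_safe_delay(pre, rate),
              reset_check_passes(pre, rate, delay_seconds=1.0))
```

Under this model the tolerated delay scales linearly with the
pre-reset total, so a smaller baseline leaves proportionally less
room before the comparison flips.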

The new reset added just before checking the contents of pg_stat_io
reduces :io_stats_pre_reset from 7M to 50k.  That's a threshold that is
easy to reach if you have a checkpoint or an autovacuum running in
parallel.  I have not checked the buildfarm logs in detail, but I'd
put a coin on a checkpoint triggered by time if the issue happened on
a slow machine.
--
Michael