Re: New instability in stats regression test

Tom Lane Sat, 25 Nov 2023 11:34:56 -0800

I wrote:
> I'm a bit mystified by this.  This test was introduced in Andres'
> commit 10a082bf7 of 2023-02-11, and it seems to have been stable
> since then.  I trawled the buildfarm logs going back three months
> and found no similar failures.  So why's it failing now?  The
> most plausible theory seems to be that Michael's recent commits
> adding pg_stat_reset_xxx features destabilized the test somehow ...
> but I sure don't see how/why.


After a bit more looking around, I have part of a theory.
Commit 23c8c0c8f of 2023-11-12 added this, a little ways before
the problematic test:

-- Test that reset_shared with no argument resets all the stats types
-- supported (providing NULL as argument has the same effect).
SELECT pg_stat_reset_shared();

The test that is failing is of course

-- Test IO stats reset
SELECT pg_stat_have_stats('io', 0, 0);
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) +
       sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
  FROM pg_stat_io \gset
SELECT pg_stat_reset_shared('io');
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) +
       sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
  FROM pg_stat_io \gset
SELECT :io_stats_post_reset < :io_stats_pre_reset;

So the observed failure could be explained if, between the
"pg_stat_reset_shared('io')" call and the subsequent scan of
pg_stat_io, concurrent sessions had done more I/O operations
than had accumulated between that new pg_stat_reset_shared()
call and the "pre_reset" scan.
Previously, the "pre_reset" counts would be large enough to
make that a pretty ridiculous theory, but after 23c8c0c8f maybe
it's not.

To test this idea, I made the test print out the actual values
of the counts, like this:

@@ -1585,10 +1585,10 @@
 
 SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
-SELECT :io_stats_post_reset < :io_stats_pre_reset;
- ?column? 
-----------
- t
+SELECT :io_stats_post_reset, :io_stats_pre_reset;
+ ?column? | ?column? 
+----------+----------
+    10452 |   190087
 (1 row)
 
Of course, this makes it fail every time, but the idea is to get
a sense of the magnitude of the counts; and what I'm seeing is
that the "pre reset" counts are typically 10x more than the
"post reset" ones, even after 23c8c0c8f.  If I remove the
suspicious pg_stat_reset_shared() call, there's about 3 orders
of magnitude difference; but still you'd think a 10x safety
margin would be enough.  So this theory doesn't seem to quite
work as-is.  Perhaps there's some additional contributing factor
I didn't think to control.
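
One way to look for such a factor would be to break the same counters
down by backend type right after the reset, to see which processes
contribute to the "post reset" total.  This is only a sketch and was not
part of the experiment described above; the coalesce() calls and the
io_ops alias are additions for illustration, on the assumption that some
counters may be NULL for every row of a given backend type.

```sql
-- Sketch only: per-backend-type breakdown of the counters the test sums.
-- coalesce() guards against groups in which a counter is NULL for every row,
-- which would otherwise make the whole sum NULL.
SELECT backend_type,
       coalesce(sum(evictions), 0) + coalesce(sum(reuses), 0) +
       coalesce(sum(extends), 0) + coalesce(sum(fsyncs), 0) +
       coalesce(sum(reads), 0) + coalesce(sum(writes), 0) +
       coalesce(sum(writebacks), 0) + coalesce(sum(hits), 0) AS io_ops
  FROM pg_stat_io
 GROUP BY backend_type
 ORDER BY io_ops DESC;
```

The output depends on whatever else is running, so a query like this is
useful for investigation only and could not go into the regression test
itself.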

Nonetheless, it seems like a really bad idea that this test
of I/O stats reset happens after the newly-added test.  It
is clearly now dependent on timing and the amount of concurrent
activity whether it will pass or not.  We should probably
re-order the tests to do the old test first; or else abandon
this test methodology and just test I/O reset the same way
we test the other cases (checking only for timestamp advance).
Or maybe we don't really need the pg_stat_reset_shared() test?
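
For the timestamp-based alternative, a minimal sketch might look like
the following.  It assumes that pg_stat_io exposes a stats_reset column
that is already non-NULL at this point in the test (which the earlier
pg_stat_reset_shared() call should ensure); the variable name
io_reset_ts is made up for illustration, and this is not necessarily
the form any actual fix would take.

```sql
-- Sketch only: check that the reset timestamp advances, instead of
-- comparing I/O counters that concurrent sessions can inflate.
-- io_reset_ts is a hypothetical psql variable name.
SELECT max(stats_reset) AS io_reset_ts FROM pg_stat_io \gset
SELECT pg_stat_reset_shared('io');
SELECT max(stats_reset) > :'io_reset_ts'::timestamptz FROM pg_stat_io;
```

Unlike the counter comparison, this check should not depend on how much
I/O other sessions perform around the reset.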

                        regards, tom lane
