Re: Resetting spilled txn statistics in pg_stat_replication

Amit Kapila Mon, 12 Oct 2020 23:27:59 -0700

On Tue, Oct 13, 2020 at 11:49 AM Masahiko Sawada
<[email protected]> wrote:
>
> On Tue, 13 Oct 2020 at 14:53, Amit Kapila <[email protected]> wrote:
> >
> > On Tue, Oct 13, 2020 at 11:05 AM Tom Lane <[email protected]> wrote:
> > >
> > > Amit Kapila <[email protected]> writes:
> > > >> It is possible that MAXALIGN stuff is playing a role here and/or the
> > > >> background transaction stuff. I think if we go with the idea of
> > > >> testing spill_txns and spill_count being positive then the results
> > > >> will be stable. I'll write a patch for that.
> > >
> > > Here's our first failure on a MAXALIGN-8 machine:
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-10-13%2005%3A00%3A08
> > >
> > > So this is just plain not stable.  It is odd though.  I can
> > > easily think of mechanisms that would cause the WAL volume
> > > to occasionally be *more* than the "typical" case.  What
> > > would cause it to be *less*, if MAXALIGN is ruled out?
> > >
> >
> > The original theory I have given above [1], which is an interleaved
> > autovacuum transaction. Let me try to explain in a bit more detail.
> > Say when transaction T-1 is performing Insert ('INSERT INTO stats_test
> > SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000)
> > g(i);') a parallel autovacuum transaction occurs. The problem as seen
> > in buildfarm will happen when autovacuum transaction happens after 80%
> > or more of the Insert is done.
> >
> > In such a situation we will start decoding 'Insert' first and need to
> > spill multiple times due to the amount of changes (more than threshold
> > logical_decoding_work_mem) and then before we encounter Commit of
> > transaction that performed Insert (and probably some more changes from
> > that transaction) we will encounter a small transaction (autovacuum
> > transaction).  The decode of that small transaction will send the
> > stats collected till now which will lead to the problem shown in
> > buildfarm.
>
> That seems a possible scenario.
>
> I think probably this also explains the reason why spill_count
> slightly varied and spill_txns was still 1. The spill_count value
> depends on how much the process spilled out transactions before
> encountering the commit of an autovacuum transaction. Since we have
> the spill statistics per reorder buffer, not per transactions, it's
> possible.
>


Okay, here is an updated version (changed some comments) of the patch
I posted some time back. What do you think? I have tested this on both
Windows and Linux environments. I think it is a bit tricky to
reproduce the exact scenario, so if you are fine with it we can push
this and check, or let me know if you have any better ideas.
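
To make the interleaving scenario quoted above a bit more concrete,
here is a rough toy model in Python. It is only a sketch and not
PostgreSQL code: the event tuples, the decode() and build_events()
helpers, the work_mem value of 1000 and the one-unit change size are
all hypothetical. The only things it takes from the discussion are
that spill statistics are kept per reorder buffer and that decoding
the commit of any transaction reports whatever has been collected so
far.

```python
# Toy model only, not PostgreSQL code. All names and numbers are hypothetical.
from dataclasses import dataclass


@dataclass
class SpillStats:
    spill_txns: int = 0
    spill_count: int = 0


def decode(events, work_mem):
    """Walk a WAL-like event list and return every stats report that is sent.

    events holds ("change", xid, size) or ("commit", xid) tuples.
    Each report is a tuple (committing xid, spill_txns, spill_count).
    """
    stats = SpillStats()   # kept per reorder buffer, not per transaction
    buffered = {}          # xid -> units currently held in memory
    spilled_xids = set()   # transactions that have spilled at least once
    reports = []
    for ev in events:
        if ev[0] == "change":
            _, xid, size = ev
            buffered[xid] = buffered.get(xid, 0) + size
            if sum(buffered.values()) >= work_mem:
                # Over the limit: spill the largest buffered transaction.
                victim = max(buffered, key=buffered.get)
                if victim not in spilled_xids:
                    spilled_xids.add(victim)
                    stats.spill_txns += 1
                stats.spill_count += 1
                buffered[victim] = 0
        else:
            _, xid = ev
            buffered.pop(xid, None)
            # Any commit reports the counters collected so far.
            reports.append((xid, stats.spill_txns, stats.spill_count))
    return reports


def build_events(interleave_at=None, total=5000):
    """Big insert in xid 1, optionally with a small xid 2 committing mid-way."""
    events = []
    for i in range(1, total + 1):
        events.append(("change", 1, 1))
        if interleave_at == i:
            events.append(("commit", 2))  # small unrelated transaction
    events.append(("commit", 1))
    return events


baseline = decode(build_events(), work_mem=1000)
interleaved = decode(build_events(interleave_at=4000), work_mem=1000)
print(baseline)
print(interleaved)
```

In this model the first report in the interleaved run already shows
spill_txns = 1 but a smaller spill_count than the final value, which
is the kind of variation seen in the buildfarm. Checking only that
both counters are positive holds for either run.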

-- 
With Regards,
Amit Kapila.

fix_stats_test_2.patch
Description: Binary data
