Hi,
On Wed, Oct 28, 2020 at 12:34:58PM +0500, Andrey Borodin wrote:
> Tomas, thanks for looking into this!
>
> On 28 Oct 2020, at 06:36, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>> This thread started with a discussion about making the SLRU sizes
>> configurable, but this patch version only adds a local cache. Does this
>> achieve the same goal, or would we still gain something by having GUCs
>> for the SLRUs?
>>
>> If we're claiming this improves performance, it'd be good to have some
>> workload demonstrating that and measurements. I don't see anything like
>> that in this thread, so it's a bit hand-wavy. Can someone share details
>> of such a workload (even a synthetic one) and some basic measurements?
>
> All patches in this thread aim at the same goal: improving performance in
> the presence of MultiXact lock contention.
>
> I could not build a synthetic reproduction of the problem, but I did some
> MultiXact stressing here [0]. It's a clumsy test program, because it is
> still not clear to me which workload parameters trigger MultiXact lock
> contention. In the generic case I was encountering other locks like
> *GenLock: XidGenLock, MultiXactGenLock, etc. Yet our production system has
> hit this problem approximately once a month throughout this year.
>
> The test program takes FOR SHARE locks on different sets of tuples in the
> presence of concurrent full scans.
>
> To produce a set of locks we choose one of 14 bits. If a row number has
> this bit set to 0, we lock that row.
>
> I measured the time to lock all rows, 3 times for each of the 14 bits,
> observing the total time to set all locks.
>
> During the test I was watching locks in pg_stat_activity; if they did not
> contain enough MultiXact locks, I tuned the parameters further (number of
> concurrent clients, number of bits, select queries, etc.).
>
> Why is it so complicated? It seems that other attempts to reproduce the
> problem were running into other locks.
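For concreteness, here is a minimal sketch (Python + psycopg2) of the
workload as I read the description above. The table name, row count, client
count and repeat count are my own illustrative choices, not taken from the
test program at [0]; it assumes something like
CREATE TABLE t (id int PRIMARY KEY) filled with generate_series(1, 100000):

import threading
import time

import psycopg2

DSN = "dbname=postgres"   # adjust to the test cluster
NBITS = 14                # one of 14 bits selects the row subset, as described
CLIENTS = 8               # illustrative number of concurrent lockers
REPEATS = 3               # each bit is measured three times, as described

def lock_rows_for_bit(bit):
    """Take FOR SHARE locks on every row whose id has the given bit set to 0.

    Several sessions doing this concurrently on overlapping rows is what
    forces the row locks to be recorded as MultiXacts.
    """
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:   # one transaction holds all the locks
            cur.execute("SELECT id FROM t WHERE (id & %s) = 0 FOR SHARE",
                        (1 << bit,))
            cur.fetchall()
    finally:
        conn.close()

def full_scan():
    """Concurrent full scan over the same table, as in the description above."""
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM t")
            cur.fetchone()
    finally:
        conn.close()

def run():
    for bit in range(NBITS):
        for _ in range(REPEATS):
            start = time.monotonic()
            workers = [threading.Thread(target=lock_rows_for_bit, args=(bit,))
                       for _ in range(CLIENTS)]
            workers += [threading.Thread(target=full_scan)
                        for _ in range(CLIENTS)]
            for w in workers:
                w.start()
            for w in workers:
                w.join()
            print(f"bit {bit}: {time.monotonic() - start:.3f}s to set all locks")

if __name__ == "__main__":
    run()

The overlapping FOR SHARE locks from several sessions are what should push
the row locks into MultiXacts; whether that then shows up as MultiXact SLRU
contention rather than XidGenLock/MultiXactGenLock is exactly the open
question above.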
It's not my intention to be mean or anything like that, but to me this
means we don't really understand the problem we're trying to solve. Had
we understood it, we would be able to construct a workload reproducing
the issue ...

I understand what the individual patches are doing, and maybe those
changes are desirable in general. But without any benchmarks from a
plausible workload I find it hard to convince myself that:

(a) it actually will help with the issue you're observing in production

and

(b) it's actually worth the extra complexity (e.g. the lwlock changes)

I'm willing to invest some of my time into reviewing/testing this, but I
think we badly need better insight into the issue, so that we can build
a workload reproducing it. Perhaps collecting some perf profiles and a
sample of the queries might help, but I assume you already tried that.
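One small thing that might help with that: sampling pg_stat_activity for
MultiXact-related wait events while a candidate workload runs, to check that
it actually blocks where we think it does. A hedged sketch (the ILIKE pattern
is deliberately loose because the exact wait event names differ across
versions; the DSN and timings are placeholders):

import time

import psycopg2

DSN = "dbname=postgres"   # adjust to the cluster under test

def sample_multixact_waits(seconds=60, interval=0.5):
    """Periodically count MultiXact-related wait events in pg_stat_activity."""
    conn = psycopg2.connect(DSN)
    conn.autocommit = True   # don't hold a long transaction open while sampling
    try:
        with conn.cursor() as cur:
            deadline = time.monotonic() + seconds
            while time.monotonic() < deadline:
                cur.execute(
                    "SELECT wait_event_type, wait_event, count(*) "
                    "FROM pg_stat_activity "
                    "WHERE wait_event ILIKE %s "
                    "GROUP BY 1, 2",
                    ('%multixact%',))
                for wtype, wevent, n in cur.fetchall():
                    print(f"{wtype}/{wevent}: {n}")
                time.sleep(interval)
    finally:
        conn.close()

if __name__ == "__main__":
    sample_multixact_waits()

Run alongside the stress program, something like this might make it easier to
tell whether a reproduction hits the same contention you see in production.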
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services