Hi,

After the issue reported in [1] got fixed, I've restarted the multi-xact stress test, hoping to reproduce the issue. But so far no luck :-(
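A quick note on how wait-event summaries like the ones below can be collected: one simple way is to periodically sample wait_event_type / wait_event from pg_stat_activity and aggregate the samples. A minimal sketch (the table name and the sampling interval are just illustrative, not necessarily what I used for the numbers below) might look like this:

  -- illustrative sampling table
  create table wait_event_samples (
      ts     timestamptz default now(),
      e_type text,
      e_name text
  );

  -- take one snapshot of all other backends; run this periodically,
  -- e.g. with "\watch 1" from psql
  insert into wait_event_samples (e_type, e_name)
  select wait_event_type, wait_event
    from pg_stat_activity
   where pid != pg_backend_pid();

  -- aggregate the samples into a summary like the ones below
  select e_type, e_name, count(*) as sum
    from wait_event_samples
   group by 1, 2
   order by 3 desc;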
I've started slightly different tests on two machines - on one machine I've done this:

a) init.sql

  create table t (a int);
  insert into t select i from generate_series(1,100000000) s(i);
  alter table t add primary key (a);

b) select.sql

  SELECT * FROM t
   WHERE a = (1+mod(abs(hashint4(extract(epoch from now())::int)), 100000000))
     FOR KEY SHARE;

c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test

The idea is to have a large table and many clients hitting a small random subset of the rows. A sample of wait events from the ~24h run looks like this:

  e_type  |        e_name        |   sum
 ----------+----------------------+----------
  LWLock   | BufferContent        | 13913863
           |                      |  7194679
  LWLock   | WALWrite             |  1710507
  Activity | LogicalLauncherMain  |   726599
  Activity | AutoVacuumMain       |   726127
  Activity | WalWriterMain        |   725183
  Activity | CheckpointerMain     |   604694
  Client   | ClientRead           |   599900
  IO       | WALSync              |   502904
  Activity | BgWriterMain         |   378110
  Activity | BgWriterHibernate    |   348464
  IO       | WALWrite             |   129412
  LWLock   | ProcArray            |     6633
  LWLock   | WALInsert            |     5714
  IO       | SLRUWrite            |     2580
  IPC      | ProcArrayGroupUpdate |     2216
  LWLock   | XactSLRU             |     2196
  Timeout  | VacuumDelay          |     1078
  IPC      | XactGroupUpdate      |      737
  LWLock   | LockManager          |      503
  LWLock   | WALBufMapping        |      295
  LWLock   | MultiXactMemberSLRU  |      267
  IO       | DataFileWrite        |       68
  LWLock   | BufferIO             |       59
  IO       | DataFileRead         |       27
  IO       | DataFileFlush        |       14
  LWLock   | MultiXactGen         |        7
  LWLock   | BufferMapping        |        1

So, nothing particularly interesting - there certainly are not many wait events related to SLRU.

On the other machine I did this:

a) init.sql

  create table t (a int primary key);
  insert into t select i from generate_series(1,1000) s(i);

b) select.sql

  select * from t for key share;

c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test

and the wait events (also from a ~24h run) look like this:

  e_type   |        e_name         |   sum
 -----------+-----------------------+----------
  LWLock    | BufferContent         | 20804925
            |                       |  2575369
  Activity  | LogicalLauncherMain   |   745780
  Activity  | AutoVacuumMain        |   745292
  Activity  | WalWriterMain         |   740507
  Activity  | CheckpointerMain      |   737691
  Activity  | BgWriterHibernate     |   731123
  LWLock    | WALWrite              |   570107
  IO        | WALSync               |   452603
  Client    | ClientRead            |   151438
  BufferPin | BufferPin             |    23466
  LWLock    | WALInsert             |    21631
  IO        | WALWrite              |    19050
  LWLock    | ProcArray             |    15082
  Activity  | BgWriterMain          |    14655
  IPC       | ProcArrayGroupUpdate  |     7772
  LWLock    | WALBufMapping         |     3555
  IO        | SLRUWrite             |     1951
  LWLock    | MultiXactGen          |     1661
  LWLock    | MultiXactMemberSLRU   |      359
  LWLock    | MultiXactOffsetSLRU   |      242
  LWLock    | XactSLRU              |      141
  IPC       | XactGroupUpdate       |      104
  LWLock    | LockManager           |       28
  IO        | DataFileRead          |        4
  IO        | ControlFileSyncUpdate |        1
  Timeout   | VacuumDelay           |        1
  IO        | WALInitWrite          |        1

Again, nothing particularly interesting - only a few SLRU wait events.

So unfortunately this does not really reproduce the SLRU locking issues you're observing - clearly, there has to be something else triggering them. Perhaps this workload is too simplistic, or maybe we need to run different queries. Or maybe the hardware needs to be somewhat different (more CPUs? different storage?).

[1] https://www.postgresql.org/message-id/20201104013205.icogbi773przyny5@development


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company