On Fri, Nov 13, 2020 at 8:13 PM Radoslav Nedyalkov <rnedyal...@gmail.com> wrote:
>
> On Fri, Nov 13, 2020 at 7:37 PM Laurenz Albe <laurenz.a...@cybertec.at> wrote:
>
>> On Fri, 2020-11-13 at 15:24 +0200, Radoslav Nedyalkov wrote:
>> > On a very busy master-standby setup which runs typical olap processing -
>> > long living , massive writes statements, we're getting on the standby:
>> >
>> >  ERROR: canceling statement due to conflict with recovery
>> >  FATAL: terminating connection due to conflict with recovery
>> >
>> > The weird thing is that cancellations happen usually after standby has experienced
>> > some huge delay(2h), still not at the allowed maximum(3h). Even recently run statements
>> > got cancelled when the delay is already at zero.
>> >
>> > Sometimes the situation got relaxed after an hour or so.
>> > Restarting the server instantly helps.
>> >
>> > It is pg11.8, centos7, hugepages, shared_buffers 196G from 748G.
>> >
>> > What phenomenon could we be facing?
>>
>> Hard to say. Perhaps an unusual kind of replication conflict?
>>
>> What is in "pg_stat_database_conflicts" on the standby server?
>>
>
> db01=# select * from pg_stat_database_conflicts;
>  datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
> -------+-----------+------------------+------------+----------------+-----------------+----------------
>  13877 | template0 |                0 |          0 |              0 |               0 |              0
>  16400 | template1 |                0 |          0 |              0 |               0 |              0
>  16402 | postgres  |                0 |          0 |              0 |               0 |              0
>  16401 | db01      |                0 |          0 |             51 |               0 |              0
> (4 rows)
>
> On a freshly restarted standby we've just got similar behaviour after a
> 2 hours delay and a slow catch-up.
> confl_snapshot is 51 and we have exactly the same number of cancelled
> statements.
>

No luck so far. Searching for an explanation, I found that we fall into the
unexplained case where snapshot conflicts happen even though
hot_standby_feedback is on.

Thanks,
Rado