On Sun, Nov 15, 2020 at 12:48 AM Mohamed Wael Khobalatte <mkhobala...@grubhub.com> wrote:
> On Sat, Nov 14, 2020 at 2:46 PM Radoslav Nedyalkov <rnedyal...@gmail.com> wrote:
>
>> On Fri, Nov 13, 2020 at 8:13 PM Radoslav Nedyalkov <rnedyal...@gmail.com> wrote:
>>
>>> On Fri, Nov 13, 2020 at 7:37 PM Laurenz Albe <laurenz.a...@cybertec.at> wrote:
>>>
>>>> On Fri, 2020-11-13 at 15:24 +0200, Radoslav Nedyalkov wrote:
>>>> > On a very busy master-standby setup which runs typical OLAP processing
>>>> > (long-lived, massively writing statements), we're getting on the standby:
>>>> >
>>>> > ERROR: canceling statement due to conflict with recovery
>>>> > FATAL: terminating connection due to conflict with recovery
>>>> >
>>>> > The weird thing is that cancellations usually happen after the standby
>>>> > has experienced a huge delay (2h), still below the allowed maximum (3h).
>>>> > Even recently started statements get cancelled when the delay is
>>>> > already at zero.
>>>> >
>>>> > Sometimes the situation relaxes after an hour or so.
>>>> > Restarting the server helps instantly.
>>>> >
>>>> > It is pg11.8, centos7, hugepages, shared_buffers 196G out of 748G.
>>>> >
>>>> > What phenomenon could we be facing?
>>>>
>>>> Hard to say.  Perhaps an unusual kind of replication conflict?
>>>>
>>>> What is in "pg_stat_database_conflicts" on the standby server?
>>>
>>> db01=# select * from pg_stat_database_conflicts;
>>>  datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
>>> -------+-----------+------------------+------------+----------------+-----------------+----------------
>>>  13877 | template0 |                0 |          0 |              0 |               0 |              0
>>>  16400 | template1 |                0 |          0 |              0 |               0 |              0
>>>  16402 | postgres  |                0 |          0 |              0 |               0 |              0
>>>  16401 | db01      |                0 |          0 |             51 |               0 |              0
>>> (4 rows)
>>>
>>> On a freshly restarted standby we've just seen similar behaviour after a
>>> two-hour delay and a slow catch-up. confl_snapshot is 51, and we have
>>> exactly the same number of cancelled statements.
>>
>> No luck so far. Searching for an explanation, I found we fall into the
>> unexplained case where snapshot conflicts happen even though
>> hot_standby_feedback is on.
>>
>> Thanks,
>> Rado
>
> Perhaps you have a value set for old_snapshot_threshold? If not, do the
> walreceiver connections drop out?

old_snapshot_threshold is -1 on both master and replica. walreceiver does not drop.
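For anyone following this thread: the cancellation behaviour discussed above is governed by a handful of standby-side settings. A quick sketch of how to inspect them (all of these are standard PostgreSQL GUCs; exact output layout varies by version):

```sql
-- Standby-side settings relevant to "conflict with recovery" cancellations.
-- hot_standby_feedback and old_snapshot_threshold are the two settings
-- discussed in this thread; the *_delay settings bound how long recovery
-- waits before cancelling a conflicting query.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('hot_standby_feedback',
               'max_standby_streaming_delay',
               'max_standby_archive_delay',
               'old_snapshot_threshold',
               'vacuum_defer_cleanup_age');
```

With hot_standby_feedback on, the standby reports its oldest running xmin back to the primary so vacuum defers cleanup of rows those queries still need, which is why snapshot conflicts are unexpected here unless the feedback channel is interrupted.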
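Since hot_standby_feedback only protects queries while the walreceiver connection is up, one way to double-check the answer above directly on the standby (a sketch; pg_stat_wal_receiver is available since PostgreSQL 9.6):

```sql
-- Confirm the walreceiver is streaming and when it last sent a message
-- to the primary (feedback travels over this same connection).
SELECT status, last_msg_send_time, latest_end_lsn
FROM pg_stat_wal_receiver;
```

An empty result or a status other than "streaming" during the delay window would mean feedback was not reaching the primary, which could explain snapshot conflicts despite hot_standby_feedback = on.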