You're welcome. Since last Saturday, when you addressed the 10-hour delay, have you seen any more such delay incidents with the new settings? What were the previous settings? Be aware that hot_standby_feedback = on combined with such long-running queries on the replica can increase bloat on the master. Are you measuring bloat? If so, have you noticed a significant increase?
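If you only have psql access, something like this is a quick way to eyeball it on the master -- just a sketch against the standard statistics views (the 10000-row cutoff is only an arbitrary example threshold):

-- dead tuples that autovacuum has not been able to clean up yet
SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

-- the oldest xmin that standby feedback is holding back on the master
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;

If n_dead_tup on the big table keeps climbing while one of those 10 - 20 hour replica queries is holding backend_xmin back, that's the bloat I mean.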
On Tue, Nov 20, 2018 at 23:08, Wyatt Alt <wyatt....@gmail.com> wrote:
> Hi Rene,
>
> On 11/19/18 8:46 PM, Rene Romero Benavides wrote:
>
> > Not sure about the root cause but I can make these observations and
> > raise some questions:
> > 1) 9.6.6 is five bug fix versions behind
>
> Valid point to raise.
>
> > 2) 300GB is so big a table, wouldn't it make sense to you to partition it?
> > 2a) or if it's partitioned, doesn't the time of creation or dropping of
> > new partitions match the time of the conflict?
>
> Partitioning is in the works, but none at the moment.
>
> > 3) can you track long running transactions on the master?
> > 4) what are the isolation levels on master / replica?
>
> Transaction times on the master max out around two minutes. On the
> replica they are much longer -- numerous 1 - 2 hour transactions per day,
> and occasional ones as long as 10 - 20 hours. Isolation levels are read
> committed everywhere.
>
> > 5) check for active locks in the replica, I guess you should see some
> > blocked transactions during big delays, I've seen this in the past when
> > standby_feedback is turned off.
> > 6) any out of the ordinary messages in the replica's logs? any evidence
> > that it has been canceling statements?
>
> I'll make a note to record the active locks next time. I haven't seen
> anything unusual in the logs during these incidents, but have observed
> statements getting canceled at other times, which is why I think the
> config mostly works.
>
> > 7) are master and replica exactly the same in terms of resources and
> > main parameters?
> > 8) how is performance in both nodes while the big delay is happening?
> > IO / cpu load / etc.
>
> This brings up a good detail I forgot to mention originally. During the
> last incident, IO utilization on the replica was near 100%, and had been
> for several hours, which I believe was due to the long queries I canceled.
> Now that I think about it, I wonder if the lag may have arisen from IO
> contention between the query and WAL replay, rather than a query conflict.
>
> > Also, check this out:
> > https://www.alibabacloud.com/forum/read-383
>
> Thanks, interesting reading.
>
> > On Mon, Nov 19, 2018 at 21:46, Wyatt Alt <wyatt....@gmail.com> wrote:
> >
>> Sorry, I see now there was a similar question a few days ago:
>>
>> https://www.postgresql.org/message-id/CAJw4d1WtzOdYzd8Nq2=ufk+z0jy0l_pfg9tvcwprmt3nczq...@mail.gmail.com
>>
>> Two ideas proposed (aside from disconnects):
>> * Autovacuum is truncating a page on the master and taking an
>> AccessExclusiveLock on the table in use on the replica
>> * A "pin conflict", which I'm still unfamiliar with.
>>
>> The user's response says they are in the first bucket, but the argument
>> relies on max_standby_streaming_delay set to -1, while mine is 5 minutes.
>> I need to understand pin conflicts better, but the likely scenario Andrew
>> outlined doesn't apply to me. My offending queries were doing bitmap heap
>> scans on a 300GB table.
>>
>> Reading the thread I see Andres ask for the "precise conflict" the user
>> gets -- is there a way I can get that without a datadir? And to re-frame
>> the original question, are there causes of replication lag that
>> max_standby_streaming_delay would not be expected to prevent, that would
>> be resolved by killing long standby queries? If so, what's the best way
>> to confirm?
>>
>> Wyatt
>>
>> On Mon, Nov 19, 2018 at 5:46 PM Wyatt Alt <wyatt....@gmail.com> wrote:
>>
>>> I've been struggling to eliminate replication lag on a Postgres 9.6.6
>>> instance on Amazon RDS. I believe the lag is caused by early cleanup
>>> conflicts from vacuums on the master, because I can reliably resolve it
>>> by killing long-running queries on the standby. I most recently saw ten
>>> hours of lag on Saturday and addressed it this way.
>>>
>>> The standby is running with
>>> hot_standby_feedback = on
>>> max_standby_streaming_delay = 5min
>>> max_standby_archive_delay = 30s
>>>
>>> I am not using replication slots on the primary due to reported
>>> negative interactions with pg_repack on large tables.
>>>
>>> My rationale for the first two settings is that hot_standby_feedback
>>> should address my issues almost all the time, but that
>>> max_standby_streaming_delay would sometimes be necessary as a fallback,
>>> for instance in cases of a transient connection loss between the
>>> standby and primary. I believe these settings are mostly working,
>>> because lag is less frequent than it was when I configured them.
>>>
>>> My questions are,
>>> * Am I overlooking anything in my configuration?
>>> * What would explain lag caused by query conflicts given the
>>> max_standby_streaming_delay setting? Shouldn't those queries be getting
>>> killed?
>>> * Is there any particular diagnostic info I should be collecting on the
>>> next occurrence, to help me figure out the cause? Note that as I'm on
>>> RDS, I don't have direct access to the datadir -- just psql.
>>>
>>> Thanks for any advice!
>>> Wyatt
>>>
>>
> --
> Genius is 1% inspiration and 99% perspiration.
> Thomas Alva Edison
> http://pglearn.blogspot.mx/

--
Genius is 1% inspiration and 99% perspiration.
Thomas Alva Edison
http://pglearn.blogspot.mx/
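PS: since you only have psql on RDS, the next time lag builds up it may be worth capturing, on the replica, whether the counters in pg_stat_database_conflicts are moving (which would point to recovery conflicts) or whether replay is simply falling behind what has been received (which would favor your IO-contention theory). A rough sketch, using the 9.6 function names:

-- cumulative recovery-conflict cancellations, per database (on the replica)
SELECT datname, confl_snapshot, confl_lock, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;

-- received vs. replayed WAL position, and approximate replay delay (on the replica)
SELECT pg_last_xlog_receive_location() AS received,
       pg_last_xlog_replay_location() AS replayed,
       now() - pg_last_xact_replay_timestamp() AS replay_delay;

If "received" keeps advancing while "replayed" stalls and the conflict counters don't move, that points to IO contention with WAL replay rather than a query conflict.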