How big have the delays been since the new settings? I'd guess significantly lower than before, right? By how much have they decreased?
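If you want to put a number on it, you can sample the lag straight from
psql on the replica (no datadir needed), something along these lines --
just keep in mind the first value also grows when the primary is simply
idle and not generating WAL:

  -- approximate replay lag as seen by the standby
  SELECT now() - pg_last_xact_replay_timestamp() AS approx_replay_lag;

  -- how far received WAL is ahead of replayed WAL (9.6 function names)
  SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
                               pg_last_xlog_replay_location()) AS replay_bytes_behind;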
On Wed, Nov 21, 2018 at 13:18, Rene Romero Benavides <rene.romer...@gmail.com> wrote:

> You're welcome.
> Since last Saturday, when you addressed the 10 hour delay, have you seen
> more such delay incidents with the new settings? What were the previous
> settings?
> Beware that hot_standby_feedback = on combined with such long queries on
> the replica can increase bloat on the master. Are you measuring bloat?
> If so, do you notice a significant increase?
>
> On Tue, Nov 20, 2018 at 23:08, Wyatt Alt <wyatt....@gmail.com> wrote:
>
>> Hi Rene,
>> On 11/19/18 8:46 PM, Rene Romero Benavides wrote:
>>
>> Not sure about the root cause, but I can make these observations and
>> raise some questions:
>> 1) 9.6.6 is five bug fix versions behind.
>>
>> Valid point to raise.
>>
>> 2) 300GB is a big table -- wouldn't it make sense to partition it?
>> 2a) Or, if it's already partitioned, does the time of creating or
>> dropping partitions match the time of the conflict?
>>
>> Partitioning is in the works, but none at the moment.
>>
>> 3) Can you track long running transactions on the master?
>> 4) What are the isolation levels on master / replica?
>>
>> Transaction times on the master max out around two minutes. On the
>> replica they are much longer -- numerous 1 - 2 hour transactions per
>> day, and occasional ones as long as 10 - 20 hours. Isolation levels are
>> read committed everywhere.
>>
>> 5) Check for active locks on the replica. I guess you should see some
>> blocked transactions during big delays; I've seen this in the past when
>> hot_standby_feedback is turned off.
>> 6) Any out of the ordinary messages in the replica's logs? Any evidence
>> that it has been canceling statements?
>>
>> I'll make a note to record the active locks next time. I haven't seen
>> anything unusual in the logs during these incidents, but have observed
>> statements getting canceled at other times, which is why I think the
>> config mostly works.
>>
>> 7) Are master and replica exactly the same in terms of resources and
>> main parameters?
>> 8) How is performance on both nodes while the big delay is happening?
>> IO / CPU load / etc.
>>
>> This brings up a good detail I forgot to mention originally. During the
>> last incident, IO utilization on the replica was near 100%, and had been
>> for several hours, which I believe was due to the long queries I
>> canceled. Now that I think about it, I wonder if the lag may have arisen
>> from IO contention between the query and WAL replay, rather than a query
>> conflict.
>>
>> Also, check this out:
>> https://www.alibabacloud.com/forum/read-383
>>
>> Thanks, interesting reading.
>>
>> On Mon, Nov 19, 2018 at 21:46, Wyatt Alt <wyatt....@gmail.com> wrote:
>>
>>> Sorry, I see now there was a similar question a few days ago:
>>>
>>> https://www.postgresql.org/message-id/CAJw4d1WtzOdYzd8Nq2=ufk+z0jy0l_pfg9tvcwprmt3nczq...@mail.gmail.com
>>>
>>> Two ideas were proposed (aside from disconnects):
>>> * Autovacuum is truncating a page on the master and taking an
>>> AccessExclusiveLock on a table in use on the replica.
>>> * A "pin conflict", which I'm still unfamiliar with.
>>>
>>> The user's response says they are in the first bucket, but that
>>> argument relies on max_standby_streaming_delay being set to -1, while
>>> mine is 5 minutes. I need to understand pin conflicts better, but the
>>> likely scenario Andrew outlined doesn't apply to me. My offending
>>> queries were doing bitmap heap scans on a 300GB table.
>>>
>>> Reading the thread, I see Andres ask for the "precise conflict" the
>>> user gets -- is there a way I can get that without a datadir? And to
>>> reframe the original question: are there causes of replication lag
>>> that max_standby_streaming_delay would not be expected to prevent, but
>>> that would be resolved by killing long standby queries? If so, what's
>>> the best way to confirm?
>>>
>>> Wyatt
>>>
>>> On Mon, Nov 19, 2018 at 5:46 PM Wyatt Alt <wyatt....@gmail.com> wrote:
>>>
>>>> I've been struggling to eliminate replication lag on a Postgres 9.6.6
>>>> instance on Amazon RDS. I believe the lag is caused by early cleanup
>>>> conflicts from vacuums on the master, because I can reliably resolve
>>>> it by killing long-running queries on the standby. I most recently saw
>>>> ten hours of lag on Saturday and addressed it this way.
>>>>
>>>> The standby is running with
>>>> hot_standby_feedback = on
>>>> max_standby_streaming_delay = 5min
>>>> max_standby_archive_delay = 30s
>>>>
>>>> I am not using replication slots on the primary due to reported
>>>> negative interactions with pg_repack on large tables.
>>>>
>>>> My rationale for the first two settings is that hot_standby_feedback
>>>> should address my issues almost all the time, but that
>>>> max_standby_streaming_delay would sometimes be necessary as a
>>>> fallback, for instance in cases of a transient connection loss between
>>>> the standby and primary. I believe these settings are mostly working,
>>>> because lag is less frequent than it was before I configured them.
>>>>
>>>> My questions are:
>>>> * Am I overlooking anything in my configuration?
>>>> * What would explain lag caused by query conflicts given the
>>>> max_standby_streaming_delay setting? Shouldn't those queries be
>>>> getting killed?
>>>> * Is there any particular diagnostic info I should be collecting on
>>>> the next occurrence to help me figure out the cause? Note that as I'm
>>>> on RDS, I don't have direct access to the datadir -- just psql.
>>>>
>>>> Thanks for any advice!
>>>> Wyatt

--
Genius is 1% inspiration and 99% perspiration.
Thomas Alva Edison
http://pglearn.blogspot.mx/
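PS: regarding the question further down about getting the "precise
conflict" without a datadir -- the standby's conflict counters and any
long-running or waiting sessions are visible from plain psql. A rough
sketch for 9.6 (adjust the filters to taste):

  -- cumulative counts of queries canceled per conflict type
  -- (snapshot, lock, bufferpin, tablespace, deadlock)
  SELECT * FROM pg_stat_database_conflicts WHERE datname = current_database();

  -- long-running or waiting sessions on the standby during a delay
  SELECT pid, state, wait_event_type, wait_event,
         now() - xact_start AS xact_age, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY xact_age DESC NULLS LAST;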