Michael,

Thanks so much for the quick response.

> On 15 Dec 2021, at 00:37, Michael Paquier <mich...@paquier.xyz> wrote:
> 
> On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote:
>> The reindex went fine in the primary database and in one of our
>> standby. The other standby that we also operate for some reason
>> ended up in a state where all transactions were locked by the WAL
>> process and the WAL process was not able to make any progress. In
>> order to solve this issue we had to move traffic from the “bad”
>> standby to the healthy one and then kill all transactions that were
>> running in the “bad” standby. After that, replication was able to
>> resume successfully.
> 
> You are referring to the startup process that replays WAL, right?
That is correct, I’m talking about the startup process that replays the WAL 
files.
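
In case it is useful, this is roughly what we can run on the standby to see
what the startup process is waiting on (a sketch, assuming a version where
pg_stat_activity exposes backend_type and wait_event):

  SELECT pid, wait_event_type, wait_event
  FROM pg_stat_activity
  WHERE backend_type = 'startup';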

> Without having an idea about the type of workload your primary and/or
> standbys are facing, as well as an idea of the configuration you are
> using on both (hot_standby_feedback for one), I have no direct idea,

The primary handles IoT data ingestion. The table that we had to REINDEX gets 
updated every time a new message arrives in the system, so updates happen very 
often on that table, hence the index/table bloat. The standby, at any point in 
time, would be receiving queries that use the indexes being reindexed. 
hot_standby_feedback is currently turned OFF on the standbys.
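
For reference, confirming the setting on a standby, and what flipping it on
would look like if we decide to try it, is roughly this (a sketch; the
parameter is reloadable, so a restart should not be needed):

  SHOW hot_standby_feedback;
  ALTER SYSTEM SET hot_standby_feedback = on;
  SELECT pg_reload_conf();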

> but that could be a conflict caused by a concurrent vacuum.

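If it helps, the recovery conflict counters on the affected standby can be
pulled with something along these lines (standard pg_stat_database_conflicts
view; the counters are cumulative since the last stats reset):

  SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
         confl_bufferpin, confl_deadlock
  FROM pg_stat_database_conflicts;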

> 
> Seeing where things got stuck could also be useful, perhaps with a
> backtrace of the area where it happens and some information around
> it.
> 
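
Understood; roughly what capturing one would involve, assuming gdb is
available on the standby host and <startup_pid> is the PID of the startup
process:

  gdb -p <startup_pid>
  (gdb) bt
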
>> I’m just trying to understand what could have caused this issue. I
>> was not able to identify any queries in the standby that would be
>> locking the WAL process. Any insight would be more than welcome!
> 
> That's not going to be easy without more information, I am afraid.
> --
> Michael
