On Wed, Aug 29, 2018 at 1:50 PM Michael Paquier <mich...@paquier.xyz> wrote:
> On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:
> > Oh, perhaps I do, depending on what you mean by worker. There are a
> > couple of periodic processes that connect to the server to obtain
> > metrics. Is that what is triggering this issue? In my case I could
> > probably suspend them until the replay has reached the desired point.
>
> That would be it. How do you decide when those begin to run and connect
> to Postgres? Do you use pg_isready or similar in a loop for sanity
> checks?

I do not; they just try to connect and bail if they cannot.

> > I have noticed this behavior in the past but prior to 9.6.10 restarting
> > the server would fix the issue. And the replay always seemed to reach a
> > point past which the problem would not re-occur.
>
> You are piquing my interest here. Did you actually see the same
> problem? In 9.6.10 what happens is that I have tightened the consistent
> point checks and logic so that inconsistent page issues would actually
> show up when they should, and that those become reproducible so that we
> can track down any rogue WAL record or inconsistent behavior.

Yes, I've seen this problem occasionally in the past, I think only in the
9.6 series. But before 9.6.10, if I restarted the server it would start
replaying WAL again, and typically when it reached the point where it had
PANICed before, it would instead report a consistent state and allow
read-only connections. Sometimes it would then PANIC again after more WAL
was replayed. But eventually it would reach a point where it seemed to be
able to replay WAL indefinitely without the issue happening.

dave