Re: WAL replay issue from 9.6.8 to 9.6.10

Dave Peticolas Wed, 29 Aug 2018 20:20:01 -0700

On Wed, Aug 29, 2018 at 1:50 PM Michael Paquier <mich...@paquier.xyz> wrote:

> On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:
> > Oh, perhaps I do, depending on what you mean by worker. There are a couple
> > of periodic processes that connect to the server to obtain metrics. Is that
> > what is triggering this issue? In my case I could probably suspend them
> > until the replay has reached the desired point.
>
> That would be it.  How do you decide when those begin to run and connect
> to Postgres?  Do you use pg_isready or similar in a loop for sanity
> checks?
>

I do not; they just try to connect and bail if they cannot.
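
For reference, a pg_isready loop of the kind Michael describes might look
roughly like the sketch below. This is not what my metrics processes do
today. The function names, the retry count, and the delay are all made up for
illustration; the only real pieces are the pg_isready flags (-q, -h, -p, -t)
and its exit status of 0 when the server is accepting connections. Note that
it only tells you the server accepts connections, not that replay has reached
any particular point.

```python
import subprocess
import time

# pg_isready exit statuses: 0 = accepting connections, 1 = rejecting
# (for example during startup), 2 = no response, 3 = no attempt made.
PG_ACCEPTING = 0


def run_pg_isready(host="localhost", port=5432, timeout_s=3):
    """Run pg_isready once and return its exit status.

    host, port, and timeout_s are placeholder defaults (assumptions).
    """
    cmd = ["pg_isready", "-q", "-h", host, "-p", str(port), "-t", str(timeout_s)]
    return subprocess.run(cmd).returncode


def wait_until_accepting(check=run_pg_isready, attempts=30, delay_s=10.0,
                         sleep=time.sleep):
    """Poll check() until it reports that the server accepts connections.

    Returns True as soon as check() returns 0, or False once all attempts
    are used up. check and sleep are injectable so the loop can be tried
    out without a running server.
    """
    for attempt in range(attempts):
        if check() == PG_ACCEPTING:
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A metrics process would call wait_until_accepting() before its first
connection attempt and skip the run if it returns False, instead of connecting
blindly.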

> > I have noticed this behavior in the past but prior to 9.6.10 restarting the
> > server would fix the issue. And the replay always seemed to reach a point
> > past which the problem would not re-occur.
>
> You are piquing my interest here.  Did you actually see the same
> problem?  In 9.6.10 what happens is that I have tightened the consistent
> point checks and logic so that inconsistent page issues actually
> show up when they should, and so that they become reproducible and we
> can track down any rogue WAL record or inconsistent behavior.
>

Yes, I've seen this problem occasionally in the past. I think only in the
9.6 series. But before 9.6.10, if I restarted the server it would start
replaying WAL again and typically when it reached the point where it
PANICed before, instead it would report a consistent state and allow
read-only connections. Sometimes it would then PANIC again after more WAL
was replayed. But eventually it would reach a point where it seemed to be
able to replay WAL indefinitely without the issue happening.

dave
