On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <al...@hintbits.com> wrote:

>
>
> On 7. Sep 2018, at 17:57, Chris Travers <chris.trav...@adjust.com> wrote:
> >
> > Hi;
> >
> > Attached is the patch we are fully testing at Adjust.  There are a few
> > non-obvious aspects of the code in the area the patch touches.  I have
> > run make check on Linux and MacOS, and make check-world on Linux
> > (check-world fails on MacOS on all versions and all branches due to ecpg
> > failures).  Automated tests are difficult because the problem is due to a
> > rare race condition which is difficult (or possibly impossible) to create
> > automatically.  Our current approach to testing is to force the issue
> > using a debugger.  Any ideas on how to reproduce this automatically are
> > appreciated, but even on our heavily loaded systems only a very small
> > fraction of queries seems to hit this case (meaning the issue happens
> > about once a week for us).
>
>
> I did some manual testing on it, putting breakpoints at
> ResolveRecoveryConflictWithLock in the startup process on a streaming
> replica (configured with a very low max_standby_streaming_delay, i.e.
> 100ms) and at the posix_fallocate call in a normal backend on the same
> replica.  At this point I also instructed gdb not to stop upon receiving
> SIGUSR1 (handle SIGUSR1 nonstop) and resumed execution of both the
> backend and the startup process.
>
> Then I simulated a conflict by creating a rather big (3GB) table on the
> master, doing some updates on it, and then running an aggregate on the
> replica backend (i.e. 'select count(1) from test' with
> 'force_parallel_mode = true'), where I had set the breakpoint.  The
> aggregate and force_parallel_mode ensured that the query was executed as
> a parallel one, leading to allocation of the DSM segment.
>
> Once the backend reached the posix_fallocate call and was waiting at the
> breakpoint, I called 'vacuum full test' on the master, which led to a
> conflict on the replica running 'select from test' (in the vast majority
> of cases), triggering the breakpoint in ResolveRecoveryConflictWithLock
> in the startup process, since the startup process tried to cancel the
> conflicting backend.  At that point I continued execution of the startup
> process (which would loop in CancelVirtualTransaction, sending SIGUSR1 to
> the backend while the backend waited to be resumed from the breakpoint).
> Right after that, I changed the size parameter on the backend to
> something that would make posix_fallocate run for a bit longer, typically
> ten to a hundred MB.  Once the backend process was resumed, it started
> receiving SIGUSR1 from the startup process, resulting in posix_fallocate
> exiting with EINTR.
>
> With the patch applied, the posix_fallocate loop terminated right away
> (because the QueryCancelPending flag was set to true) and the backend
> went through the cleanup, reporting an ERROR about being cancelled due to
> a conflict with recovery.  Without the patch, it looped indefinitely in
> dsm_impl_posix_resize, while the startup process was looping forever,
> trying to send SIGUSR1.
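
For anyone reading along without the attachment: the change being tested
boils down to a pending-interrupt check in that retry loop.  Roughly like
the sketch below (not the attached patch verbatim; the surrounding function
is abbreviated, and checking ProcDiePending alongside QueryCancelPending is
an assumption here about how a pending die request should be handled):

    /*
     * Abbreviated sketch of dsm_impl_posix_resize()
     * (src/backend/storage/ipc/dsm_impl.c).  QueryCancelPending and
     * ProcDiePending are the usual globals from miscadmin.h.
     */
    static int
    dsm_impl_posix_resize(int fd, off_t size)
    {
        int     rc;

        /*
         * posix_fallocate() can be interrupted by SIGUSR1 (e.g. the startup
         * process resolving a recovery conflict) and then returns EINTR, so
         * retry -- but stop as soon as an interrupt is pending instead of
         * spinning forever while the signals keep coming.
         */
        do
        {
            rc = posix_fallocate(fd, 0, size);
        } while (rc == EINTR && !(QueryCancelPending || ProcDiePending));

        /*
         * posix_fallocate() reports errors via its return value rather than
         * setting errno, so copy it over for callers that look at errno.
         */
        if (rc != 0)
            errno = rc;

        return rc;
    }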
>
> One thing I’m wondering is whether we could do the same by just blocking
> SIGUSR1 for the duration of posix_fallocate?
>

Many thanks!

If we were to do that, I would say we should mask all signals we can mask
during the call.

I don't have a problem going down that road instead if people think it is
better.
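
To make that concrete, the alternative would look something like the sketch
below: plain sigprocmask() around the call, with the helper name made up for
illustration.  Whether blocking everything for the duration of a potentially
long posix_fallocate is acceptable, and how it interacts with the backend's
normal signal handling, is exactly the open question.

    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <sys/types.h>

    /*
     * Illustrative helper (hypothetical name): block every signal we are
     * allowed to block for the duration of posix_fallocate() so it cannot
     * be interrupted and return EINTR, then restore the previous mask.
     */
    static int
    fallocate_with_signals_blocked(int fd, off_t size)
    {
        sigset_t    block_all;
        sigset_t    save_mask;
        int         rc;

        sigfillset(&block_all);     /* SIGKILL/SIGSTOP cannot be blocked */
        sigprocmask(SIG_BLOCK, &block_all, &save_mask);

        rc = posix_fallocate(fd, 0, size);

        /* Any signal sent meanwhile is delivered once the mask is restored. */
        sigprocmask(SIG_SETMASK, &save_mask, NULL);

        return rc;                  /* 0 on success, otherwise an error number */
    }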

As a note, we have a patched version of PostgreSQL 10.5 ready to deploy to
a system affected by this and expect that to be done this week.


>
> Cheers,
> Oleksii Kliukin



-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
