On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <al...@hintbits.com> wrote:
> > On 7. Sep 2018, at 17:57, Chris Travers <chris.trav...@adjust.com> wrote:
> >
> > Hi;
> >
> > Attached is the patch we are fully testing at Adjust.  There are a few
> > non-obvious aspects of the code around where the patch hits.  I have run
> > make check on Linux and MacOS, and make check-world on Linux (check-world
> > fails on MacOS on all versions and all branches due to ecpg failures).
> > Automatic tests are difficult because this is a rare race condition which
> > is difficult (or possibly impossible) to create automatically.  Our
> > current approach to testing is to force the issue using a debugger.  Any
> > ideas on how to reproduce this automatically are appreciated, but even on
> > our heavily loaded systems only a very small portion of queries hit this
> > case (meaning the issue happens about once a week for us).
>
> I did some manual testing on it, putting breakpoints in
> ResolveRecoveryConflictWithLock in the startup process on a streaming
> replica (configured with a very low max_standby_streaming_delay, i.e.
> 100ms) and at the posix_fallocate call in a normal backend on the same
> replica.  At this point I also instructed gdb not to stop upon receiving
> SIGUSR1 (handle SIGUSR1 nonstop) and resumed execution on both the backend
> and the startup process.
>
> Then I simulated a conflict by creating a rather big (3GB) table on the
> master, doing some updates on it, and then running an aggregate on the
> replica backend (i.e. 'select count(1) from test' with
> 'force_parallel_mode = true') where I had set the breakpoint.  The
> aggregate and force_parallel_mode ensured that the query was executed as a
> parallel one, leading to allocation of the DSM.
>
> Once the backend reached the posix_fallocate call and was waiting on a
> breakpoint, I called 'vacuum full test' on the master, which led to a
> conflict on the replica running 'select from test' (in a vast majority of
> cases), triggering the breakpoint in ResolveRecoveryConflictWithLock in
> the startup process, since the startup process tried to cancel the
> conflicting backend.  At that point I continued execution of the startup
> process (which would loop in CancelVirtualTransaction, sending SIGUSR1 to
> the backend while the backend waited to be resumed from the breakpoint).
> Right after that, I changed the size parameter on the backend to something
> that would make posix_fallocate run for a bit longer, typically ten to a
> hundred MB.  Once the backend process was resumed, it started receiving
> SIGUSR1 from the startup process, resulting in posix_fallocate exiting
> with EINTR.
>
> With the patch applied, the posix_fallocate loop terminated right away
> (because the QueryCancelPending flag was set to true) and the backend went
> through the cleanup, showing an ERROR about cancellation due to a conflict
> with recovery.  Without the patch, it looped indefinitely in
> dsm_impl_posix_resize, while the startup process looped forever, trying to
> send SIGUSR1.
>
> One thing I'm wondering is whether we could do the same by just blocking
> SIGUSR1 for the duration of posix_fallocate?

Many thanks!

If we were to do that, I would say we should mask all signals we can mask
during the call.  I don't have a problem going down that road instead if
people think it is better.

As a note, we have a patched version of PostgreSQL 10.5 ready to deploy to
a system affected by this and expect that to be done this week.
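For anyone reading along without the patch in front of them, the approach is
roughly the following.  This is only a simplified sketch, not the patch
verbatim; the helper name resize_with_fallocate and the standalone extern
declarations of the interrupt flags are just for illustration here:

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Stand-ins for the backend's interrupt flags, which in a real backend are
 * set by the SIGUSR1 procsignal machinery.
 */
extern volatile sig_atomic_t QueryCancelPending;
extern volatile sig_atomic_t ProcDiePending;

/*
 * Reserve 'size' bytes for the DSM segment's backing file.  Retry
 * posix_fallocate on EINTR, but stop retrying once an interrupt is
 * pending, so a recovery-conflict cancellation can break out of the loop
 * instead of the backend spinning forever.
 */
static bool
resize_with_fallocate(int fd, off_t size)
{
    int     rc;

    do
    {
        rc = posix_fallocate(fd, 0, size);
    } while (rc == EINTR && !(QueryCancelPending || ProcDiePending));

    if (rc != 0)
    {
        errno = rc;     /* posix_fallocate returns the error code directly */
        return false;
    }
    return true;
}

The alternative you mention would instead block SIGUSR1 (or everything we
can block) around the posix_fallocate call, at the cost of delaying
conflict handling until the allocation finishes.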
> Cheers,
> Oleksii Kliukin


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin