On Wed, Sep 5, 2018 at 6:55 PM Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2018-09-05 18:48:44 +0200, Chris Travers wrote:
> > Will submit a patch here shortly. Thanks! Should we do for master and
> > 10? Or 9.6 too?
>
> Please don't top-post on this list. This needs to be done in all
> branches where the posix_fallocate call is present.
>
> > > Yep, maybe we should check for signals there.
> > >
> > > On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas.mu...@enterprisedb.com>
> > > wrote:
> > >
> > >> On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.trav...@adjust.com>
> > >> wrote:
> > >> > 1. The query is in a parallel index scan or similar
> > >> > 2. A process is executing a parallel plan and allocating a significant
> > >> >    chunk of memory (2MB for example) in dynamic shared memory.
> > >> > 3. The startup process goes into a loop where it sends a sigusr1,
> > >> >    sleeps 5ms, and sends another sigusr1 etc.
> > >> > 4. The sigusr1 aborts the system call, which is then retried.
> > >> > 5. Because the system call takes more than 5ms, we end up in an
> > >> >    endless loop
>
> What you're presumably encountering here is a recovery conflict.

Agreed, but the question is how to correct what is a fairly interesting
race condition.

> > > On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.trav...@adjust.com>
> > > wrote:
> > >> Do you mean this loop in dsm_impl_posix_resize() is getting
> > >> interrupted constantly and never completing?
> > >>
> > >>     /* We may get interrupted, if so just retry. */
> > >>     do
> > >>     {
> > >>         rc = posix_fallocate(fd, 0, size);
> > >>     } while (rc == EINTR);
> > >>
>
> Probably worthwhile to check that the dsm code is properly robust if
> errors are thrown from within here.

Will check that too. Thanks!

> Greetings,
>
> Andres Freund

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
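
For illustration only, here is a minimal standalone sketch of the "check for signals" idea discussed above. It is not the actual patch and not PostgreSQL code: `interrupt_pending`, `allocate_fn`, `resize_with_interrupt_check` and `fake_allocate` are hypothetical names invented for this example. The assumption is that a signal handler sets a flag, and the retry loop stops retrying on EINTR once that flag is set, so the caller can handle the interrupt instead of spinning forever.

```c
#include <errno.h>
#include <signal.h>

/* Hypothetical flag; in a real program a signal handler would set it. */
static volatile sig_atomic_t interrupt_pending = 0;

/*
 * Hypothetical callback type so the loop can be exercised without a real
 * file descriptor. Like posix_fallocate(), it returns 0 on success or an
 * errno-style code on failure. A real implementation would wrap
 * posix_fallocate(fd, 0, size).
 */
typedef int (*allocate_fn)(int fd, long size);

/*
 * Retry the allocation while it fails with EINTR, but stop retrying as
 * soon as an interrupt is pending. Returns 0 on success, EINTR if we gave
 * up because of a pending interrupt, or another errno-style code.
 */
static int
resize_with_interrupt_check(allocate_fn allocate, int fd, long size)
{
    int rc;

    do
    {
        rc = allocate(fd, size);
    } while (rc == EINTR && !interrupt_pending);

    return rc;
}

/* Test double: reports EINTR on the first two calls, then succeeds. */
static int fake_calls = 0;

static int
fake_allocate(int fd, long size)
{
    (void) fd;
    (void) size;
    fake_calls++;
    return (fake_calls <= 2) ? EINTR : 0;
}
```

With no interrupt pending, the loop behaves like the original and retries until the call succeeds. With the flag set, the first EINTR is returned to the caller, which is then responsible for cleaning up the partially set up segment and processing the interrupt.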