On 5 January 2018 at 12:16, Stephen Frost <sfr...@snowman.net> wrote:
> Greetings all, > > * Masahiko Sawada (sawada.m...@gmail.com) wrote: > > On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek > > <petr.jeli...@2ndquadrant.com> wrote: > > > On 26/12/17 11:13, Masahiko Sawada wrote: > > >> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek > > >> <petr.jeli...@2ndquadrant.com> wrote: > > >> > > >>>> > > >>>> It's not a problem on crash restart because StartupReorderBuffer > already > > >>>> does the required delete. > > >>>> > > >>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't > appear > > >>>> to have any guard to ensure that the segment files don't already > exist > > >>>> when it goes to serialize a snapshot. Adding one there would > probably be > > >>>> expensive; we don't know the last lsn of the txn yet, so to be > really > > >>>> safe we'd have to do a directory listing and scan for any > txn-$OURXID-* > > >>>> entries. > > >>>> > > >>>> So to fix, I suggest that we should do a > > >>>> slot-specific StartupReorderBuffer-style deletion when we start a > new > > >>>> decoding session on a slot, per attached patch. > > >>>> > > >>>> It might be nice to also add a hook on proc exit, so we don't have > stale > > >>>> buffers lying around until next decoding session, but I didn't want > to > > >>>> add new global state to reorderbuffer.c so I've left that for now. > > >>> > > >>> > > >>> Hmm, can't we simply call the new cleanup function in > > >>> ReplicationSlotRelease()? That's called during process exit and we > know > > >>> there about slot so no extra global variables are needed. > > >>> > > >> > > >> I guess that ReplicationSlotRelease() currently might not get called > > >> if walsender exits by proc_exit(). ReplicationSlotRelease() can is > > >> called by some functions such as WalSndErrorCleanup(), but at least in > > >> the case where wal sender exits due to failed to write data to socket, > > >> ReplicationSlotRelease() didn't get called as far as I tested. > > >> > > > > > > Are you sure about that testing? Did you test it with replication slot > > > active? ReplicationSlotRelease() is called from ProcKill() if the > > > process is using a slot and should be called for any kind of exit > except > > > for outright crash (the kind that makes postgres kill all backends). If > > > it wasn't we'd never unlock the replication slot used by the exiting > > > walsender. > > > > Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but > > ReplicationSlotRelease() got called. I agree that cleanup function > > gets called in ReplicationSlotRelease(). > > This patch is currently in 'Waiting on Author' state, but looks like it > should actually be in Needs Review (or perhaps Ready for Commmitter)..? > > Craig, would you agree with that and, if so, please update the status > accordingly. > > WoA looks correct, Petr suggested a tweak to how and when the cleanup is done that I need to adopt. I'm just about out of time before a trip, but I'll get a colleague to update it if I don't get the chance, so we can get it in and backpatched for this CF. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services