(Replies to a couple of different messages below)

On Wed, Nov 14, 2018 at 6:04 AM Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
> > There is one major problem with this patch
>
> If there's only one, you're doing great! Although admittedly this
> seems like a big one...
Make that two.

> > 2. Offload the BufferSync() work to bgwriter, so the checkpointer can
> > keep draining the pipe. Communication between checkpointer and
> > bgwriter can be fairly easily multiplexed with the pipe draining work.
>
> That sounds a little like you are proposing to go back to the way
> things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> the division of labor wouldn't be quite the same.

But is there an argument against it? The checkpointer would still be
creating checkpoints, including running fsync, but the background
writer would be, erm, writing, erm, in the background. Admittedly it
adds a whole extra rabbit hole to this rabbit hole, which was itself a
diversion from my goal of refactoring the syncing machinery to support
undo logs. But the other two ideas seem to suck and/or have
correctness issues.

On Wed, Nov 14, 2018 at 7:43 AM Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <and...@anarazel.de> wrote:
> > On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
> > > I still feel like this whole pass-the-fds-to-the-checkpointer thing is
> > > a bit of a fool's errand, though. I mean, there's no guarantee that
> > > the first FD that gets passed to the checkpointer is the first one
> > > opened, or even the first one written, is there?
> > I'm not sure I understand the danger you're seeing here. It doesn't have
> > to be the first fd opened, it has to be an fd that's older than all the
> > writes that we need to ensure made it to disk. And that ought to be
> > guaranteed by the logic? Between the FileWrite() and the
> > register_dirty_segment() (and other relevant paths) the FD cannot be
> > closed.
>
> Suppose backend A and backend B open a segment around the same time.
> Is it possible that backend A does a write before backend B, but
> backend B's copy of the fd reaches the checkpointer before backend A's
> copy? If you send the FD to the checkpointer before writing anything
> then I think it's fine, but if you write first and then send the FD to
> the checkpointer I don't see what guarantees the ordering.

I'm not sure it matters whether we send the fd before or after the
write, but we do still need some kind of global ordering of fds that
can order a given fd with respect to writes in other processes. So the
patch introduces a global shared counter that is captured immediately
after open() (including when a file is reopened by the vfd machinery).

In your example, both fds arrive in the checkpointer in some order, and
it will keep the one with the older sequence number and close the other
one. This sorting of all interesting fds is forced before the
checkpoint completes by AbsorbFsyncRequests(), which drains all
messages from the pipe until it sees a message for the next checkpoint
cycle.

Hmm, I think there is a flaw in that plan though. Messages for
different checkpoint cycles race to enter the pipe around the time the
cycle counter is bumped, so you could have a message for cycle n hiding
behind a message for cycle n + 1 and not drain enough. I'm not sure
yet, and I need to look at something else today, but I see a couple of
potential solutions to mull over: either a new shared counter
increment, or a special pipe message written after BufferSync() by the
bgwriter (if we go for idea #2; Andres had something similar in the
original prototype, but it could self-deadlock). I need to figure out
whether that would be a suitable barrier, given the buffer
interlocking.
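To make the suspected interleaving concrete, here is a toy,
single-threaded replay of it. Nothing below is from the patch; the
"pipe" is just an array and the two senders are inlined, so the bad
ordering happens deterministically:

/*
 * Toy replay of the suspected race, single-threaded for determinism.
 * Sender 1 samples the current checkpoint cycle, is "descheduled", the
 * cycle is bumped, sender 2 gets its message into the pipe first, and
 * only then does sender 1's old-cycle message arrive. A drain loop that
 * stops at the first message stamped with the next cycle never absorbs
 * sender 1's message, even though it belongs to the checkpoint that is
 * about to complete.
 */
#include <stdio.h>

typedef struct
{
    int         cycle;      /* cycle number the sender believed was current */
    const char *payload;    /* stand-in for the fd/segment being registered */
} Msg;

int
main(void)
{
    Msg     pipebuf[2];
    int     nmsgs = 0;
    int     current_cycle = 0;

    int     sender1_cycle = current_cycle;  /* sender 1 samples cycle 0 */
    current_cycle++;                        /* checkpointer bumps the cycle */
    pipebuf[nmsgs++] = (Msg) {current_cycle, "sender 2, cycle 1"};
    pipebuf[nmsgs++] = (Msg) {sender1_cycle, "sender 1, cycle 0"};

    /* Absorb until we see a message stamped with the next cycle. */
    int     absorbed = 0;
    while (absorbed < nmsgs && pipebuf[absorbed].cycle < current_cycle)
        absorbed++;

    /* Anything for the old cycle left behind means we stopped too early. */
    for (int i = absorbed; i < nmsgs; i++)
        if (pipebuf[i].cycle < current_cycle)
            printf("missed before checkpoint end: %s\n", pipebuf[i].payload);

    return 0;
}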
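And for the ordering rule itself, it boils down to something like this
standalone sketch. Again, this is only an illustration and not the
patch code: the real counter has to live in shared memory so it orders
opens across all backends, and the fd travels to the checkpointer with
its sequence number attached:

/*
 * Standalone sketch of the fd-ordering rule. The counter is
 * process-local here only to keep the example small.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static atomic_uint_fast64_t open_seq;   /* bumped once per open/reopen */

typedef struct
{
    int         fd;     /* descriptor the checkpointer may later fsync */
    uint64_t    seq;    /* captured immediately after open() succeeds */
} OrderedFd;

/*
 * Open (or reopen, in vfd terms) and capture a sequence number before
 * any write can go through this descriptor, so every write issued via
 * this fd is newer than seq.
 */
static OrderedFd
ordered_open(const char *path, int flags)
{
    OrderedFd   ofd;

    ofd.fd = open(path, flags, 0666);
    ofd.seq = atomic_fetch_add(&open_seq, 1);
    return ofd;
}

/*
 * Checkpointer side: when two descriptors for the same segment arrive
 * in either order, keep the one with the smaller sequence number and
 * close the other, so the retained fd predates every write either
 * sender performed.
 */
static OrderedFd
keep_oldest(OrderedFd a, OrderedFd b)
{
    if (a.seq <= b.seq)
    {
        close(b.fd);
        return a;
    }
    close(a.fd);
    return b;
}

int
main(void)
{
    OrderedFd   first = ordered_open("/tmp/segment-demo", O_RDWR | O_CREAT);
    OrderedFd   second = ordered_open("/tmp/segment-demo", O_RDWR);

    /* Arrival order at the checkpointer doesn't matter. */
    OrderedFd   kept = keep_oldest(second, first);

    printf("kept fd %d with seq %llu\n", kept.fd, (unsigned long long) kept.seq);
    close(kept.fd);
    unlink("/tmp/segment-demo");
    return 0;
}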
> > > It seems like if you wanted to make this work reliably, you'd need to
> > > do it the other way around: have the checkpointer (or some other
> > > background process) open all the FDs, and anybody else who wants to
> > > have one open get it from the checkpointer.
> >
> > That'd require a process context switch for each FD opened, which seems
> > clearly like a no-go?
>
> I don't know how bad that would be. But hey, no cost is too great to
> pay as a workaround for insane kernel semantics, right?

Yeah, that seems extremely expensive and unnecessary. It seems
sufficient to track the global opening order... or at least a proxy
that identifies the fd that performed the oldest write, which I believe
this patch is doing.

--
Thomas Munro
http://www.enterprisedb.com