On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <and...@anarazel.de> wrote:
> On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
> > I still feel like this whole pass-the-fds-to-the-checkpointer thing is
> > a bit of a fool's errand, though. I mean, there's no guarantee that
> > the first FD that gets passed to the checkpointer is the first one
> > opened, or even the first one written, is there?
>
> I'm not sure I understand the danger you're seeing here. It doesn't have
> to be the first fd opened, it has to be an fd that's older than all the
> writes that we need to ensure made it to disk. And that ought to be
> guaranteed by the logic? Between the FileWrite() and the
> register_dirty_segment() (and other relevant paths) the FD cannot be
> closed.

Suppose backend A and backend B open a segment around the same time.
Is it possible that backend A does a write before backend B, but
backend B's copy of the fd reaches the checkpointer before backend A's
copy? If you send the FD to the checkpointer before writing anything
then I think it's fine, but if you write first and then send the FD to
the checkpointer I don't see what guarantees the ordering.

> > It seems like if you wanted to make this work reliably, you'd need to
> > do it the other way around: have the checkpointer (or some other
> > background process) open all the FDs, and anybody else who wants to
> > have one open get it from the checkpointer.
>
> That'd require a process context switch for each FD opened, which seems
> clearly like a no-go?

I don't know how bad that would be. But hey, no cost is too great to
pay as a workaround for insane kernel semantics, right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
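
The ordering concern with backends A and B can be made concrete with a
toy model. This is only a sketch under stated assumptions, not
PostgreSQL code: simulate() and the event names are hypothetical, the
checkpointer is assumed to keep the first fd that reaches it, and an fd
is treated as "safe" only if it was opened no later than the earliest
write to the segment (the "older than all the writes" condition quoted
above).

```python
def simulate(events):
    """Replay an ordered list of (backend, action) pairs.

    action is "open", "write", or "send" (hypothetical event names).
    Returns (owner of the fd the checkpointer keeps, whether that fd
    was opened no later than the first write).
    """
    opened = {}         # backend -> logical time its fd was opened
    first_write = None  # logical time of the earliest write
    kept = None         # backend whose fd arrived at the checkpointer first
    for t, (backend, action) in enumerate(events):
        if action == "open":
            opened[backend] = t
        elif action == "write" and first_write is None:
            first_write = t
        elif action == "send" and kept is None:
            kept = backend
    if kept is None or first_write is None:
        return kept, True  # nothing kept or nothing written: vacuously safe
    return kept, opened[kept] <= first_write


# Write first, send afterwards: B's fd wins the race to the checkpointer,
# but it was opened after A's write.
write_then_send = [
    ("A", "open"), ("A", "write"),
    ("B", "open"), ("B", "write"),
    ("B", "send"), ("A", "send"),
]

# Send before writing anything: B's fd still arrives first, but no write
# can precede the first send, so the kept fd predates every write.
send_then_write = [
    ("A", "open"), ("B", "open"),
    ("B", "send"), ("A", "send"),
    ("A", "write"), ("B", "write"),
]

print(simulate(write_then_send))
print(simulate(send_then_write))
```

In this model the write-then-send schedule leaves the checkpointer
holding an fd that is newer than A's write, while the send-then-write
schedule does not, whichever backend's fd arrives first.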