On Sat, Aug 17, 2019 at 1:28 PM Andres Freund <and...@anarazel.de> wrote:
> The primary one in the context here is that if we do *not* have to lock
> the buffers all ahead of time, we can simplify the interface. We
> certainly can't lock the buffers over IO (due to buffer reclaim) as
> we're doing right now, so we'd need another phase, called by the "user"
> during undo insertion. But if we do not need to lock the buffers before
> the insertion over all starts, the inserting location doesn't have to
> care.
>
> Secondarily, all the reasoning for needing to lock all buffers ahead of
> time was imo fairly unconvincing. Following the "recipe" for WAL
> insertions is a good idea when writing a new run-of-the-mill WAL
> inserting location - but when writing a new fundamental facility, that
> already needs to modify how WAL works, then I find that much less
> convincing.
So, you seem to be talking about something here which is different than what I thought we were talking about. One question is whether we need to lock all of the buffers "ahead of time," and I think the answer to that question is probably "no." Since nobody else can be writing to those buffers, and probably also nobody can be reading them except maybe for some debugging tool, it should be fine if we enter the critical section and then lock them at the point when we write the bytes. I mean, there shouldn't be any contention, and I don't see any other problems.

The other question is whether we need to hold all of the buffer locks at the same time, and that seems a lot more problematic to me. It's hard to say exactly whether this is unsafe, because it depends on exactly what you think we're doing here, and I don't see that you've really spelled that out. The normal thing to do is to call PageSetLSN() on every page before releasing the buffer lock, and that means holding all the buffer locks until after we've called XLogInsert(). Now, you could argue that we should skip setting the page LSN because the data ahead of the insertion pointer is effectively invisible anyway, but I bet that causes problems with checksums, at least, since they rely on the page LSN being accurate to know whether to emit WAL when a buffer is written. You could argue that we could do the XLogInsert() first and only after that lock and dirty the pages one by one, but I think that might break checkpoint interlocking, since it would then be possible for the checkpoint scan to pass over a buffer that does not appear to need writing for the current checkpoint but later gets dirtied and stamped with an LSN that would have caused it to be written had it been there at the time the checkpoint scan reached it. I really can't rule out the possibility that there's some way to make something in this area work, but I don't know what it is, and I think it's a fairly risky area to go tinkering.

> Well, in the version of code that I was reviewing here, I don't think
> there is such a limit (there is a limit for buffers per undo record,
> but no limit on the number of records inserted together). I think
> Dilip added a limit since. And we have the issue of a lot of IO
> happening while holding content locks on several pages. So I don't
> think it's a straw man at all.

Hmm, what do you mean by "a lot of IO happening while holding content locks on several pages"? We might XLogInsert() but there shouldn't be any buffer I/O going on at that point. If there is, I think that should be redesigned. We should collect buffer pins first, without locking. Then lock. Then write. Or maybe lock-and-write, but only after everything's pinned. The fact of calling XLogInsert() while holding buffer locks is not great, but I don't think it's any worse here than in any other part of the system, because the undo buffers aren't going to be suffering concurrent access from any other backend, and because there shouldn't be more than a few of them.
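To make that ordering concrete, here's a rough sketch of the shape I have in mind, using the existing bufmgr/xloginsert APIs. To be clear, this is just the standard recipe from src/backend/access/transam/README applied to a batch of undo buffers, not the patch's actual code; undo_insert_sketch, undo_buffers, and RM_UNDOLOG_ID are invented names.

/*
 * Sketch only: pin everything, then lock everything, then do the page
 * modifications and the WAL insertion inside one critical section.
 */
#include "postgres.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static void
undo_insert_sketch(Buffer *undo_buffers, int nbuffers)
{
	XLogRecPtr	recptr;
	int			i;

	/*
	 * Phase 1 (done by the caller in this sketch): pin all the buffers,
	 * performing any buffer I/O needed, while holding no content locks.
	 */

	/* Phase 2: take the content locks; no I/O can be needed from here on. */
	for (i = 0; i < nbuffers; i++)
		LockBuffer(undo_buffers[i], BUFFER_LOCK_EXCLUSIVE);

	/* Phase 3: modify pages and insert WAL inside one critical section. */
	START_CRIT_SECTION();

	XLogBeginInsert();
	for (i = 0; i < nbuffers; i++)
	{
		/* ... copy the undo record bytes into this page here ... */
		MarkBufferDirty(undo_buffers[i]);
		/* assumes nbuffers stays within XLR_MAX_BLOCK_ID */
		XLogRegisterBuffer(i, undo_buffers[i], REGBUF_STANDARD);
	}
	/* XLogRegisterData() calls for the record payload would go here. */

	recptr = XLogInsert(RM_UNDOLOG_ID, 0);	/* rmgr and info invented */

	/*
	 * Stamp every page with the record's LSN before releasing any lock;
	 * this is what keeps checksums and the checkpoint interlock sound.
	 */
	for (i = 0; i < nbuffers; i++)
		PageSetLSN(BufferGetPage(undo_buffers[i]), recptr);

	END_CRIT_SECTION();

	for (i = 0; i < nbuffers; i++)
		UnlockReleaseBuffer(undo_buffers[i]);
}

The exact layering doesn't matter much to me; what matters is that all the pins are collected before any content lock is taken, and that every PageSetLSN() happens after XLogInsert() but before the corresponding lock is released.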
> > 2. The write-ahead logging protocol says that you're supposed to lock
> > all the buffers at once. See src/backend/access/transam/README. If
> > you want to go patch that file, then this patch can follow whatever
> > the locking rules in the patched version are. But until then, the
> > patch should follow *the actual rules* not some other protocol based
> > on a hand-wavy explanation in an email someplace. Otherwise, you've
> > got the same sort of undocumented disaster-waiting-to-happen that you
> > keep complaining about in other parts of this patch. We need fewer of
> > those, not more!
>
> But that's not what I'm asking for? I don't even know where you take
> from that I don't want this to be documented. I'm mainly asking for a
> comment explaining why the current behaviour is what it is. Because I
> don't think an *implicit* "normal WAL logging rules" is sufficient
> explanation, because all the locking here happens one or two layers away
> from the WAL logging site - so it's absolutely *NOT* obvious that that's
> the explanation. And I don't think any of the locking sites actually has
> comments explaining why the locks are acquired at that time (in fact,
> IIRC until the review some even only mentioned pinning, not locking).

I didn't intend to suggest that you don't want this to be documented. What I intended to suggest was that you seem to want to deviate from the documented rules, and it seems to me that we shouldn't do that unless we change the rules first, and I don't know what you think the rules should be or why those rules are safe. I think I basically agree with you about the rest of this: the API needs to be non-confusing and adequately documented, and it should avoid acquiring buffer locks until we have all the relevant pins.

> I think what primarily makes me concerned is that it's not clear to me
> what guarantees that discard is the only reason for the block to
> potentially be missing. In contrast to most other similar cases where
> WAL replay simply re-creates the objects when trying to replay an action
> affecting such an object, here we simply skip over the WAL logged
> operation. So if e.g. the entire underlying UNDO file got lost, we
> neither re-create it with valid content, nor error out. Which means we
> got to be absolutely sure that all undo files are created in a
> persistent manner, at their full size. And that there's no way that data
> could get lost, without forcing us to perform REDO up to at least the
> relevant point again.

I think the crucial question for me here is the extent to which we're cross-checking against the discard pointer. If we're like, "oh, this undo data isn't on disk any more, it must've already been discarded, let's ignore the write," that doesn't sound particularly great, because files sometimes go missing. But, if we're like, "oh, we dirtied this undo buffer but now that undo has been discarded so we don't need to write the data back to the backing file," that seems fine. The discard pointer is a fully durable, WAL-logged thing; if it's somehow wrong, we have got huge problems anyway.

> While it appears that we always WAL log the undo extension, I am not
> convinced the recovery interlock is strong enough. For one
> UndoLogDiscard() unlinks segments before WAL logging their removal -
> which means if we crash after unlink() and before the
> XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in
> practice we might be fine, because there ought to be nobody still
> referencing that UNDO - but I don't think that's actually guaranteed as
> is).

Hmm, that sounds a little worrying. I think there are two options here: unlike what we do with buffers, where we can use buffer locking etc. to make the insertion of the WAL record effectively simultaneous with the changes to the data pages, the removal of old undo files has to happen either before or after XLogInsert(). I think "after" would be better. If we do it before, then upon starting up, we have to accept that there might be undo which is not officially discarded which nevertheless no longer exists on disk; but that might also cause us to ignore real corruption. If we do it after, then we can just treat it as a non-critical cleanup that can be performed lazily and at leisure: at any time, without warning, the system may choose to remove any or all undo backing files all of whose address space is discarded. If we fail to remove files, we can just emit a WARNING and maybe retry later at some convenient point in time, or perhaps even just accept that we'll leak the file in that case.
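To spell out that "after" ordering, here's a rough sketch. Again, this is only an illustration of the ordering, not the patch's actual code: forget_undo_buffers() and unlink_undo_segments() are invented names, RM_UNDOLOG_ID is assumed, and I'm glossing over the locking around the discard pointer itself.

/*
 * Sketch only: WAL-log the discard, and make it durable, before touching
 * the filesystem. forget_undo_buffers() and unlink_undo_segments() are
 * invented; UndoLogNumber/UndoRecPtr are as in the undo patchset.
 */
#include "postgres.h"
#include "access/undolog.h"		/* from the undo patchset */
#include "access/xlog.h"		/* XLogFlush */
#include "access/xloginsert.h"

static void forget_undo_buffers(UndoLogNumber logno, UndoRecPtr discard_ptr);
static bool unlink_undo_segments(UndoLogNumber logno, UndoRecPtr discard_ptr);

static void
undolog_discard_sketch(UndoLogNumber logno, UndoRecPtr discard_ptr)
{
	XLogRecPtr	lsn;

	/*
	 * 1. Make the new discard pointer durable first.  The WAL-logged
	 * discard pointer is the authoritative statement of which undo
	 * still exists; the files are secondary.
	 */
	XLogBeginInsert();
	/* XLogRegisterData() with logno and discard_ptr would go here. */
	lsn = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
	XLogFlush(lsn);

	/*
	 * 2. Drop any buffers covering the discarded range, so we never
	 * write their contents back to a file that is about to disappear.
	 */
	forget_undo_buffers(logno, discard_ptr);

	/*
	 * 3. Only now unlink the backing segments, as non-critical cleanup.
	 * On failure we just WARN: the worst outcome is a leaked file,
	 * which a later pass (or manual cleanup) can reap.
	 */
	if (!unlink_undo_segments(logno, discard_ptr))
		ereport(WARNING,
				(errmsg("could not remove discarded undo segment files for log %d",
						logno)));
}

With that ordering, a crash at any point leaves us either with all the files still present, or with a durable discard record that explains why some of them are gone; there's no window where a file is missing without the WAL saying so.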
> Nor do I see where we're updating minRecoveryLocation when
> replaying an XLOG_UNDOLOG_DISCARD, which means that a restart during
> recovery could be stopped before the discard has been replayed, leaving
> us with wrong UNDO, but allowing write access. Seems we'd at least need
> a few more XLogFlush() calls.

That sounds like a problem, but it seems like it might be better to make sure that minRecoveryLocation gets bumped, rather than adding XLogFlush() calls.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company