On Sat, Aug 17, 2019 at 1:28 PM Andres Freund <and...@anarazel.de> wrote:
> The primary one in the context here is that if we do *not* have to lock
> the buffers all ahead of time, we can simplify the interface. We
> certainly can't lock the buffers over IO (due to buffer reclaim) as
> we're doing right now, so we'd need another phase, called by the "user"
> during undo insertion. But if we do not need to lock the buffers before
> the insertion over all starts, the inserting location doesn't have to
> care.
>
> Secondarily, all the reasoning for needing to lock all buffers ahead of
> time was imo fairly unconvincing. Following the "recipe" for WAL
> insertions is a good idea when writing a new run-of-the-mill WAL
> inserting location - but when writing a new fundamental facility, that
> already needs to modify how WAL works, then I find that much less
> convincing.
So, you seem to be talking about something here which is different than what I thought we were talking about. One question is whether we need to lock all of the buffers "ahead of time," and I think the answer to that question is probably "no." Since nobody else can be writing to those buffers, and probably also nobody can be reading them except maybe for some debugging tool, it should be fine if we enter the critical section and then lock them at the point when we write the bytes. I mean, there shouldn't be any contention, and I don't see any other problems.

The other question is whether we need to hold all of the buffer locks at the same time, and that seems a lot more problematic to me. It's hard to say exactly whether this is unsafe, because it depends on exactly what you think we're doing here, and I don't see that you've really spelled that out. The normal thing to do is to call PageSetLSN() on every page before releasing the buffer lock, and that means holding all the buffer locks until after we've called XLogInsert(). Now, you could argue that we should skip setting the page LSN because the data ahead of the insertion pointer is effectively invisible anyway, but I bet that causes problems with checksums, at least, since they rely on the page LSN being accurate to know whether to emit WAL when a buffer is written. You could argue that we could do the XLogInsert() first and only after that lock and dirty the pages one by one, but I think that might break checkpoint interlocking, since it would then be possible for the checkpoint scan to pass over a buffer that does not appear to need writing for the current checkpoint but later gets dirtied and stamped with an LSN that would have caused it to be written had it been there at the time the checkpoint scan reached it. I really can't rule out the possibility that there's some way to make something in this area work, but I don't know what it is, and I think it's a fairly risky area to go tinkering.

> Well, in the version of code that I was reviewing here, I don't think
> there is such a limit (there is a limit for buffers per undo record,
> but no limit on the number of records inserted together). I think
> Dilip added a limit since. And we have the issue of a lot of IO
> happening while holding content locks on several pages. So I don't
> think it's a straw man at all.

Hmm, what do you mean by "a lot of IO happening while holding content locks on several pages"? We might XLogInsert() but there shouldn't be any buffer I/O going on at that point. If there is, I think that should be redesigned. We should collect buffer pins first, without locking. Then lock. Then write. Or maybe lock-and-write, but only after everything's pinned. The fact of calling XLogInsert() while holding buffer locks is not great, but I don't think it's any worse here than in any other part of the system, because the undo buffers aren't going to be suffering concurrent access from any other backend, and because there shouldn't be more than a few of them.
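To make that ordering concrete, here's a rough sketch of the shape I have in mind, using the existing bufmgr/xloginsert APIs. To be clear, this is just the standard recipe from src/backend/access/transam/README applied to a batch of undo buffers, not the patch's actual code; undo_insert_sketch, undo_buffers, and RM_UNDOLOG_ID are invented names.

/*
 * Sketch only: pin everything, then lock everything, then do the page
 * modifications and the WAL insertion inside one critical section.
 */
#include "postgres.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static void
undo_insert_sketch(Buffer *undo_buffers, int nbuffers)
{
	XLogRecPtr	recptr;
	int			i;

	/*
	 * Phase 1 (done by the caller in this sketch): pin all the buffers,
	 * performing any buffer I/O needed, while holding no content locks.
	 */

	/* Phase 2: take the content locks; no I/O can be needed from here on. */
	for (i = 0; i < nbuffers; i++)
		LockBuffer(undo_buffers[i], BUFFER_LOCK_EXCLUSIVE);

	/* Phase 3: modify pages and insert WAL inside one critical section. */
	START_CRIT_SECTION();

	XLogBeginInsert();
	for (i = 0; i < nbuffers; i++)
	{
		/* ... copy the undo record bytes into this page here ... */
		MarkBufferDirty(undo_buffers[i]);
		/* assumes nbuffers stays within XLR_MAX_BLOCK_ID */
		XLogRegisterBuffer(i, undo_buffers[i], REGBUF_STANDARD);
	}
	/* XLogRegisterData() calls for the record payload would go here. */

	recptr = XLogInsert(RM_UNDOLOG_ID, 0);	/* rmgr and info invented */

	/*
	 * Stamp every page with the record's LSN before releasing any lock;
	 * this is what keeps checksums and the checkpoint interlock sound.
	 */
	for (i = 0; i < nbuffers; i++)
		PageSetLSN(BufferGetPage(undo_buffers[i]), recptr);

	END_CRIT_SECTION();

	for (i = 0; i < nbuffers; i++)
		UnlockReleaseBuffer(undo_buffers[i]);
}

The exact layering doesn't matter much to me; what matters is that all the pins are collected before any content lock is taken, and that every PageSetLSN() happens after XLogInsert() but before the corresponding lock is released.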
> > 2. The write-ahead logging protocol says that you're supposed to lock
> > all the buffers at once. See src/backend/access/transam/README. If
> > you want to go patch that file, then this patch can follow whatever
> > the locking rules in the patched version are. But until then, the
> > patch should follow *the actual rules* not some other protocol based
> > on a hand-wavy explanation in an email someplace. Otherwise, you've
> > got the same sort of undocumented disaster-waiting-to-happen that you
> > keep complaining about in other parts of this patch. We need fewer of
> > those, not more!
>
> But that's not what I'm asking for? I don't even know where you take
> from that I don't want this to be documented. I'm mainly asking for a
> comment explaining why the current behaviour is what it is. Because I
> don't think an *implicit* "normal WAL logging rules" is sufficient
> explanation, because all the locking here happens one or two layers away
> from the WAL logging site - so it's absolutely *NOT* obvious that that's
> the explanation. And I don't think any of the locking sites actually has
> comments explaining why the locks are acquired at that time (in fact,
> IIRC until the review some even only mentioned pinning, not locking).

I didn't intend to suggest that you don't want this to be documented. What I intended to suggest was that you seem to want to deviate from the documented rules, and it seems to me that we shouldn't do that unless we change the rules first, and I don't know what you think the rules should be or why those rules are safe. I think I basically agree with you about the rest of this: the API needs to be non-confusing and adequately documented, and it should avoid acquiring buffer locks until we have all the relevant pins.

> I think what primarily makes me concerned is that it's not clear to me
> what guarantees that discard is the only reason for the block to
> potentially be missing. In contrast to most other similar cases where
> WAL replay simply re-creates the objects when trying to replay an action
> affecting such an object, here we simply skip over the WAL logged
> operation. So if e.g. the entire underlying UNDO file got lost, we
> neither re-create it with valid content, nor error out. Which means we
> got to be absolutely sure that all undo files are created in a
> persistent manner, at their full size. And that there's no way that data
> could get lost, without forcing us to perform REDO up to at least the
> relevant point again.

I think the crucial question for me here is the extent to which we're cross-checking against the discard pointer. If we're like, "oh, this undo data isn't on disk any more, it must've already been discarded, let's ignore the write," that doesn't sound particularly great, because files sometimes go missing. But, if we're like, "oh, we dirtied this undo buffer but now that undo has been discarded so we don't need to write the data back to the backing file," that seems fine. The discard pointer is a fully durable, WAL-logged thing; if it's somehow wrong, we have got huge problems anyway.

> While it appears that we always WAL log the undo extension, I am not
> convinced the recovery interlock is strong enough. For one
> UndoLogDiscard() unlinks segments before WAL logging their removal -
> which means if we crash after unlink() and before the
> XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in
> practice we might be fine, because there ought to be nobody still
> referencing that UNDO - but I don't think that's actually guaranteed as
> is).

Hmm, that sounds a little worrying. I think there are two options here: unlike what we do with buffers, where we can use buffer locking etc. to make the insertion of the WAL record effectively simultaneous with the changes to the data pages, the removal of old undo files has to happen either before or after XLogInsert(). I think "after" would be better. If we do it before, then upon starting up, we have to accept that there might be undo which is not officially discarded which nevertheless no longer exists on disk; but that might also cause us to ignore real corruption. If we do it after, then we can just treat it as a non-critical cleanup that can be performed lazily and at leisure: at any time, without warning, the system may choose to remove any or all undo backing files all of whose address space is discarded. If we fail to remove files, we can just emit a WARNING and maybe retry later at some convenient point in time, or perhaps even just accept that we'll leak the file in that case.
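To spell out that "after" ordering, here's a rough sketch. Again, this is only an illustration of the ordering, not the patch's actual code: forget_undo_buffers() and unlink_undo_segments() are invented names, RM_UNDOLOG_ID is assumed, and I'm glossing over the locking around the discard pointer itself.

/*
 * Sketch only: WAL-log the discard, and make it durable, before touching
 * the filesystem. forget_undo_buffers() and unlink_undo_segments() are
 * invented; UndoLogNumber/UndoRecPtr are as in the undo patchset.
 */
#include "postgres.h"
#include "access/undolog.h"		/* from the undo patchset */
#include "access/xlog.h"		/* XLogFlush */
#include "access/xloginsert.h"

static void forget_undo_buffers(UndoLogNumber logno, UndoRecPtr discard_ptr);
static bool unlink_undo_segments(UndoLogNumber logno, UndoRecPtr discard_ptr);

static void
undolog_discard_sketch(UndoLogNumber logno, UndoRecPtr discard_ptr)
{
	XLogRecPtr	lsn;

	/*
	 * 1. Make the new discard pointer durable first.  The WAL-logged
	 * discard pointer is the authoritative statement of which undo
	 * still exists; the files are secondary.
	 */
	XLogBeginInsert();
	/* XLogRegisterData() with logno and discard_ptr would go here. */
	lsn = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
	XLogFlush(lsn);

	/*
	 * 2. Drop any buffers covering the discarded range, so we never
	 * write their contents back to a file that is about to disappear.
	 */
	forget_undo_buffers(logno, discard_ptr);

	/*
	 * 3. Only now unlink the backing segments, as non-critical cleanup.
	 * On failure we just WARN: the worst outcome is a leaked file,
	 * which a later pass (or manual cleanup) can reap.
	 */
	if (!unlink_undo_segments(logno, discard_ptr))
		ereport(WARNING,
				(errmsg("could not remove discarded undo segment files for log %d",
						logno)));
}

With that ordering, a crash at any point leaves us either with all the files still present, or with a durable discard record that explains why some of them are gone; there's no window where a file is missing without the WAL saying so.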
> Nor do I see where we're updating minRecoveryLocation when
> replaying an XLOG_UNDOLOG_DISCARD, which means that a restart during
> recovery could be stopped before the discard has been replayed, leaving
> us with wrong UNDO, but allowing write access. Seems we'd at least need
> a few more XLogFlush() calls.

That sounds like a problem, but it seems like it might be better to make sure that minRecoveryLocation gets bumped, rather than adding XLogFlush() calls.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company