Re: [HACKERS] Unlogged tables cleanup

Robert Haas Mon, 20 May 2019 06:17:42 -0700

On Tue, May 14, 2019 at 2:19 AM Andres Freund <and...@anarazel.de> wrote:
> How would this protect against the issues I mentioned where recovery
> starts from an earlier checkpoint and the basebackup could pick up a
> random set of version of the different forks?
>
> I don't like the idea of expanding the use of delayChkpt further for
> common operations, if anything we ought to try to reduce it. But I also
> don't see how it'd actually fix the issues, so perhaps that's moot.


Based on the discussion so far, I think there are a number of related
problems here:

1. A base backup might copy the main fork but not the init fork. I
think that's a problem that nobody's thought about before Andres
raised it on this thread.  I may be wrong. Anyway, if that happens,
the start-of-recovery code isn't going to do the right thing, because
it won't know that it's dealing with an unlogged relation.

2. Suppose the system reaches the end of
heapam_relation_set_new_filenode and then the system crashes.  Because
of the smgrimmedsync(), and only because of the smgrimmedsync(), the
init fork would exist at the start of recovery.  However, unlike the
heap, some of the index AMs don't actually have a call to
smgrimmedsync() in their *buildempty() functions.  If I'm not
mistaken, the ones that write pages through shared_buffers do not do
smgrimmedsync, and the ones that use private buffers do perform
smgrimmedsync.  Therefore, the ones that write pages through
shared_buffers are vulnerable to a crash right after the unlogged
table is created.  The init fork could fail to survive the crash, and
therefore the start-of-recovery code would again fail to know that
it's dealign with an unlogged relation.

3. So, why is it like that, anyway?  Why do index AMs that write pages
via shared_buffers not have smgrimmedsync?  The answer is that we did
it like that because we were worrying about a different problem -
specifically, checkpointing.  If I dirty a buffer and write a WAL
record for the changes that I made, it is guaranteed that the dirty
data will make it to disk before the next checkpoint is written; we've
got all sorts of interlocking in BufferSync, SyncOneBuffer, etc. to
make sure that happens.  But if I create a file in the filesystem and
write a WAL record for that change, none of that interlocking does
anything.  So unless I not only create that file but smgrimmedsync()
it before the next checkpoint's redo location is fixed, a crash could
make the file disappear.

For example, consider RelationCreateStorage.  What prevents the
following sequence of events?

S1: sgmropen()
S1: smgrcreate()
S1: log_smgrcreate()
-- at this point the checkpointer fixes the redo pointer, writes every
dirty buffer in the system, and completes a checkpoint
S1: commits
-- crash

On restart, I think we'll could potentially be missing the file
created by smgrcreate(), and we won't replay the WAL record generated
by log_smgrcreate() either because it's before the checkpoint.

I believe that it is one aspect of this third problem that we
previously fixed on this thread, but I think we failed to understand
how general the issue actually is.  It affects _init forks of unlogged
tables, but I think it also affects essentially every other use of
smgrcreate(), whether it's got anything to do with unlogged tables or
not. For an index AM, which has a non-empty initial representation,
the consequences are pretty limited, because the transaction can't
commit having created only an empty fork.  It's got to put some data
into either the main fork or the init fork, as the case may be.  If it
aborts, then we may leave some orphaned files behind, but if we lose
one, we just have fewer orphaned files, so nobody cares.  And if it
commits, then either everything's before the checkpoint (and the file
will have been fsync'd because we fsync'd the buffers), or
everything's after the checkpoint (so we will replay the WAL), or only
the smgrcreate() is before the checkpoint and the rest is after (in
which case XLogReadBufferExtended will do the smgrcreate over again
for us and we'll be fine).  But since the heap uses an empty initial
representation, it enjoys no similar protection.

So, IIUC, the reason why we were talking about delayCkpt before is
because it would allow us to remove the smgrimmedsync() of the
unlogged fork while still avoiding problem #3.  However it does
nothing to protect against problem #1 or #2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Unlogged tables cleanup

Reply via email to