Re: Creation of an empty table is not fsync'd at checkpoint

Thomas Munro Thu, 27 Jan 2022 19:03:08 -0800

On Fri, Jan 28, 2022 at 11:39 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Hmm, if a relation is dropped, we use plain unlink() to delete it (at
> the next checkpoint). Should we use durable_unlink() there, or otherwise
> arrange to fsync() the parent directory?


Hmmmmm.  I think the latter might be a good idea, but not because of
the file we unlink after checkpoint.  Rationale:  On commit, we
truncate all segments, and unlink segments > 0.  After checkpoint, we
unlink segment 0 (the tombstone preventing relfilenode recycling).
So, it might close some storage leak windows if we did:

1.  register_dirty_segment() on truncated segment 0, so that at
checkpoint it is fsynced.  That means that if we lose power between
the checkpoint and the unlink(), at least its size is durably zero and
not 1GB.  I don't know how to completely avoid leaking empty tombstone
files if you lose power in that window.  durable_unlink() may narrow
the window but it's still on the wrong side of a checkpoint and won't
be replayed on crash recovery.  I hope we can get rid of tombstones
completely as Dilip is attempting in [1].

2.  fsync() the containing directory as part of checkpointing, so the
unlinks of non-tombstone segments are made durable at checkpoint.
Otherwise, after checkpoint, you might lose power, come back up and
find the segments are still present in the directory, and worse, not
truncated.
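To illustrate the two steps, here is a minimal sketch, not PostgreSQL code. It assumes a POSIX system where a directory can be opened read-only and passed to fsync(), and the helper names (fsync_path, drop_segments_durably) are made up for the example. It also does everything immediately, whereas the proposal above would defer the fsyncs to the next checkpoint.

```python
import os
import tempfile


def fsync_path(path):
    """Open a file or directory read-only and fsync it (POSIX assumption)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def drop_segments_durably(directory, relfilenode, nsegments):
    """Hypothetical helper: truncate segment 0, unlink segments > 0,
    then make both the truncation and the unlinks durable."""
    seg0 = os.path.join(directory, relfilenode)

    # Step 1: truncate segment 0 (the tombstone) and fsync it, so that
    # its size is durably zero even if it is leaked later.
    fd = os.open(seg0, os.O_WRONLY)
    try:
        os.ftruncate(fd, 0)
        os.fsync(fd)
    finally:
        os.close(fd)

    # Unlink the non-tombstone segments: relfilenode.1, relfilenode.2, ...
    for segno in range(1, nsegments):
        os.unlink(os.path.join(directory, f"{relfilenode}.{segno}"))

    # Step 2: fsync the containing directory so the unlinks themselves
    # survive a power loss.
    fsync_path(directory)
```

The ordering is the point: the directory fsync comes after the unlinks, because without it the removed directory entries may exist only in memory.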

AFAICS we have no defences against zombie segments > 0 once the
tombstone is gone and the relfilenode can be recycled.  Zombie
segments could appear to be concatenated to the next user of the
relfilenode.  That problem goes away with Dilip's project.
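A toy model of the zombie problem, under assumptions of my own: a hypothetical reader (relation_size) that walks segment files in order and only moves on to the next segment while the current one is full, with a 4-byte SEG_SIZE standing in for the real 1GB. It is only meant to show how a leftover segment file could be picked up by a later user of the same relfilenode.

```python
import os
import tempfile

SEG_SIZE = 4  # toy segment size in bytes; stands in for the real 1GB


def segment_path(directory, relfilenode, segno):
    """Segment 0 is 'relfilenode'; later segments are 'relfilenode.N'."""
    name = relfilenode if segno == 0 else f"{relfilenode}.{segno}"
    return os.path.join(directory, name)


def relation_size(directory, relfilenode):
    """Hypothetical reader: sum segment sizes, moving on to the next
    segment only while the current one is exactly full."""
    total = 0
    segno = 0
    while True:
        path = segment_path(directory, relfilenode, segno)
        if not os.path.exists(path):
            return total
        size = os.path.getsize(path)
        total += size
        if size < SEG_SIZE:
            return total
        segno += 1
```

In this model a stale relfilenode.1 stays invisible while the new segment 0 is short, and is silently counted as part of the relation as soon as segment 0 becomes full.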

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BfTEcYQvc18aEbrJjPri99A09JZcEXzXjT5h57S%2BAgkw%40mail.gmail.com
