On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> I have done some more review of the undolog patch series and here are
> my comments:

Hi Amit,

Thanks!  There are a number of actionable changes in your review.  I'll
be posting a new patch set soon that will address most of your
complaints individually.  In this message I want to respond to one
topic area, because the answer is long enough already:

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }
>
> +void
> +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> +{
> + char path[MAXPGPATH];
> + FileTag tag;
> +
> + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> +
> + /* Try to send to the checkpointer, but if out of space, do it here. */
> + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
>
> The comment in allocate_empty_undo_segment indicates that the code
> wants to flush before checkpoint, but the actual function tries to
> register the request with checkpointer.  Shouldn't this be similar to
> XLogFileInit where we use pg_fsync to flush the contents immediately?
> I guess that will avoid what you have written in comments in the same
> function (we just want to make sure that the filesystem has allocated
> physical blocks for it so that non-COW filesystems will report ENOSPC
> now rather than later when space is needed).  OTOH, I think it is
> performance-wise better to postpone the work to checkpointer.  If we
> want to push this work to checkpointer, then we might need to change
> comments or alternatively, we might want to use bigger segment sizes
> to mitigate the performance effect.

In an early version I was doing the fsync() immediately.  While testing
zheap, Mithun CY reported that whenever segments couldn't be recycled
in the background, such as during a long-running transaction, he could
measure ~6% of the time spent waiting for fsync(), and throughput
increased with bigger segments (and thus fewer files to fsync()).
Passing the work off to the checkpointer is better not only because
it's done in the background but also because there is a chance that the
work can be consolidated with other sync requests, and perhaps even
avoided completely if the file is discarded and unlinked before the
next checkpoint.  I'll update the comment to make it clearer.
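To make the shape of that concrete, here is a minimal stand-alone
sketch of the "hand it to a background process, else do it here"
pattern.  It uses a toy fixed-size queue in place of the checkpointer's
real shared-memory request machinery, and every name in it
(request_queue_push(), request_segment_sync(), QUEUE_MAX) is invented
for illustration; it is not the patch's API:

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define QUEUE_MAX 16

/* Toy stand-in for the checkpointer's shared-memory request queue. */
static char sync_queue[QUEUE_MAX][1024];
static int  sync_queue_len = 0;

/* Stand-in for RegisterSyncRequest(): returns false when the queue is full. */
static bool
request_queue_push(const char *path)
{
    if (sync_queue_len == QUEUE_MAX)
        return false;
    snprintf(sync_queue[sync_queue_len++], sizeof(sync_queue[0]), "%s", path);
    return true;
}

/*
 * Ask the background process to fsync 'path' before the next
 * checkpoint.  Queued requests for the same file can be consolidated,
 * and dropped entirely if the file is unlinked first; only when the
 * queue is full do we pay for a synchronous fsync() in the foreground.
 */
static int
request_segment_sync(const char *path)
{
    int fd;

    if (request_queue_push(path))
        return 0;               /* the background process will do it later */

    /* Queue full: flush it here and now instead. */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        int save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    return close(fd);
}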
> If my above understanding is correct and the reason to fsync
> immediately is to reserve space now, then we also need to think
> whether we are always safe in postponing the work?  Basically, if this
> means that it can fail when we are actually trying to write undo, then
> it could be risky because we could be in the critical section at that
> time.  I am not sure about this point, rather it is just to discuss if
> there are any impacts of postponing the fsync work.

Here is my theory for why this arrangement is safe, and why it differs
from what we're doing with WAL segments and regular relation files.
First, let's review why those things work the way they do (as I
understand it):

1.  WAL's use of fdatasync():  The reason we fill and then fsync()
newly created WAL files up front is that we want to make sure the
blocks are definitely on disk.  The comment doesn't spell out exactly
why the author considered later fdatasync() calls to be insufficient,
but they were: it was many years after commit 33cc5d8a4d0d that Linux
ext3/4 filesystems began flushing file size changes to disk in
fdatasync()[1][2].  I don't know if the original behaviour was
intentional or not.  So, if you didn't use the bigger fsync() hammer on
that OS, you might lose the end of a recently extended file in a power
failure even though fdatasync() had returned success.

By my reading of POSIX, that shouldn't be necessary on a conforming
implementation of fdatasync(), and that was fixed years ago in Linux.
I'm not proposing any changes there, and I'm not proposing to take
advantage of that in the new code.  I'm pointing out that we don't have
to worry about that for these undo segments, because they are already
flushed with fsync(), not fdatasync().

(To understand POSIX's descriptions of fsync() and fdatasync() you have
to find the meanings of "Synchronized I/O Data Integrity Completion"
and "Synchronized I/O File Integrity Completion" elsewhere in the spec.
TL;DR: fdatasync() is only allowed to skip flushing attributes like the
modified time; it's not allowed to skip flushing a file size change,
since that would interfere with retrieving the data.)
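Before moving on to points 2 and 3, here is a hedged sketch of the
zero-fill-then-sync pattern described above (point 2 below relies on
the same up-front writes to surface ENOSPC early).  It is in the spirit
of XLogFileInit(), but the names, sizes, and error handling are all
invented for illustration; real code has to deal with partial writes
and retries more carefully:

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)     /* illustrative segment size */
#define BLCKSZ   8192

/*
 * Create a segment file and write zeroes through to the end.  On most
 * non-COW filesystems the writes force block allocation, so ENOSPC
 * surfaces here, where it can be reported as an ordinary error, rather
 * than later during checkpoint write-back.
 */
static int
create_zero_filled_segment(const char *path, bool sync_immediately)
{
    char  block[BLCKSZ];
    off_t written;
    int   fd;

    memset(block, 0, sizeof(block));
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    for (written = 0; written < SEG_SIZE; written += BLCKSZ)
    {
        if (write(fd, block, BLCKSZ) != BLCKSZ)
        {
            /* A short write here most likely means ENOSPC. */
            int save_errno = errno;

            close(fd);
            unlink(path);
            errno = save_errno ? save_errno : ENOSPC;
            return -1;
        }
    }

    /*
     * WAL fsyncs at this point so that later fdatasync() calls never
     * depend on file-size metadata being durable.  For undo segments
     * the sync is instead handed to the checkpointer, as discussed
     * above.
     */
    if (sync_immediately && fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}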
2.  Time of reservation:  Although they don't call fsync(), regular
relations and these new undo files still write zeroes up front
(respectively, for a new block and for a new segment).  One reason for
that is that most popular filesystems reserve space at write time, so
you'll get ENOSPC when trying to allocate undo space, and that's a
non-fatal ERROR.  If we deferred until writing back buffer contents, we
might get file holes, and deferred ENOSPC is much harder to report to
users and for users to deal with.  You can still get an ENOSPC at
checkpoint write-back time on COW systems like ZFS, and there is not
much I can do about that.  You can still get ENOSPC at checkpoint
fsync() time on NFS, and there's not much we can do about that for now
except panic (without direct IO, or other big changes).

3.  Separate size tracking:  Another reason that regular relations
write out zeroes at relation-extension time is that that's the only
place the size of a relation is recorded.  PostgreSQL doesn't track the
number of blocks itself, so we can't defer file extension until
write-back from our buffer pool.  Undo doesn't rely on the filesystem
to track the amount of undo data: it has its own crash-safe tracking of
the discard and end pointers, which can be used to know which segment
files exist and what ranges contain data.  That allows us to work in
whole files at a time, like WAL logs, even though we still have
checkpoint-based flushing rules.

To summarise, we write zeroes so that we can report ENOSPC errors as
early as possible, but we defer and consolidate fsync() calls because
the files' contents and names don't actually have to survive power loss
until a checkpoint says they existed at that point in the WAL stream.

Does this make sense?

BTW we could probably use posix_fallocate() instead of writing zeroes;
I think Andres mentioned that recently.  I see also that someone tried
that for WAL and it got reverted back in 2013 (commit
b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down the
discussion).

[1] https://lkml.org/lkml/2012/9/3/83
[2] https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5

--
Thomas Munro
https://enterprisedb.com