On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> I have done some more review of the undolog patch series and here are
> my comments:

Hi Amit,

Thanks!  There are a number of actionable changes in your review.  I'll
be posting a new patch set soon that will address most of your
complaints individually.  In this message I want to respond to one
topic area, because the answer is long enough already:

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }
>
> +void
> +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> +{
> + char path[MAXPGPATH];
> + FileTag tag;
> +
> + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> +
> + /* Try to send to the checkpointer, but if out of space, do it here. */
> + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
>
> The comment in allocate_empty_undo_segment indicates that the code
> wants to flush before checkpoint, but the actual function tries to
> register the request with checkpointer.  Shouldn't this be similar to
> XLogFileInit where we use pg_fsync to flush the contents immediately?
> I guess that will avoid what you have written in comments in the same
> function (we just want to make sure that the filesystem has allocated
> physical blocks for it so that non-COW filesystems will report ENOSPC
> now rather than later when space is needed).  OTOH, I think it is
> performance-wise better to postpone the work to checkpointer.  If we
> want to push this work to checkpointer, then we might need to change
> comments or alternatively, we might want to use bigger segment sizes
> to mitigate the performance effect.

In an early version I was doing the fsync() immediately.  While testing
zheap, Mithun CY reported that whenever segments couldn't be recycled
in the background, such as during a long-running transaction, he could
measure ~6% of the time spent waiting for fsync(), and throughput
increased with bigger segments (and thus fewer files to fsync()).
Passing the work off to the checkpointer is better not only because
it's done in the background but also because there is a chance that the
work can be consolidated with other sync requests, and perhaps even
avoided completely if the file is discarded and unlinked before the
next checkpoint.  I'll update the comment to make it clearer.
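To make the shape of that concrete, here is a minimal stand-alone
sketch of the "hand it to a background process, else do it here"
pattern.  It uses a toy fixed-size queue in place of the checkpointer's
real shared-memory request machinery, and every name in it
(request_queue_push(), request_segment_sync(), QUEUE_MAX) is invented
for illustration; it is not the patch's API:

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define QUEUE_MAX 16

/* Toy stand-in for the checkpointer's shared-memory request queue. */
static char sync_queue[QUEUE_MAX][1024];
static int  sync_queue_len = 0;

/* Stand-in for RegisterSyncRequest(): returns false when the queue is full. */
static bool
request_queue_push(const char *path)
{
    if (sync_queue_len == QUEUE_MAX)
        return false;
    snprintf(sync_queue[sync_queue_len++], sizeof(sync_queue[0]), "%s", path);
    return true;
}

/*
 * Ask the background process to fsync 'path' before the next
 * checkpoint.  Queued requests for the same file can be consolidated,
 * and dropped entirely if the file is unlinked first; only when the
 * queue is full do we pay for a synchronous fsync() in the foreground.
 */
static int
request_segment_sync(const char *path)
{
    int fd;

    if (request_queue_push(path))
        return 0;               /* the background process will do it later */

    /* Queue full: flush it here and now instead. */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        int save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    return close(fd);
}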
> If my above understanding is correct and the reason to fsync
> immediately is to reserve space now, then we also need to think
> whether we are always safe in postponing the work?  Basically, if this
> means that it can fail when we are actually trying to write undo, then
> it could be risky because we could be in the critical section at that
> time.  I am not sure about this point, rather it is just to discuss if
> there are any impacts of postponing the fsync work.

Here is my theory for why this arrangement is safe, and why it differs
from what we're doing with WAL segments and regular relation files.
First, let's review why those things work the way they do (as I
understand it):

1.  WAL's use of fdatasync():  The reason we fill and then fsync()
newly created WAL files up front is that we want to make sure the
blocks are definitely on disk.  The comment doesn't spell out exactly
why the author considered later fdatasync() calls to be insufficient,
but they were: it was many years after commit 33cc5d8a4d0d that Linux
ext3/4 filesystems began flushing file size changes to disk in
fdatasync()[1][2].  I don't know if the original behaviour was
intentional or not.  So, if you didn't use the bigger fsync() hammer on
that OS, you might lose the end of a recently extended file in a power
failure even though fdatasync() had returned success.

By my reading of POSIX, that shouldn't be necessary on a conforming
implementation of fdatasync(), and that was fixed years ago in Linux.
I'm not proposing any changes there, and I'm not proposing to take
advantage of that in the new code.  I'm pointing out that we don't have
to worry about that for these undo segments, because they are already
flushed with fsync(), not fdatasync().

(To understand POSIX's descriptions of fsync() and fdatasync() you have
to find the meanings of "Synchronized I/O Data Integrity Completion"
and "Synchronized I/O File Integrity Completion" elsewhere in the spec.
TL;DR: fdatasync() is only allowed to skip flushing attributes like the
modified time; it's not allowed to skip flushing a file size change,
since that would interfere with retrieving the data.)
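Before moving on to points 2 and 3, here is a hedged sketch of the
zero-fill-then-sync pattern described above (point 2 below relies on
the same up-front writes to surface ENOSPC early).  It is in the spirit
of XLogFileInit(), but the names, sizes, and error handling are all
invented for illustration; real code has to deal with partial writes
and retries more carefully:

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)     /* illustrative segment size */
#define BLCKSZ   8192

/*
 * Create a segment file and write zeroes through to the end.  On most
 * non-COW filesystems the writes force block allocation, so ENOSPC
 * surfaces here, where it can be reported as an ordinary error, rather
 * than later during checkpoint write-back.
 */
static int
create_zero_filled_segment(const char *path, bool sync_immediately)
{
    char  block[BLCKSZ];
    off_t written;
    int   fd;

    memset(block, 0, sizeof(block));
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    for (written = 0; written < SEG_SIZE; written += BLCKSZ)
    {
        if (write(fd, block, BLCKSZ) != BLCKSZ)
        {
            /* A short write here most likely means ENOSPC. */
            int save_errno = errno;

            close(fd);
            unlink(path);
            errno = save_errno ? save_errno : ENOSPC;
            return -1;
        }
    }

    /*
     * WAL fsyncs at this point so that later fdatasync() calls never
     * depend on file-size metadata being durable.  For undo segments
     * the sync is instead handed to the checkpointer, as discussed
     * above.
     */
    if (sync_immediately && fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}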
2.  Time of reservation:  Although they don't call fsync(), regular
relations and these new undo files still write zeroes up front
(respectively, for a new block and for a new segment).  One reason for
that is that most popular filesystems reserve space at write time, so
you'll get ENOSPC when trying to allocate undo space, and that's a
non-fatal ERROR.  If we deferred until writing back buffer contents, we
might get file holes, and deferred ENOSPC is much harder to report to
users and for users to deal with.  You can still get an ENOSPC at
checkpoint write-back time on COW systems like ZFS, and there is not
much I can do about that.  You can still get ENOSPC at checkpoint
fsync() time on NFS, and there's not much we can do about that for now
except panic (without direct IO, or other big changes).

3.  Separate size tracking:  Another reason that regular relations
write out zeroes at relation-extension time is that that's the only
place the size of a relation is recorded.  PostgreSQL doesn't track the
number of blocks itself, so we can't defer file extension until
write-back from our buffer pool.  Undo doesn't rely on the filesystem
to track the amount of undo data: it has its own crash-safe tracking of
the discard and end pointers, which can be used to know which segment
files exist and what ranges contain data.  That allows us to work in
whole files at a time, like WAL logs, even though we still have
checkpoint-based flushing rules.

To summarise, we write zeroes so that we can report ENOSPC errors as
early as possible, but we defer and consolidate fsync() calls because
the files' contents and names don't actually have to survive power loss
until a checkpoint says they existed at that point in the WAL stream.

Does this make sense?

BTW we could probably use posix_fallocate() instead of writing zeroes;
I think Andres mentioned that recently.  I see also that someone tried
that for WAL and it got reverted back in 2013 (commit
b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down the
discussion).

[1] https://lkml.org/lkml/2012/9/3/83
[2] https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5

--
Thomas Munro
https://enterprisedb.com