On Thu, Jul 25, 2019 at 11:22 AM Thomas Munro <thomas.mu...@gmail.com> wrote: > > On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > I have done some more review of undolog patch series and here are my > > comments: > > Hi Amit, > > Thanks! There a number of actionable changes in your review. I'll be > posting a new patch set soon that will address most of your complaints > individually. In this message want to respond to one topic area, > because the answer is long enough already: > > > 2. > > allocate_empty_undo_segment() > > { > > .. > > .. > > /* Flush the contents of the file to disk before the next checkpoint. */ > > + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace); > > .. > > } > > > > +void > > +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid > > tablespace) > > +{ > > + char path[MAXPGPATH]; > > + FileTag tag; > > + > > + INIT_UNDOFILETAG(tag, logno, tablespace, segno); > > + > > + /* Try to send to the checkpointer, but if out of space, do it here. */ > > + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false)) > > > > > > The comment in allocate_empty_undo_segment indicates that the code > > wants to flush before checkpoint, but the actual function tries to > > register the request with checkpointer. Shouldn't this be similar to > > XLogFileInit where we use pg_fsync to flush the contents immediately? > > I guess that will avoid what you have written in comments in the same > > function (we just want to make sure that the filesystem has allocated > > physical blocks for it so that non-COW filesystems will report ENOSPC > > now rather than later when space is needed). OTOH, I think it is > > performance-wise better to postpone the work to checkpointer. If we > > want to push this work to checkpointer, then we might need to change > > comments or alternatively, we might want to use bigger segment sizes > > to mitigate the performance effect. > > In an early version I was doing the fsync() immediately. While > testing zheap, Mithun CY reported that whenever segments couldn't be > recycled in the background, such as during a bit long-running > transaction, he could measure ~6% of the time time spent waiting for > fsync(), and throughput increased with bigger segments (and thus fewer > files to fsync()). Passing the work off to the checkpointer is better > not only because it's done in the background but also because there is > a chance that the work can be consolidated with other sync requests, > and perhaps even avoided completely if the file is discarded and > unlinked before the next checkpoint. > > I'll update the comment to make it clearer. >
Okay, that makes sense. > > If my above understanding is correct and the reason to fsync > > immediately is to reserve space now, then we also need to think > > whether we are always safe in postponing the work? Basically, if this > > means that it can fail when we are actually trying to write undo, then > > it could be risky because we could be in the critical section at that > > time. I am not sure about this point, rather it is just to discuss if > > there are any impacts of postponing the fsync work. > > Here is my theory for why this arrangement is safe, and why it differs > from what we're doing with WAL segments and regular relation files. > First, let's review why those things work the way they do (as I > understand it): > > 1. WAL's use of fdatasync(): > I was referring to function XLogFileInit which doesn't appear to be directly using fdatasync. > > 3. Separate size tracking: Another reason that regular relations > write out zeroes at relation-extension time is that that's the only .. > > To summarise, we write zeroes so we can report ENOSPC errors as early > as possible, but we defer and consolidate fsync() calls because the > files' contents and names don't actually have to survive power loss > until a checkpoint says they existed at that point in the WAL stream. > > Does this make sense? > Yes, this makes sense. However, I wonder if we need to do some special handling for ENOSPC while writing to file in this function (allocate_empty_undo_segment). Basically, unlink/remove the file if fail to write because of disk full, something similar to what we do in XLogFileInit. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com