On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro <thomas.mu...@gmail.com> wrote: > On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro <thomas.mu...@gmail.com> wrote: > > For the record, Andres Freund mentioned a few problems with this > > off-list and suggested we consider calling Linux syncfs() for each top > > level directory that could potentially be on a different filesystem. > > That seems like a nice idea to look into. > > Here's an experimental patch to try that. One problem is that before > Linux 5.8, syncfs() doesn't report failures[1]. I'm not sure what to > think about that; in the current coding we just log them and carry on > anyway, but any theoretical problems that causes for BLK_DONE should > be moot anyway because of FPIs which result in more writes and syncs. > Another is that it may affect other files that aren't under pgdata as > collateral damage, but that seems acceptable. It also makes me a bit > sad that this wouldn't help other OSes.
... and for comparison/discussion, here is an alternative patch that figures out precisely which files need to be fsync'd using information in the WAL. On a system with full_page_writes=on, this effectively means that we don't have to do anything at all for relation files, as described in more detail in the commit message. You still need to fsync the WAL files to make sure you can't replay records from the log that were written but not yet fdatasync'd, addressed in the patch. I'm not yet sure which other kinds of special files might need special treatment. Some thoughts: 1. Both patches avoid the need to open many files. With 1 million tables this can take over a minute even on a fast system with warm caches and/or fast local storage, before replay begins, and for a cold system with high latency storage it can be a serious problem. 2. The syncfs() one is comparatively simple, but it only works on Linux. If you used sync() (or sync(); sync()) instead, you'd be relying on non-POSIX behaviour, because POSIX says it doesn't wait for completion and indeed many non-Linux systems really are like that. 3. Though we know of kernel/filesystem pairs that can mark buffers clean while retaining the dirty contents (which would cause corruption with the current code in master, or either of these patches), fortunately those systems can't possibly run with full_page_writes=off so such problems are handled the same way torn pages are fixed. 4. Perhaps you could set a flag in the buffer to say 'needs sync' as a way to avoid repeatedly requesting syncs for the same page, but it might not be worth the hassle involved. Some other considerations that have been mentioned to me by colleagues I talked to about this: 1. The ResetUnloggedRelations() code still has to visit all relation files looking for init forks after a crash. But that turns out to be OK, it only reads directory entries in a straight line. It doesn't stat() or open() files with non-matching names, so unless you have very many unlogged tables, the problem is already avoided. (It also calls fsync() on the result, which could perhaps be replaced with a deferred request too, not sure, for another day.) 2. There may be some more directories that need special fsync() treatment. SLRUs are already covered (either handled by checkpoint or they hold ephemeral data), and I think pg_tblspc changes will be redone. pg_logical? I am not sure.
From 5becb011e7657f6aa46d9d0a834d3b176dbb589c Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sat, 26 Sep 2020 23:07:48 +1200 Subject: [PATCH] Optimize fsync() calls in crash recovery. Previously, we'd open every file under pgdata and call fsync(), before entering crash recovery. That's because it's important to avoid relying on dirty data in the kernel's cache when deciding that a change has already been applied. Instead of doing work in proportion to the number of relations (which could be extremely slow with a cold VFS cache and high latency storage), switch to doing work in proportion to the number of relations that we opted not to dirty (in PostgreSQL's own buffer pool) based on page LSNs. Typically, this means doing no work at all, because most users run with full_page_writes=on, meaning that we'll restore the earliest version of the page and then replay every WAL record on top of it (that is, we'll never skip a record, and always dirty, write back and eventually sync the page). With full_page_writes=off, we get to skip dirtying pages that have higher LSNs ("BLK_DONE"), but now in that case we'll still queue up a request to issue an fsync() call in the next checkpoint whenever that happens. Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com --- src/backend/access/transam/xlog.c | 41 ++++---- src/backend/access/transam/xlogutils.c | 9 ++ src/backend/storage/buffer/bufmgr.c | 20 ++++ src/backend/storage/file/fd.c | 137 ------------------------- src/backend/storage/smgr/md.c | 15 +++ src/backend/storage/smgr/smgr.c | 15 +++ src/include/storage/bufmgr.h | 1 + src/include/storage/md.h | 1 + src/include/storage/smgr.h | 2 + 9 files changed, 83 insertions(+), 158 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 52a67b1170..8701197fc1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -917,7 +917,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr); static void XLogFileClose(void); static void PreallocXlogFiles(XLogRecPtr endptr); -static void RemoveTempXlogFiles(void); +static void ScanXlogFilesAfterCrash(void); static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr); static void RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr); static void UpdateLastRemovedPtr(char *filename); @@ -3987,34 +3987,34 @@ UpdateLastRemovedPtr(char *filename) } /* - * Remove all temporary log files in pg_wal + * Remove all temporary log files in pg_wal, and make sure that all remaining + * files are down on disk before we replay anything in them. * * This is called at the beginning of recovery after a previous crash, * at a point where no other processes write fresh WAL data. */ static void -RemoveTempXlogFiles(void) +ScanXlogFilesAfterCrash(void) { DIR *xldir; struct dirent *xlde; - elog(DEBUG2, "removing all temporary WAL segments"); - xldir = AllocateDir(XLOGDIR); while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) { char path[MAXPGPATH]; - if (strncmp(xlde->d_name, "xlogtemp.", 9) != 0) - continue; - snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name); - unlink(path); - elog(DEBUG2, "removed temporary WAL segment \"%s\"", path); + + if (strncmp(xlde->d_name, "xlogtemp.", 9) == 0) + unlink(path); + else if (IsXLogFileName(xlde->d_name)) + fsync_fname(path, false); } FreeDir(xldir); } + /* * Recycle or remove all log files older or equal to passed segno. * @@ -6407,25 +6407,24 @@ StartupXLOG(void) ValidateXLOGDirectoryStructure(); /*---------- - * If we previously crashed, perform a couple of actions: + * If we previously crashed: * * - The pg_wal directory may still include some temporary WAL segments * used when creating a new segment, so perform some clean up to not - * bloat this path. This is done first as there is no point to sync - * this temporary data. + * bloat this path. + * + * - It's possible that some WAL data had been written but not yet synced. + * Therefore we have to sync these files before replaying the records + * they contain. * * - There might be data which we had written, intending to fsync it, but - * which we had not actually fsync'd yet. Therefore, a power failure in - * the near future might cause earlier unflushed writes to be lost, even - * though more recent data written to disk from here on would be - * persisted. To avoid that, fsync the entire data directory. + * which we had not actually fsync'd yet. XLogReadBufferForRedo() will + * compute the set of files that may need fsyncing at the next + * checkpoint, during recovery. */ if (ControlFile->state != DB_SHUTDOWNED && ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) - { - RemoveTempXlogFiles(); - SyncDataDirectory(); - } + ScanXlogFilesAfterCrash(); /* * Initialize on the assumption we want to recover to the latest timeline diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 7e915bcadf..04ee31114a 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -402,7 +402,16 @@ XLogReadBufferForRedoExtended(XLogReaderState *record, LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE); } if (lsn <= PageGetLSN(BufferGetPage(*buf))) + { + /* + * The page is from the future. We must be in crash recovery. + * We don't need to redo the record, but we do need to make + * sure that the image as we have seen it is durably stored on + * disk as part of the next checkpoint. + */ + RequestBufferSync(*buf); return BLK_DONE; + } else return BLK_NEEDS_REDO; } diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index e549fa1d30..a7fcf0a299 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1515,6 +1515,26 @@ MarkBufferDirty(Buffer buffer) } } +/* + * Request that the file containing a buffer be fsynced at the next checkpoint. + * This is only used in crash recovery, to make it safe to skip applying WAL on + * the basis that the page already has a change. We want to make sure that the + * data we read from the kernel matches what's on disk at the next checkpoint. + */ +void +RequestBufferSync(Buffer buffer) +{ + SMgrRelation reln; + BufferDesc *bufHdr; + + Assert(BufferIsPinned(buffer)); + Assert(!BufferIsLocal(buffer)); + + bufHdr = GetBufferDescriptor(buffer - 1); + reln = smgropen(bufHdr->tag.rnode, InvalidBackendId); + smgrsync(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum); +} + /* * ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer() * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index bd72a87ee3..bb6effca58 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -330,10 +330,6 @@ static void walkdir(const char *path, void (*action) (const char *fname, bool isdir, int elevel), bool process_symlinks, int elevel); -#ifdef PG_FLUSH_DATA_WORKS -static void pre_sync_fname(const char *fname, bool isdir, int elevel); -#endif -static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); static int fsync_parent_path(const char *fname, int elevel); @@ -3232,86 +3228,6 @@ looks_like_temp_rel_name(const char *name) } -/* - * Issue fsync recursively on PGDATA and all its contents. - * - * We fsync regular files and directories wherever they are, but we - * follow symlinks only for pg_wal and immediately under pg_tblspc. - * Other symlinks are presumed to point at files we're not responsible - * for fsyncing, and might not have privileges to write at all. - * - * Errors are logged but not considered fatal; that's because this is used - * only during database startup, to deal with the possibility that there are - * issued-but-unsynced writes pending against the data directory. We want to - * ensure that such writes reach disk before anything that's done in the new - * run. However, aborting on error would result in failure to start for - * harmless cases such as read-only files in the data directory, and that's - * not good either. - * - * Note that if we previously crashed due to a PANIC on fsync(), we'll be - * rewriting all changes again during recovery. - * - * Note we assume we're chdir'd into PGDATA to begin with. - */ -void -SyncDataDirectory(void) -{ - bool xlog_is_symlink; - - /* We can skip this whole thing if fsync is disabled. */ - if (!enableFsync) - return; - - /* - * If pg_wal is a symlink, we'll need to recurse into it separately, - * because the first walkdir below will ignore it. - */ - xlog_is_symlink = false; - -#ifndef WIN32 - { - struct stat st; - - if (lstat("pg_wal", &st) < 0) - ereport(LOG, - (errcode_for_file_access(), - errmsg("could not stat file \"%s\": %m", - "pg_wal"))); - else if (S_ISLNK(st.st_mode)) - xlog_is_symlink = true; - } -#else - if (pgwin32_is_junction("pg_wal")) - xlog_is_symlink = true; -#endif - - /* - * If possible, hint to the kernel that we're soon going to fsync the data - * directory and its contents. Errors in this step are even less - * interesting than normal, so log them only at DEBUG1. - */ -#ifdef PG_FLUSH_DATA_WORKS - walkdir(".", pre_sync_fname, false, DEBUG1); - if (xlog_is_symlink) - walkdir("pg_wal", pre_sync_fname, false, DEBUG1); - walkdir("pg_tblspc", pre_sync_fname, true, DEBUG1); -#endif - - /* - * Now we do the fsync()s in the same order. - * - * The main call ignores symlinks, so in addition to specially processing - * pg_wal if it's a symlink, pg_tblspc has to be visited separately with - * process_symlinks = true. Note that if there are any plain directories - * in pg_tblspc, they'll get fsync'd twice. That's not an expected case - * so we don't worry about optimizing it. - */ - walkdir(".", datadir_fsync_fname, false, LOG); - if (xlog_is_symlink) - walkdir("pg_wal", datadir_fsync_fname, false, LOG); - walkdir("pg_tblspc", datadir_fsync_fname, true, LOG); -} - /* * walkdir: recursively walk a directory, applying the action to each * regular file and directory (including the named directory itself). @@ -3382,59 +3298,6 @@ walkdir(const char *path, } -/* - * Hint to the OS that it should get ready to fsync() this file. - * - * Ignores errors trying to open unreadable files, and logs other errors at a - * caller-specified level. - */ -#ifdef PG_FLUSH_DATA_WORKS - -static void -pre_sync_fname(const char *fname, bool isdir, int elevel) -{ - int fd; - - /* Don't try to flush directories, it'll likely just fail */ - if (isdir) - return; - - fd = OpenTransientFile(fname, O_RDONLY | PG_BINARY); - - if (fd < 0) - { - if (errno == EACCES) - return; - ereport(elevel, - (errcode_for_file_access(), - errmsg("could not open file \"%s\": %m", fname))); - return; - } - - /* - * pg_flush_data() ignores errors, which is ok because this is only a - * hint. - */ - pg_flush_data(fd, 0, 0); - - if (CloseTransientFile(fd) != 0) - ereport(elevel, - (errcode_for_file_access(), - errmsg("could not close file \"%s\": %m", fname))); -} - -#endif /* PG_FLUSH_DATA_WORKS */ - -static void -datadir_fsync_fname(const char *fname, bool isdir, int elevel) -{ - /* - * We want to silently ignoring errors about unreadable files. Pass that - * desire on to fsync_fname_ext(). - */ - fsync_fname_ext(fname, isdir, true, elevel); -} - static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel) { diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 1d4aa482cc..e005c85022 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -934,6 +934,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum) } } +/* + * mdsync() -- Schedule a sync at the next checkpoint. + * + * This is useful in crash recovery, to ensure that data we've read from the + * kernel matches what is on disk before a checkpoint. + */ +void +mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) +{ + MdfdVec *seg; + + seg = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL); + register_dirty_segment(reln, forknum, seg); +} + /* * register_dirty_segment() -- Mark a relation segment as needing fsync * diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index dcc09df0c7..32c871cac8 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -62,6 +62,8 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_sync) (SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum); } f_smgr; static const f_smgr smgrsw[] = { @@ -79,6 +81,7 @@ static const f_smgr smgrsw[] = { .smgr_read = mdread, .smgr_write = mdwrite, .smgr_writeback = mdwriteback, + .smgr_sync = mdsync, .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, @@ -541,6 +544,18 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, nblocks); } + +/* + * smgrsync() -- Request that the file containing a block be flushed at the + * next checkpoint. + */ +void +smgrsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) +{ + smgrsw[reln->smgr_which].smgr_sync(reln, forknum, blocknum); +} + + /* * smgrnblocks() -- Calculate the number of blocks in the * supplied relation. diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index ee91b8fa26..67e5167ff8 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -186,6 +186,7 @@ extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode, extern void ReleaseBuffer(Buffer buffer); extern void UnlockReleaseBuffer(Buffer buffer); extern void MarkBufferDirty(Buffer buffer); +extern void RequestBufferSync(Buffer buffer); extern void IncrBufferRefCount(Buffer buffer); extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation, BlockNumber blockNum); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 07fd1bb7d0..7c5846ddcd 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -40,6 +40,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum); extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index f28a842401..bbad2ce09c 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -102,6 +102,8 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum); extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nblocks); extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void smgrsync(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum); extern void AtEOXact_SMgr(void); #endif /* SMGR_H */ -- 2.20.1