Hi Andres, Robert. I've attached four patches here.
1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to earlier in StartupXLOG. 2. Inside that function, issue fsync()s for the main forks we create by copying the _init fork. 3. A small fixup to add a const to "typedef char *FileName", because the earlier patch gave me warnings about discarding const-ness. This is consistent with many other functions in fd.c that take const char *. 4. Issue an fsync() on the data directory at startup if we need to perform crash recovery. I created some unlogged relations, performed an immediate shutdown, and then straced postgres as it performed crash recovery. The changes in (2) do indeed fsync the files we create by copying *_init, and don't fsync anything else (at least not after I fixed a bug ;-). I did not do anything about the END_OF_RECOVERY checkpoint setting the ControlFile->state to DB_SHUTDOWNED, because it wasn't clear to me if there was any agreement on what to do. I would be happy to submit a followup patch if we reach some decision about it. Is this what you had in mind? -- Abhijit
>From d8726c06cdf11674661eac1d091cf7edd05c2a0c Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Wed, 24 Sep 2014 14:43:18 +0530 Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier We need to call this after recovery, but not after the END_OF_RECOVERY checkpoint is written. If we crash after that checkpoint, for example because of ENOSPC in PreallocXlogFiles, the checkpoint has already set the ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery again at startup. Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) in the first cleanup, this leaves the database with a bunch of _init forks for unlogged relations, but no main forks, leading to scary errors. See thread from 20140912112246.ga4...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 46eef5f..218f7fb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6863,6 +6863,14 @@ StartupXLOG(void) ShutdownWalRcv(); /* + * Reset initial contents of unlogged relations. This has to be done + * AFTER recovery is complete so that any unlogged relations created + * during recovery also get picked up. + */ + if (InRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_INIT); + + /* * We don't need the latch anymore. It's not strictly necessary to disown * it, but let's do it for the sake of tidiness. */ @@ -7130,14 +7138,6 @@ StartupXLOG(void) PreallocXlogFiles(EndOfLog); /* - * Reset initial contents of unlogged relations. This has to be done - * AFTER recovery is complete so that any unlogged relations created - * during recovery also get picked up. - */ - if (InRecovery) - ResetUnloggedRelations(UNLOGGED_RELATION_INIT); - - /* * Okay, we're officially UP. */ InRecovery = false; -- 1.9.1
>From 7eba57b5ed9e1eda1fa1b14b32a82828617d823e Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Wed, 24 Sep 2014 16:01:37 +0530 Subject: =?UTF-8?q?Make=20ResetUnloggedRelations(=E2=80=A6=5FINIT)=20fsync?= =?UTF-8?q?=20newly-created=20main=20forks?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit During UNLOGGED_RELATION_INIT, after we have copied ${x}_init to $x, we issue fsync()s for the newly-created files. We depend on their existence and a checkpoint isn't going to fsync them for us during recovery. See thread from 20140912112246.ga4...@alap3.anarazel.de for details. --- src/backend/storage/file/reinit.c | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 3229f41..4febf6f 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -339,6 +339,42 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) } FreeDir(dbspace_dir); + + /* + * copy_file() above has already called pg_flush_data() on the + * files it created. Now we need to fsync those files, because + * a checkpoint won't do it for us while we're in recovery. + */ + + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + ForkNumber forkNum; + int oidchars; + char oidbuf[OIDCHARS + 1]; + char mainpath[MAXPGPATH]; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum)) + continue; + + /* Also skip it unless this is the init fork. */ + if (forkNum != INIT_FORKNUM) + continue; + + /* Construct main fork pathname. */ + memcpy(oidbuf, de->d_name, oidchars); + oidbuf[oidchars] = '\0'; + snprintf(mainpath, sizeof(mainpath), "%s/%s%s", + dbspacedirname, oidbuf, de->d_name + oidchars + 1 + + strlen(forkNames[INIT_FORKNUM])); + + fsync_fname(mainpath, false); + } + FreeDir(dbspace_dir); + + fsync_fname(dbspacedirname, true); } } -- 1.9.1
>From 2b7eb360cd8789bb75cc9c02b85b1df6a9903f0f Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Wed, 24 Sep 2014 16:07:06 +0530 Subject: Add const to FileName typedef, make fsync_fname use it The filename arguments aren't modified anyway, and this avoids warnings when fsync_fname is called from ResetUnloggedRelationsInDbspaceDir. It is also consistent with several other functions in fd.c that already take const char * arguments. --- src/backend/storage/file/fd.c | 2 +- src/include/storage/fd.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 1f69c9e..33ab471 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -390,7 +390,7 @@ pg_flush_data(int fd, off_t offset, off_t amount) * indicate the OS just doesn't allow/require fsyncing directories. */ void -fsync_fname(char *fname, bool isdir) +fsync_fname(FileName fname, bool isdir) { int fd; int returncode; diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index a6df8fb..2fa4e18 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -46,7 +46,7 @@ * FileSeek uses the standard UNIX lseek(2) flags. */ -typedef char *FileName; +typedef const char *FileName; typedef int File; @@ -113,7 +113,7 @@ extern int pg_fsync_no_writethrough(int fd); extern int pg_fsync_writethrough(int fd); extern int pg_fdatasync(int fd); extern int pg_flush_data(int fd, off_t offset, off_t amount); -extern void fsync_fname(char *fname, bool isdir); +extern void fsync_fname(FileName fname, bool isdir); /* Filename components for OpenTemporaryFile */ #define PG_TEMP_FILES_DIR "pgsql_tmp" -- 1.9.1
>From bc07b596b8778258da661d7c9aea1f4c0d00a2e0 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen <a...@2ndquadrant.com> Date: Wed, 24 Sep 2014 16:20:43 +0530 Subject: If we need to perform crash recovery, fsync the data directory This is so that we don't lose older unflushed writes in the event of a power failure after crash recovery, where more recent writes are preserved. See 20140918083148.ga17...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 218f7fb..95d57cb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5973,6 +5973,18 @@ StartupXLOG(void) ereport(FATAL, (errmsg("control file contains invalid data"))); + /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory to try to ensure that any data written before the + * crash are flushed to disk. Otherwise a power failure in the near + * future might mean that earlier unflushed writes are lost, but the + * more recent data written to disk from here on are persisted. + */ + + if (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) + fsync_fname(data_directory, true); + if (ControlFile->state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ -- 1.9.1
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers