On Wed, Dec 18, 2024 at 08:51:20PM -0500, Andres Freund wrote: > I don't think the issue is actually quite as unlikely to be hit as reasoned in > the commit message. The crash has indeed to happen between the link() and > unlink() - but at the end of a checkpoint we do that operations hundreds of > times in a row on a busy server. And that's just after potentially doing lots > of write IO during a checkpoint, filling up drive write caches / eating up > IOPS/bandwidth disk quots.
Looks so, yep. Your timing and the report's timing are interesting. I've been double-checking the code to refresh myself with the problem, and I don't see a reason to not apply something like the attached set down to v13 for all these remaining branches (minus an edit of the commit message). Thoughts? -- Michael
From 168c7071541d231fc13aee424709524f2ed4976e Mon Sep 17 00:00:00 2001 From: Michael Paquier <mich...@paquier.xyz> Date: Tue, 5 Jul 2022 10:16:12 +0900 Subject: [PATCH] Replace durable_rename_excl() by durable_rename(), take two durable_rename_excl() attempts to avoid overwriting any existing files by using link() and unlink(), and it falls back to rename() on some platforms (aka WIN32), which offers no such overwrite protection. Most callers use durable_rename_excl() just in case there is an existing file, but in practice there shouldn't be one (see below for more details). Furthermore, failures during durable_rename_excl() can result in multiple hard links to the same file. As per Nathan's tests, it is possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting, leading to WAL corruption. This change replaces all the calls of durable_rename_excl() to durable_rename(). This removes the protection against accidentally overwriting an existing file, but some platforms are already living without it and ordinarily there shouldn't be one. The function itself is left around in case any extensions are using it. It will be removed on HEAD via a follow-up commit. Here is a summary of the existing callers of durable_rename_excl() (see second discussion link at the bottom), replaced by this commit. First, basic_archive used it to avoid overwriting an archive concurrently created by another server, but as mentioned above, it will still overwrite files on some platforms. Second, xlog.c uses it to recycle past WAL segments, where an overwrite should not happen (origin of the change at f0e37a8) because there are protections about the WAL segment to select when recycling an entry. The third and last area is related to the write of timeline history files. writeTimeLineHistory() will write a new timeline history file at the end of recovery on promotion, so there should be no such files for the same timeline. What remains is writeTimeLineHistoryFile(), that can be used in parallel by a WAL receiver and the startup process, and some digging of the buildfarm shows that EEXIST from a WAL receiver can happen with an error of "could not link file \"pg_wal/xlogtemp.NN\" to \"pg_wal/MM.history\", which would cause an automatic restart of the WAL receiver as it is promoted to FATAL, hence this should improve the stability of the WAL receiver as rename() would overwrite an existing TLI history file already fetched by the startup process at recovery. This is a bug fix, but knowing the unlikeliness of the problem involving one or more crashes at an exceptionally bad moment, no backpatch is done. Also, I want to be careful with such changes (aaa3aed did the opposite of this change by removing HAVE_WORKING_LINK so as Windows would do a link() rather than a rename() but this was not concurrent-safe). A backpatch could be revisited in the future. This is the second time this change is attempted, ccfbd92 being the first one, but this time no assertions are added for the case of a TLI history file written concurrently by the WAL receiver or the startup process because we can expect one to exist (some of the TAP tests are able to trigger with a proper timing). Author: Nathan Bossart Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13 Discussion: https://postgr.es/m/ym6gzbqqdlals...@paquier.xyz --- src/backend/access/transam/timeline.c | 18 +++++------------- src/backend/access/transam/xlog.c | 9 +++------ contrib/basic_archive/basic_archive.c | 5 +++-- 3 files changed, 11 insertions(+), 21 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index be21968293..e0a2a8ea68 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, * Now move the completed history file into place with its final name. */ TLHistoryFilePath(path, newTLI); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - durable_rename_excl(tmppath, path, ERROR); + Assert(access(path, F_OK) != 0 && errno == ENOENT); + durable_rename(tmppath, path, ERROR); /* The history file can be archived immediately. */ if (XLogArchivingActive()) @@ -516,15 +512,11 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size) errmsg("could not close file \"%s\": %m", tmppath))); /* - * Now move the completed history file into place with its final name. + * Now move the completed history file into place with its final name, + * replacing any existing file with the same name. */ TLHistoryFilePath(path, tli); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - durable_rename_excl(tmppath, path, ERROR); + durable_rename(tmppath, path, ERROR); } /* diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 5d8322fbd0..818284759d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3326,14 +3326,11 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, } } - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - if (durable_rename_excl(tmppath, path, LOG) != 0) + Assert(access(path, F_OK) != 0 && errno == ENOENT); + if (durable_rename(tmppath, path, LOG) != 0) { LWLockRelease(ControlFileLock); - /* durable_rename_excl already emitted log message */ + /* durable_rename already emitted log message */ return false; } diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index 87dd77cdd3..db5fd87429 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -282,9 +282,10 @@ basic_archive_file_internal(const char *file, const char *path) /* * Sync the temporary file to disk and move it to its final destination. - * This will fail if destination already exists. + * Note that this will overwrite any existing file, but this is only + * possible if someone else created the file since the stat() above. */ - (void) durable_rename_excl(temp, destination, ERROR); + (void) durable_rename(temp, destination, ERROR); ereport(DEBUG1, (errmsg("archived \"%s\" via basic_archive", file))); -- 2.45.2
From 51649664dd7549ea2f7cadc26d024a3ea56d684a Mon Sep 17 00:00:00 2001 From: Michael Paquier <mich...@paquier.xyz> Date: Tue, 5 Jul 2022 10:16:12 +0900 Subject: [PATCH] Replace durable_rename_excl() by durable_rename(), take two durable_rename_excl() attempts to avoid overwriting any existing files by using link() and unlink(), and it falls back to rename() on some platforms (aka WIN32), which offers no such overwrite protection. Most callers use durable_rename_excl() just in case there is an existing file, but in practice there shouldn't be one (see below for more details). Furthermore, failures during durable_rename_excl() can result in multiple hard links to the same file. As per Nathan's tests, it is possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting, leading to WAL corruption. This change replaces all the calls of durable_rename_excl() to durable_rename(). This removes the protection against accidentally overwriting an existing file, but some platforms are already living without it and ordinarily there shouldn't be one. The function itself is left around in case any extensions are using it. It will be removed on HEAD via a follow-up commit. Here is a summary of the existing callers of durable_rename_excl() (see second discussion link at the bottom), replaced by this commit. First, basic_archive used it to avoid overwriting an archive concurrently created by another server, but as mentioned above, it will still overwrite files on some platforms. Second, xlog.c uses it to recycle past WAL segments, where an overwrite should not happen (origin of the change at f0e37a8) because there are protections about the WAL segment to select when recycling an entry. The third and last area is related to the write of timeline history files. writeTimeLineHistory() will write a new timeline history file at the end of recovery on promotion, so there should be no such files for the same timeline. What remains is writeTimeLineHistoryFile(), that can be used in parallel by a WAL receiver and the startup process, and some digging of the buildfarm shows that EEXIST from a WAL receiver can happen with an error of "could not link file \"pg_wal/xlogtemp.NN\" to \"pg_wal/MM.history\", which would cause an automatic restart of the WAL receiver as it is promoted to FATAL, hence this should improve the stability of the WAL receiver as rename() would overwrite an existing TLI history file already fetched by the startup process at recovery. This is a bug fix, but knowing the unlikeliness of the problem involving one or more crashes at an exceptionally bad moment, no backpatch is done. Also, I want to be careful with such changes (aaa3aed did the opposite of this change by removing HAVE_WORKING_LINK so as Windows would do a link() rather than a rename() but this was not concurrent-safe). A backpatch could be revisited in the future. This is the second time this change is attempted, ccfbd92 being the first one, but this time no assertions are added for the case of a TLI history file written concurrently by the WAL receiver or the startup process because we can expect one to exist (some of the TAP tests are able to trigger with a proper timing). Author: Nathan Bossart Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13 Discussion: https://postgr.es/m/ym6gzbqqdlals...@paquier.xyz --- src/backend/access/transam/timeline.c | 18 +++++------------- src/backend/access/transam/xlog.c | 9 +++------ 2 files changed, 8 insertions(+), 19 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..eeb505fb6a 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, * Now move the completed history file into place with its final name. */ TLHistoryFilePath(path, newTLI); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - durable_rename_excl(tmppath, path, ERROR); + Assert(access(path, F_OK) != 0 && errno == ENOENT); + durable_rename(tmppath, path, ERROR); /* The history file can be archived immediately. */ if (XLogArchivingActive()) @@ -516,15 +512,11 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size) errmsg("could not close file \"%s\": %m", tmppath))); /* - * Now move the completed history file into place with its final name. + * Now move the completed history file into place with its final name, + * replacing any existing file with the same name. */ TLHistoryFilePath(path, tli); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - durable_rename_excl(tmppath, path, ERROR); + durable_rename(tmppath, path, ERROR); } /* diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f1a795bba9..a6e2cb88f3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3693,15 +3693,12 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, } } - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - if (durable_rename_excl(tmppath, path, LOG) != 0) + Assert(access(path, F_OK) != 0 && errno == ENOENT); + if (durable_rename(tmppath, path, LOG) != 0) { if (use_lock) LWLockRelease(ControlFileLock); - /* durable_rename_excl already emitted log message */ + /* durable_rename already emitted log message */ return false; } -- 2.45.2
From 93c566230e5686da1b1d40186f31b79339be2bcf Mon Sep 17 00:00:00 2001 From: Michael Paquier <mich...@paquier.xyz> Date: Tue, 5 Jul 2022 10:16:12 +0900 Subject: [PATCH] Replace durable_rename_excl() by durable_rename(), take two durable_rename_excl() attempts to avoid overwriting any existing files by using link() and unlink(), and it falls back to rename() on some platforms (aka WIN32), which offers no such overwrite protection. Most callers use durable_rename_excl() just in case there is an existing file, but in practice there shouldn't be one (see below for more details). Furthermore, failures during durable_rename_excl() can result in multiple hard links to the same file. As per Nathan's tests, it is possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting, leading to WAL corruption. This change replaces all the calls of durable_rename_excl() to durable_rename(). This removes the protection against accidentally overwriting an existing file, but some platforms are already living without it and ordinarily there shouldn't be one. The function itself is left around in case any extensions are using it. It will be removed on HEAD via a follow-up commit. Here is a summary of the existing callers of durable_rename_excl() (see second discussion link at the bottom), replaced by this commit. First, basic_archive used it to avoid overwriting an archive concurrently created by another server, but as mentioned above, it will still overwrite files on some platforms. Second, xlog.c uses it to recycle past WAL segments, where an overwrite should not happen (origin of the change at f0e37a8) because there are protections about the WAL segment to select when recycling an entry. The third and last area is related to the write of timeline history files. writeTimeLineHistory() will write a new timeline history file at the end of recovery on promotion, so there should be no such files for the same timeline. What remains is writeTimeLineHistoryFile(), that can be used in parallel by a WAL receiver and the startup process, and some digging of the buildfarm shows that EEXIST from a WAL receiver can happen with an error of "could not link file \"pg_wal/xlogtemp.NN\" to \"pg_wal/MM.history\", which would cause an automatic restart of the WAL receiver as it is promoted to FATAL, hence this should improve the stability of the WAL receiver as rename() would overwrite an existing TLI history file already fetched by the startup process at recovery. This is a bug fix, but knowing the unlikeliness of the problem involving one or more crashes at an exceptionally bad moment, no backpatch is done. Also, I want to be careful with such changes (aaa3aed did the opposite of this change by removing HAVE_WORKING_LINK so as Windows would do a link() rather than a rename() but this was not concurrent-safe). A backpatch could be revisited in the future. This is the second time this change is attempted, ccfbd92 being the first one, but this time no assertions are added for the case of a TLI history file written concurrently by the WAL receiver or the startup process because we can expect one to exist (some of the TAP tests are able to trigger with a proper timing). Author: Nathan Bossart Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13 Discussion: https://postgr.es/m/ym6gzbqqdlals...@paquier.xyz --- src/backend/access/transam/timeline.c | 18 +++++------------- src/backend/access/transam/xlog.c | 9 +++------ 2 files changed, 8 insertions(+), 19 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index e6a29d9a9b..517ab023a3 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, * Now move the completed history file into place with its final name. */ TLHistoryFilePath(path, newTLI); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - durable_rename_excl(tmppath, path, ERROR); + Assert(access(path, F_OK) != 0 && errno == ENOENT); + durable_rename(tmppath, path, ERROR); /* The history file can be archived immediately. */ if (XLogArchivingActive()) @@ -516,15 +512,11 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size) errmsg("could not close file \"%s\": %m", tmppath))); /* - * Now move the completed history file into place with its final name. + * Now move the completed history file into place with its final name, + * replacing any existing file with the same name. */ TLHistoryFilePath(path, tli); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - durable_rename_excl(tmppath, path, ERROR); + durable_rename(tmppath, path, ERROR); } /* diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ab4a510ea7..1fa15a58be 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3648,15 +3648,12 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, } } - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - if (durable_rename_excl(tmppath, path, LOG) != 0) + Assert(access(path, F_OK) != 0 && errno == ENOENT); + if (durable_rename(tmppath, path, LOG) != 0) { if (use_lock) LWLockRelease(ControlFileLock); - /* durable_rename_excl already emitted log message */ + /* durable_rename already emitted log message */ return false; } -- 2.45.2
signature.asc
Description: PGP signature