Hi Japin, On Mon, May 25, 2026 at 7:21 AM Japin Li <[email protected]> wrote:
> > Hi, Mats > > On Sun, 24 May 2026 at 20:30, Mats Kindahl <[email protected]> wrote: > > On Fri, May 22, 2026 at 12:09 AM surya poondla <[email protected]> > wrote: > > > > Hi Mats, > > > > Thanks for picking this up -- the scenario is a real one and I think > the UUID-tagging approach is a clean way to > > solve it. v2 applies and builds without trouble, and the core algorithm > reads well to me. > > I have a handful of observations that I'd love your thoughts. > > > > Hi Surya, > > > > Thank you for the review. It is a quite rare scenario, but it is real > and the fix is simple. > > > > Regarding Correctness I have the below thoughts > > > > 1. UUIDv7 timestamp epoch. > > In StartupXLOG(): > > TimestampTz now = GetCurrentTimestamp(); > > generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000), > > (uint32)(now % 1000) * 1000); > > > > I think there might be a small mismatch here: GetCurrentTimestamp() > returns microseconds since the Postgres epoch > > (2000-01-01), > > whereas generate_uuidv7_r describes its first argument as milliseconds > since the Unix epoch. > > As written that 30-year offset would land in the UUID's timestamp > field, so the resulting UUID wouldn't be a > > conformant UUIDv7 and wouldn't > > time-order against UUIDv7s generated through the SQL functions. > > > > > > > > Uniqueness is preserved either way, so the rewind logic still works as > intended but it seemed worth flagging. > > > > I see conversion that's used elsewhere as: > > us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) > > * SECS_PER_DAY * USECS_PER_SEC; > > > > Or, since promotion isn't on a hot path, gettimeofday() / time(NULL) > directly would also be fine. > > > > Yes, the intention was to use a proper timestamp to allow debugging > servers if necessary. Switched to gettimeofday() and > > used 0 for sub-ms since this is not going to be critical. (We could use > ns here as well, but that would only solve a race > > if you have two servers being promoted in the same ms, which I find > unlikely, and there is a random number added for that > > situation.) > > > > 2. EOR-record path, the intent is unclear. > > > > The comment above generate_uuidv7_r() at says: > > > > "The same UUID is written into the history file and later into the > XLOG_END_OF_RECOVERY record so that pg_rewind can > > distinguish two servers..." > > > > But from what I can see only the history-file part actually lands. > > xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't > add the UUID, and XLogCtl->ThisTimeLineUUID is > > written under info_lck without a > > reader (I couldn't grep it). > > > > The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like > preparation for an EOR-struct extension that > > ended up not being part of the patch. > > > > Was the EOR-record piece something you intended to keep for a > follow-up, or has it been superseded by the > > history-file approach? > > > > No, the EOR changes are not needed for the promotion, contrary to what I > originally thought. Cleaned up the comment and > > the code and removed all traces of changes to the EOR (I hope). > > > > > > > > 3. Malformed UUID handling in readTimeLineHistory(). > > > > The optional field-4 path is: > > > > if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN) > > { > > Datum datum = DirectFunctionCall1(uuid_in, > > > CStringGetDatum(uuid_str)); > > ... > > } > > > > uuid_in() raises ereport(ERROR) on a malformed input, while the > surrounding syntax-error paths in readTimeLineHistory > > () use FATAL deliberately. > > In practice an ERROR during startup ends up being fatal too, so this > isn't strictly a bug but it would be nicer to > > stay consistent. > > > > Agree. I added code to capture the error and raise a FATAL instead (with > the error message from the uuid_in, in case it > > is modified it makes sense to show this). > > > > Regarding the Tests I have the following thoughts > > > > The two new cases are nice, a few extensions that I think would > strengthen them: > > 1. A mixed-version case where one side has a zero UUID. That's the path > we're claiming is graceful, but nothing > > currently exercises it > > > > Yes, that should work regardless of whether the source or the target has > the zero UUID. > > > > I realized one thing: if two timelines have identical TLI but one has > zero UUID and one has not, it seems they could not > > come from the same promotion (one promotion happened on an old server > and the other one on a new server), that is, they > > should be treated as different. Does that make sense? I made the > necessary changes in the attached patches for testing. > > Please have a look. > > > > 2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that > findCommonAncestorTimeline's loop walks past > > matching entries > > before hitting the mismatch. The 0002 test puts the divergence at > depth 1. > > > > I was unsure if this test was necessary or interesting, hence a separate > commit. Since you thought it was useful, it's > > now rolled into the patch and I extended the tests with the scenarios > you suggested. > > > > I also did some refactorings of the tests to avoid duplication. More > below. > > > > 3. A small assertion against the on-disk 00000002.history contents, to > pin down the file format. > > 4. On 0002 the dependency on restore_command pointing at node_x's > pg_wal is the kind of thing that tends to break > > under > > environment changes. A CHECKPOINT on node_x before the backup, or > wal_keep_size as in 0001, would let the test > > stand on its own. > > > > Good point. > > > > I refactored the code to avoid some duplication and make the test flow > self-explanatory and as part of that I set the > > wal_keep_size for all nodes. > > > > In the process I noticed that many of the functions in RewindTest.pm do > the same job as the primitives I wrote, but have > > hard-coded variable names. I could rewrite them to take parameters, but > that would be quite a big patch to add additional > > changes to each call site, so I did not do that and rather added small > wrappers specific for the tests in > > 005_same_timeline.pl⚠️. > > > > Attached a new version of the now single patch. > > > > I'm happy to keep reviewing/contributing, thanks again for working on > it. > > > > Thank you for reviewing it. > > Thank you for your work. I have one comment. > > + a = &tlh->source[tlh->sourceNentries - 2].tluuid; > + b = &tlh->target[tlh->targetNentries - 2].tluuid; > + > + if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero, UUID_LEN) > == 0) > + return true; > + > + return memcmp(a, b, UUID_LEN) == 0; > > Since we already have matchingTimelineUUID(), the above code can be > simplified > using it. > Thank you for the review. I switched to using the matchingTimelineUUID() for this part of the code and made some other minor improvements as well. -- Best wishes, Mats Kindahl, Multigres Developer, Supabase
From 8e27263fc7769e50dbf682b269d9df78dbf1c85e Mon Sep 17 00:00:00 2001 From: Mats Kindahl <[email protected]> Date: Sat, 23 May 2026 16:09:44 +0200 Subject: pg_rewind: use UUIDs to detect independent same-TLI promotions Two PostgreSQL standbys can independently promote to the same timeline ID if their primary stopped before either had a chance to promote. In that situation both clusters share a timeline history prefix that looks identical to pg_rewind: same TLI numbers and same begin/end LSNs. The existing same-TLI shortcut therefore treated the source as a valid rewind target and skipped the rewind entirely, leaving the target's diverged WAL intact. Fix this by embedding a UUIDv7 value in every timeline history file entry at promotion time. Each promotion generates a fresh UUID, so two independent promotions to the same TLI will carry different UUIDs even though the TLI number and begin LSN are identical. When loading the timeline history, pg_rewind uses these UUIDs in two places: 1. findCommonAncestorTimeline checks that the TLI and UUID in each entry match. A mismatch signals independent promotions and the search continues to earlier entries to find the true common ancestor. 2. The same-TLI shortcut (source and target on the same current TLI) compares the UUID stored in the last completed history entry and a mismatch forces a full rewind instead of a no-op. UUIDs are zero for clusters that predate this change, and the comparison function treats a zero UUID on either side as different from a UUID since that promotion has to be from a different server (it had a pre-change version server that was promoted, so it cannot be the same as a post-change version server that was promoted). Two new tests in t/005_same_timeline.pl cover both detection paths. The first covers the same-TLI shortcut: two standbys independently promote to TLI2 and TLI2', each with a distinct UUID. The second covers the ancestor search: the target goes through TLI1 -> TLI2 -> TLI3 while the source independently promoted so that it has a timeline with TLI1 -> TLI2' -> TLI3'. The test ensures that findCommonAncestorTimeline backs up to TLI1 as the true common ancestor rather than accepting the numerically matching TLI2 entry. --- src/backend/access/transam/timeline.c | 79 ++++- src/backend/access/transam/xlog.c | 15 + src/backend/utils/adt/uuid.c | 15 +- src/bin/pg_rewind/pg_rewind.c | 104 ++++++- src/bin/pg_rewind/t/005_same_timeline.pl | 353 +++++++++++++++++++++++ src/bin/pg_rewind/timeline.c | 47 ++- src/include/access/timeline.h | 5 +- src/include/access/xlog_internal.h | 1 + src/include/utils/uuid.h | 10 +- 9 files changed, 606 insertions(+), 23 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 68e5f692d26..df161dcc0d5 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -42,6 +42,8 @@ #include "pgstat.h" #include "storage/fd.h" #include "utils/wait_event.h" +#include "utils/fmgrprotos.h" +#include "utils/uuid.h" /* * Copies all timeline history files with id's between 'begin' and 'end' @@ -110,8 +112,12 @@ readTimeLineHistory(TimeLineID targetTLI) ereport(FATAL, (errcode_for_file_access(), errmsg("could not open file \"%s\": %m", path))); - /* Not there, so assume no parents */ - entry = palloc_object(TimeLineHistoryEntry); + + /* + * Not there, so assume no parents. We use palloc0_object to ensure + * that tluuid is all-zero. + */ + entry = palloc0_object(TimeLineHistoryEntry); entry->tli = targetTLI; entry->begin = entry->end = InvalidXLogRecPtr; return list_make1(entry); @@ -125,6 +131,7 @@ readTimeLineHistory(TimeLineID targetTLI) prevend = InvalidXLogRecPtr; for (;;) { + char uuid_str[UUID_STR_LEN + 1] = {0}; char fline[MAXPGPATH]; char *res; char *ptr; @@ -155,7 +162,8 @@ readTimeLineHistory(TimeLineID targetTLI) if (*ptr == '\0' || *ptr == '#') continue; - nfields = sscanf(fline, "%u\t%X/%08X", &tli, &switchpoint_hi, &switchpoint_lo); + nfields = + sscanf(fline, "%u\t%X/%08X\t%36s", &tli, &switchpoint_hi, &switchpoint_lo, uuid_str); if (nfields < 1) { @@ -164,7 +172,7 @@ readTimeLineHistory(TimeLineID targetTLI) (errmsg("syntax error in history file: %s", fline), errhint("Expected a numeric timeline ID."))); } - if (nfields != 3) + if (nfields < 3) ereport(FATAL, (errmsg("syntax error in history file: %s", fline), errhint("Expected a write-ahead log switchpoint location."))); @@ -176,12 +184,45 @@ readTimeLineHistory(TimeLineID targetTLI) lasttli = tli; - entry = palloc_object(TimeLineHistoryEntry); + /* + * We use palloc0_object to ensure that tluuid is all-zero, which is + * important for pg_rewind to detect whether the history file is + * missing or not. + */ + entry = palloc0_object(TimeLineHistoryEntry); entry->tli = tli; entry->begin = prevend; entry->end = ((uint64) (switchpoint_hi)) << 32 | (uint64) switchpoint_lo; prevend = entry->end; + /* + * Parse the optional UUID field. Old history files have the reason + * string in field 4. It is in theory possible that the reason string + * starts with a UUID, but the current usage do not store a UUID. This + * allows us to support both old and new formats of history files + * without breaking compatibility by checking if the field contains a + * valid UUID. + */ + if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN) + { + PG_TRY(); + { + Datum datum = DirectFunctionCall1(uuid_in, CStringGetDatum(uuid_str)); + + memcpy(&entry->tluuid, DatumGetUUIDP(datum), sizeof(pg_uuid_t)); + } + PG_CATCH(); + { + ErrorData *edata = CopyErrorData(); + + FlushErrorState(); + ereport(FATAL, + errmsg("invalid UUID in history file \"%s\"", path), + errdetail("%s", edata->message)); + } + PG_END_TRY(); + } + /* Build list with newest item first */ result = lcons(entry, result); @@ -197,9 +238,11 @@ readTimeLineHistory(TimeLineID targetTLI) /* * Create one more entry for the "tip" of the timeline, which has no entry - * in the history file. + * in the history file. We use palloc0_object to ensure that tluuid is + * all-zero, which is important for pg_rewind to detect whether the + * history file is missing or not. */ - entry = palloc_object(TimeLineHistoryEntry); + entry = palloc0_object(TimeLineHistoryEntry); entry->tli = targetTLI; entry->begin = prevend; entry->end = InvalidXLogRecPtr; @@ -294,21 +337,33 @@ findNewestTimeLine(TimeLineID startTLI) * * newTLI: ID of the new timeline * parentTLI: ID of its immediate parent + * newTLUUID: UUID uniquely identifying this promotion instance * switchpoint: WAL location where the system switched to the new timeline * reason: human-readable explanation of why the timeline was switched * - * Currently this is only used at the end recovery, and so there are no locking + * The output file is named <newTLI>.history (e.g. 00000003.history). If two + * servers independently promote to the same timeline ID, their history files + * share the same name. In a shared WAL archive the second file to arrive + * silently overwrites the first. The newTLUUID written into the file content + * lets pg_rewind detect this collision: it fetches each server's history file + * directly from that server, compares the UUIDs for every shared TLI, and + * treats a UUID mismatch as evidence of independent promotion even when the + * TLI numbers agree. + * + * Currently this is only used at end of recovery, and so there are no locking * considerations. But we should be just as tense as XLogFileInit to avoid * emplacing a bogus file. */ void writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, + const pg_uuid_t *newTLUUID, XLogRecPtr switchpoint, char *reason) { char path[MAXPGPATH]; char tmppath[MAXPGPATH]; char histfname[MAXFNAMELEN]; char buffer[BLCKSZ]; + char *uuid_str; int srcfd; int fd; int nbytes; @@ -398,13 +453,19 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, * * If we did have a parent file, insert an extra newline just in case the * parent file failed to end with one. + * + * Format: <parentTLI>\t<switchpoint>\t<ThisTimeLineUUID>\t<reason>\n */ + uuid_str = DatumGetCString(DirectFunctionCall1(uuid_out, UUIDPGetDatum(newTLUUID))); + snprintf(buffer, sizeof(buffer), - "%s%u\t%X/%08X\t%s\n", + "%s%u\t%X/%08X\t%s\t%s\n", (srcfd < 0) ? "" : "\n", parentTLI, LSN_FORMAT_ARGS(switchpoint), + uuid_str, reason); + pfree(uuid_str); nbytes = strlen(buffer); errno = 0; diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index beddcb552d6..87486ec6a1c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -99,6 +99,7 @@ #include "storage/subsystems.h" #include "storage/sync.h" #include "utils/guc_hooks.h" +#include "utils/uuid.h" #include "utils/guc_tables.h" #include "utils/injection_point.h" #include "utils/pgstat_internal.h" @@ -6378,6 +6379,9 @@ StartupXLOG(void) newTLI = endOfRecoveryInfo->lastRecTLI; if (ArchiveRecoveryRequested) { + struct timeval tv; + pg_uuid_t uuid_buf; + newTLI = findNewestTimeLine(recoveryTargetTLI) + 1; ereport(LOG, (errmsg("selected new timeline ID: %u", newTLI))); @@ -6408,8 +6412,19 @@ StartupXLOG(void) * to the new timeline, and will try to connect to the new timeline. * To minimize the window for that, try to do as little as possible * between here and writing the end-of-recovery record. + * + * Generate a UUIDv7 that uniquely identifies this promotion. The + * same UUID is written into the history file so that pg_rewind can + * distinguish two servers that independently promoted to the same + * timeline ID. Use gettimeofday() since we are not on a hot path; + * generate_uuidv7 wants milliseconds and we pass 0 for sub-ms since + * the random bits already distinguish UUIDs generated within the same + * millisecond. */ + gettimeofday(&tv, NULL); + generate_uuidv7_r(&uuid_buf, tv.tv_sec * 1000 + tv.tv_usec / 1000, 0); writeTimeLineHistory(newTLI, recoveryTargetTLI, + &uuid_buf, EndOfLog, endOfRecoveryInfo->recoveryStopReason); ereport(LOG, diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c index 6ee3752ac78..f1dc0196cd8 100644 --- a/src/backend/utils/adt/uuid.c +++ b/src/backend/utils/adt/uuid.c @@ -72,7 +72,7 @@ static bool uuid_abbrev_abort(int memtupcount, SortSupport ssup); static Datum uuid_abbrev_convert(Datum original, SortSupport ssup); static inline void uuid_set_version(pg_uuid_t *uuid, unsigned char version); static inline int64 get_real_time_ns_ascending(void); -static pg_uuid_t *generate_uuidv7(uint64 unix_ts_ms, uint32 sub_ms); +pg_uuid_t *generate_uuidv7(uint64 unix_ts_ms, uint32 sub_ms); Datum uuid_in(PG_FUNCTION_ARGS) @@ -581,6 +581,14 @@ get_real_time_ns_ascending(void) return ns; } +pg_uuid_t * +generate_uuidv7(uint64 unix_ts_ms, uint32 sub_ms) +{ + pg_uuid_t *uuid = palloc(UUID_LEN); + + return generate_uuidv7_r(uuid, unix_ts_ms, sub_ms); +} + /* * Generate UUID version 7 per RFC 9562, with the given timestamp. * @@ -597,10 +605,9 @@ get_real_time_ns_ascending(void) * * NB: all numbers here are unsigned, unix_ts_ms cannot be negative per RFC. */ -static pg_uuid_t * -generate_uuidv7(uint64 unix_ts_ms, uint32 sub_ms) +pg_uuid_t * +generate_uuidv7_r(pg_uuid_t *uuid, uint64 unix_ts_ms, uint32 sub_ms) { - pg_uuid_t *uuid = palloc(UUID_LEN); uint32 increased_clock_precision; /* Fill in time part */ diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index 9d745d4b25b..9587cfd5792 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -32,6 +32,19 @@ #include "rewind_source.h" #include "storage/bufpage.h" +/* + * Timeline histories for both clusters, populated by timelines_match(). + */ +typedef struct TimelineHistoriesData +{ + TimeLineHistoryEntry *source, + *target; + int sourceNentries, + targetNentries; +} TimelineHistoriesData; + +typedef TimelineHistoriesData * TimelineHistories; + static void usage(const char *progname); static void perform_rewind(filemap_t *filemap, rewind_source *source, @@ -53,6 +66,9 @@ static void findCommonAncestorTimeline(TimeLineHistoryEntry *a_history, TimeLineHistoryEntry *b_history, int b_nentries, XLogRecPtr *recptr, int *tliIndex); +static inline bool matchingTimelineUUID(TimeLineHistoryEntry *a, TimeLineHistoryEntry *b); +static bool matchAndFetchTimelines(TimeLineID source_tli, TimeLineID target_tli, + TimelineHistories timelineHistories); static void ensureCleanShutdown(const char *argv0); static void disconnect_atexit(void); @@ -141,6 +157,7 @@ main(int argc, char **argv) int c; XLogRecPtr divergerec; int lastcommontliIndex; + TimelineHistoriesData timelineHistories; XLogRecPtr chkptrec; TimeLineID chkpttli; XLogRecPtr chkptredo; @@ -372,10 +389,21 @@ main(int argc, char **argv) * * If both clusters are already on the same timeline, there's nothing to * do. + * + * This also handles the case when two servers independently promoted to + * the same timeline ID: one crashed after writing the history file but + * before its EOR WAL record was distributed, so a second standby promoted + * independently. The history files produced by those two promotions + * carry different UUIDs. + * + * When the clusters are on different timelines we locate the fork point + * via findCommonAncestorTimeline. */ - if (target_tli == source_tli) + if (matchAndFetchTimelines(source_tli, target_tli, &timelineHistories)) { pg_log_info("source and target cluster are on the same timeline"); + pfree(timelineHistories.source); + pfree(timelineHistories.target); rewind_needed = false; target_wal_endrec = InvalidXLogRecPtr; } @@ -389,8 +417,10 @@ main(int argc, char **argv) * Retrieve timelines for both source and target, and find the point * where they diverged. */ - sourceHistory = getTimelineHistory(source_tli, true, &sourceNentries); - targetHistory = getTimelineHistory(target_tli, false, &targetNentries); + targetHistory = timelineHistories.target; + targetNentries = timelineHistories.targetNentries; + sourceHistory = timelineHistories.source; + sourceNentries = timelineHistories.sourceNentries; findCommonAncestorTimeline(sourceHistory, sourceNentries, targetHistory, targetNentries, @@ -874,7 +904,7 @@ getTimelineHistory(TimeLineID tli, bool is_source, int *nentries) */ if (tli == 1) { - history = pg_malloc_object(TimeLineHistoryEntry); + history = pg_malloc0_object(TimeLineHistoryEntry); history->tli = tli; history->begin = history->end = InvalidXLogRecPtr; *nentries = 1; @@ -920,6 +950,56 @@ getTimelineHistory(TimeLineID tli, bool is_source, int *nentries) return history; } +/* + * Return true if two per-entry promotion UUIDs are compatible. + * + * A zero UUID means the history file predates this fix (or the entry is + * synthetic). If both sides are zero we have no UUID information and fall + * back to TLI-number-only matching (backward compatibility with old servers). + * If one side carries a UUID and the other does not, they cannot originate + * from the same promotion and are treated as incompatible. + */ +static inline bool +matchingTimelineUUID(TimeLineHistoryEntry *a, TimeLineHistoryEntry *b) +{ + static const pg_uuid_t zero = {{0}}; + + if (memcmp(&a->tluuid, &zero, UUID_LEN) == 0 && memcmp(&b->tluuid, &zero, UUID_LEN) == 0) + return true; + return memcmp(&a->tluuid, &b->tluuid, UUID_LEN) == 0; +} + +/* + * Fetch the timeline history for both clusters, store them in tlh, and return + * true if the clusters are on the same timeline (no rewind needed). + * + * tlh is always fully populated on return regardless of the result, so the + * caller can pass tlh->source / tlh->target directly to + * findCommonAncestorTimeline() when the return value is false. + * + * TLI 1 always returns true: it is the original timeline and has no promotion + * UUID. For TLI >= 2, the UUID in entry[Nentries - 2] identifies the + * promotion that created the current TLI. Both-zero UUIDs (old history files) + * are treated as compatible; zero-vs-nonzero is treated as a mismatch because + * one side carries a promotion UUID and they cannot be the same promotion. + */ +static bool +matchAndFetchTimelines(TimeLineID source_tli, TimeLineID target_tli, TimelineHistories tlh) +{ + tlh->source = getTimelineHistory(source_tli, true, &tlh->sourceNentries); + tlh->target = getTimelineHistory(target_tli, false, &tlh->targetNentries); + + if (source_tli != target_tli) + return false; + + /* TLI 1 has no promotion UUID; always treat as the same timeline. */ + if (tlh->sourceNentries < 2 || tlh->targetNentries < 2) + return true; + + return matchingTimelineUUID(&tlh->source[tlh->sourceNentries - 2], + &tlh->target[tlh->targetNentries - 2]); +} + /* * Determine the TLI of the last common timeline in the timeline history of * two clusters. *tliIndex is set to the index of last common timeline in @@ -941,12 +1021,26 @@ findCommonAncestorTimeline(TimeLineHistoryEntry *a_history, int a_nentries, * depending on the history files that each node has fetched in previous * recovery processes. Hence check the start position of the new timeline * as well and move down by one extra timeline entry if they do not match. + * + * We also compare timeline UUIDs when both sides carry one. Two servers + * that independently promoted to the same timeline ID produce history + * files with the same name (e.g. 00000003.history); in a shared WAL + * archive the second file silently overwrites the first. pg_rewind + * fetches each server's history file directly from that server, so it + * sees both UUIDs. + * + * The timeline UUID stored in history entry[i] is the UUID of the + * promotion that created entry[i+1], i.e. the UUID of TLI entry[i+1].tli. + * So to check whether entry[i] itself represents the same timeline on + * both sides we look at entry[i-1].tluuid (for i > 0). TLI 1 (i == 0) is + * always the same: it is the original timeline and has no promotion UUID. */ n = Min(a_nentries, b_nentries); for (i = 0; i < n; i++) { if (a_history[i].tli != b_history[i].tli || - a_history[i].begin != b_history[i].begin) + a_history[i].begin != b_history[i].begin || + (i > 0 && !matchingTimelineUUID(&a_history[i - 1], &b_history[i - 1]))) break; } diff --git a/src/bin/pg_rewind/t/005_same_timeline.pl b/src/bin/pg_rewind/t/005_same_timeline.pl index 95a40c3b270..52963f29e00 100644 --- a/src/bin/pg_rewind/t/005_same_timeline.pl +++ b/src/bin/pg_rewind/t/005_same_timeline.pl @@ -7,6 +7,8 @@ # use strict; use warnings FATAL => 'all'; +use File::Copy; +use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; @@ -21,4 +23,355 @@ RewindTest::create_standby(); RewindTest::run_pg_rewind('local'); RewindTest::clean_rewind_test(); +# Helper function to run pg_rewind in local mode with the given source and +# target nodes and extra arguments. +# +# The target and source nodes are stopped before the call and the target is +# restarted afterward. The target's postgresql.conf is copied to a temporary +# location and passed to pg_rewind with --config-file, so that pg_rewind can +# update the target's config file in place without worrying about file +# permissions. The temporary config file is moved back to the target's data +# directory and permissions fixed after pg_rewind finishes. +sub rewind_node +{ + my ($target, $source, $label, @extra_args) = @_; + $source->stop; + $target->stop; + + my $tpgdata = $target->data_dir; + my $tmp = PostgreSQL::Test::Utils::tempdir; + copy("$tpgdata/postgresql.conf", "$tmp/target-postgresql.conf.tmp"); + + command_ok( + [ + 'pg_rewind', + '--debug', + '--source-pgdata' => $source->data_dir, + '--target-pgdata' => $target->data_dir, + '--no-sync', + '--config-file' => "$tmp/target-postgresql.conf.tmp", + @extra_args, + ], + $label); + + move("$tmp/target-postgresql.conf.tmp", "$tpgdata/postgresql.conf"); + chmod($target->group_access() ? 0640 : 0600, "$tpgdata/postgresql.conf") + or BAIL_OUT("unable to set permissions for $tpgdata/postgresql.conf"); + + $target->start; +} + +# Rewrite a node's TLI history file in the old 3-field format (no UUID), so +# that pg_rewind sees a zero UUID for that side, as if the node had been +# promoted by a server that predates the UUID feature. +sub strip_tli_uuid +{ + my ($node, $tli) = @_; + my $histfile = sprintf("%s/pg_wal/%08X.history", $node->data_dir, $tli); + open(my $fh, '<', $histfile) or die "cannot open $histfile: $!"; + my @lines = <$fh>; + close $fh; + open($fh, '>', $histfile) or die "cannot write $histfile: $!"; + for my $line (@lines) + { + chomp $line; + my @f = split(/\t/, $line, 4); + if (@f == 4) + { + # Drop the UUID field (index 2); keep parentTLI, switchpoint, reason. + print $fh join("\t", $f[0], $f[1], $f[3]) . "\n"; + } + else + { + print $fh "$line\n"; + } + } + close $fh; +} + +# Helper function to create an origin node with a test table and a row containing +# the given label. The node is started and ready for use as a source for +# standbys. +sub setup_origin +{ + my ($label) = @_; + my $node = PostgreSQL::Test::Cluster->new($label); + $node->init(allows_streaming => 1); + $node->append_conf('postgresql.conf', "wal_keep_size = 320MB\n"); + $node->start; + $node->safe_psql('postgres', "CREATE TABLE tbl (val text)"); + $node->safe_psql('postgres', "INSERT INTO tbl VALUES ('$label')"); + $node->safe_psql('postgres', 'CHECKPOINT'); + return $node; +} + +# Helper function to create multiple standby nodes from the same origin node. +# Each standby gets its own backup and data directory, so that they will +# generate independent UUIDs on promotion even though they share the same +# timeline history up to the point of promotion. +sub setup_standbys_from_origin +{ + my ($origin, @names) = @_; + my @standbys; + for my $name (@names) + { + my $standby = PostgreSQL::Test::Cluster->new($name); + $origin->backup($standby->name); + $standby->init_from_backup($origin, $standby->name, + has_streaming => 1); + $standby->append_conf('postgresql.conf', "wal_keep_size = 320MB\n"); + $standby->set_standby_mode(); + $standby->start; + push @standbys, $standby; + } + return @standbys; +} + +# Helper function to wait for multiple standby nodes to catch up to the origin. +sub sync_standbys_with_origin +{ + my ($origin, @standbys) = @_; + $origin->wait_for_catchup($_) for @standbys; +} + +# Helper function to insert a row with the given label into a node's test table. +sub write_record +{ + my ($node, $label) = @_; + $node->safe_psql('postgres', "INSERT INTO tbl VALUES ('$label')"); + $node->safe_psql('postgres', 'CHECKPOINT'); +} + +# Test that pg_rewind detects and handles two standbys that independently +# promoted to the same timeline ID. Before the UUID-based divergence check, +# pg_rewind's same-TLI shortcut would incorrectly skip the rewind in this +# case, leaving the target's diverged WAL intact. +# +# origin (TLI 1) +# | +# +--- node_a (TLI 1) --promote--> TLI 2, UUID-A (target) +# | +# +--- node_b (TLI 1) --promote--> TLI 2, UUID-B (source) +# +# pg_rewind must detect the UUID mismatch and rewind node_a to match node_b. + +my $node_origin = setup_origin('origin'); + +# Create node_a and node_b from separate backups of origin so that each +# has its own data directory and will generate an independent UUID on promotion. +my ($node_a, $node_b) = + setup_standbys_from_origin($node_origin, 'node_a', 'node_b'); + +# Wait for both standbys to catch up to origin, then stop origin. After +# this point the two standbys are isolated and will promote independently. +sync_standbys_with_origin($node_origin, $node_a, $node_b); +$node_origin->stop; + +# Promote both standbys. Each lands on TLI 2 but generates a distinct UUID, +# so the resulting clusters are diverged even though they share a timeline ID. +$node_a->promote; +$node_b->promote; + +# Insert a divergent row on each so the rewind has visible work to do. +write_record($node_a, 'in A'); +write_record($node_b, 'in B'); + +rewind_node($node_a, $node_b, + 'pg_rewind detects independent same-TLI promotions'); + +my $result = + $node_a->safe_psql('postgres', "SELECT val FROM tbl ORDER BY val"); +is($result, "in B\norigin", + 'rewound node has source data, not its own divergent data'); + +$node_a->teardown_node; +$node_b->teardown_node; +$node_origin->teardown_node; + +# Test that pg_rewind correctly rewinds across a TLI mismatch buried in a shared +# prefix of the timeline history. The target has gone through three timelines +# (TLI 1 -> TLI 2 -> TLI 3) while the source independently promoted from TLI 1 +# to what is numerically TLI 2 but with a different UUID (TLI 2'). The deepest +# common ancestor is therefore TLI 1, and pg_rewind must rewind the target all +# the way back to the end of TLI 1. +# +# origin (TLI 1) --+-- node_x --promote--> TLI 2 -- node_a --promote--> TLI 3 +# | (target: TLI 1->TLI 2->TLI 3) +# +-- node_b --promote--> TLI 2' +# (source: TLI 1->TLI 2') +# +# findCommonAncestorTimeline walks forward: TLI 1 entries match (UUID=0 on +# both sides), then TLI 2 vs TLI 2' match on tli and begin but differ on +# UUID, signalling independent promotions. The algorithm therefore backs up +# to TLI 1 as the common ancestor and sets the divergence point to the end +# of TLI 1. + +my $node_origin2 = setup_origin('origin2'); + +# node_x and node_b both start from the same TLI 1 baseline. +my ($node_x, $node_b2) = + setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2'); + +# Both standbys must be caught up to the same LSN before origin stops, so +# that TLI 2 and TLI 2' both begin at the same WAL position. +sync_standbys_with_origin($node_origin2, $node_x, $node_b2); +$node_origin2->stop; + +# Promote node_x to TLI 2 (UUID-X) and insert a row. node_b2 is still on +# TLI 1 and has not yet seen any TLI 2 WAL. +$node_x->promote; +write_record($node_x, 'x'); + +# Build node_a2 as a standby of node_x, then promote it to TLI 3. +my ($node_a2) = setup_standbys_from_origin($node_x, 'node_a2'); + +sync_standbys_with_origin($node_x, $node_a2); +$node_x->stop; + +$node_a2->promote; + +# Now promote node_b2 independently from TLI 1 to TLI 2' (UUID-B, != UUID-X). +$node_b2->promote; +write_record($node_b2, 'b'); + +# Rewind node_a2 (TLI 1->TLI 2->TLI 3) from node_b2 (TLI 1->TLI 2') in +# local mode. The rewind must reach back to the end of TLI 1. +# +# node_a2 was initialised from a streaming backup of node_x taken after +# node_x had already completed segment 4 of TLI 2; that segment therefore +# does not appear in node_a2's pg_wal. pg_rewind's backward scan for the +# last checkpoint before the divergence point needs that segment, so we +# point restore_command at node_x's pg_wal and use --restore-target-wal. +# +# Note: no row is inserted on TLI 3. This is intentional: the only +# post-divergence table modification in the target's WAL is the 'x' INSERT +# on TLI 2. On unpatched code the WAL scan would start from the TLI 2 +# shutdown checkpoint (just before TLI 3), miss that earlier insert, and +# leave 'x' in place instead of replacing it with 'b'. +my $node_x_waldir = $node_x->data_dir . "/pg_wal"; +$node_a2->append_conf('postgresql.conf', + "restore_command = 'cp \"$node_x_waldir/%f\" \"%p\"'\n"); + +rewind_node($node_a2, $node_b2, + 'pg_rewind rewinds across mismatched TLI 2 / TLI 2-prime to TLI 1', + '--restore-target-wal'); +my $result2 = + $node_a2->safe_psql('postgres', "SELECT val FROM tbl ORDER BY val"); +is($result2, "b\norigin2", + 'rewound node reflects source history, not target TLI 2/TLI 3 data'); + +$node_a2->teardown_node; +$node_b2->teardown_node; +$node_x->teardown_node; +$node_origin2->teardown_node; + +# Test that pg_rewind correctly detects a mismatch when one cluster's TLI 2 +# history entry carries a zero UUID (old-format history file) while the other +# carries a real UUID. The two clusters must have promoted independently, so +# pg_rewind must rewind to TLI 1 rather than accepting the same-TLI shortcut. +# +# Run both orientations: +# (a) target has zero UUID, source has real UUID +# (b) target has real UUID, source has zero UUID +# +# In both cases the setup is: +# +# origin (TLI 1) --+-- node_p --promote--> TLI 2, UUID-P (target) +# | +# +-- node_q --promote--> TLI 2, UUID-Q (source) +# +# One side then has its history file rewritten to the old 3-field format so +# that its UUID reads as zero. pg_rewind must treat zero-vs-nonzero as +# incompatible (they cannot be the same promotion) and rewind to TLI 1. + +for my $strip_target (1, 0) +{ + my $zero_side = $strip_target ? 'target' : 'source'; + my $real_side = $strip_target ? 'source' : 'target'; + my $sfx = $strip_target ? 'zt' : 'zs'; + my $label = + "pg_rewind rewinds when $zero_side has zero UUID and $real_side has real UUID"; + + my $node_origin3 = setup_origin("origin3_$sfx"); + my ($node_p, $node_q) = + setup_standbys_from_origin($node_origin3, "node_p_$sfx", "node_q_$sfx"); + + sync_standbys_with_origin($node_origin3, $node_p, $node_q); + $node_origin3->stop; + + $node_p->promote; + $node_q->promote; + + write_record($node_p, 'in P'); + write_record($node_q, 'in Q'); + + # Strip UUID from the chosen side to simulate a pre-UUID server. + strip_tli_uuid($strip_target ? $node_p : $node_q, 2); + + rewind_node($node_p, $node_q, $label); + my $result3 = + $node_p->safe_psql('postgres', "SELECT val FROM tbl ORDER BY val"); + is( $result3, + "in Q\norigin3_$sfx", + 'rewound node has source data, not its own divergent row'); + + $node_p->teardown_node; + $node_q->teardown_node; + $node_origin3->teardown_node; +} + +# Test that pg_rewind detects independent promotions to TLI 3 when both +# clusters share a common TLI 1 -> TLI 2 history (same UUID) but independently +# promoted from TLI 2 to TLI 3, producing different TLI 3 UUIDs. +# +# origin (TLI 1) --- node_mid --promote--> TLI 2, UUID-M +# | +# +-- node_c --promote--> TLI 3, UUID-C (target) +# | +# +-- node_d --promote--> TLI 3', UUID-D (source) +# +# The same-TLI shortcut compares entry[Nentries-2].tluuid on each side; that +# is the UUID of the TLI 3 promotion, which differs. The full rewind path +# then walks the history forward: TLI 1 matches (same tli/begin/UUID-M at +# entry[0]), TLI 2 also matches (same tli/begin; UUID-M is the same on both +# sides at entry[0]), but TLI 3 vs TLI 3' differ at entry[1] (UUID-C != UUID-D), +# so the divergence point is set to the end of TLI 2. + +my $node_origin4 = setup_origin('origin4'); +my ($node_mid) = setup_standbys_from_origin($node_origin4, 'node_mid'); + +sync_standbys_with_origin($node_origin4, $node_mid); +$node_origin4->stop; + +# Promote node_mid to TLI 2 and insert a row that both TLI 3 nodes will share. +$node_mid->promote; +write_record($node_mid, 'mid'); + +# node_c and node_d both start as standbys of node_mid so they share the same +# TLI 2 promotion UUID (UUID-M). +my ($node_c, $node_d) = + setup_standbys_from_origin($node_mid, 'node_c', 'node_d'); +sync_standbys_with_origin($node_mid, $node_c, $node_d); +$node_mid->stop; + +# Promote both independently; each generates a distinct TLI 3 UUID. +$node_c->promote; +$node_d->promote; + +write_record($node_c, 'c'); +write_record($node_d, 'd'); + +rewind_node($node_c, $node_d, + 'pg_rewind detects independent TLI 3 / TLI 3-prime promotions sharing TLI 2' +); +my $result4 = + $node_c->safe_psql('postgres', "SELECT val FROM tbl ORDER BY val"); +is($result4, "d\nmid\norigin4", + 'rewound node has source TLI 3-prime data, not its own TLI 3 data'); + +$node_c->teardown_node; +$node_d->teardown_node; +$node_mid->teardown_node; +$node_origin4->teardown_node; + done_testing(); diff --git a/src/bin/pg_rewind/timeline.c b/src/bin/pg_rewind/timeline.c index dda06eaa0bc..b6500606b27 100644 --- a/src/bin/pg_rewind/timeline.c +++ b/src/bin/pg_rewind/timeline.c @@ -9,9 +9,40 @@ */ #include "postgres_fe.h" +#include <ctype.h> +#include <string.h> + #include "access/timeline.h" #include "pg_rewind.h" +/* + * Parse a UUID string in standard dashed form into a pg_uuid_t. + * Returns true on success, false if str is not a valid UUID string. + */ +static bool +rewind_parse_uuid(const char *str, pg_uuid_t *uuid) +{ + const char *src = str; + + for (int i = 0; i < UUID_LEN; i++) + { + char buf[3]; + + if (!isxdigit((unsigned char) src[0]) || + !isxdigit((unsigned char) src[1])) + return false; + buf[0] = src[0]; + buf[1] = src[1]; + buf[2] = '\0'; + uuid->data[i] = (unsigned char) strtoul(buf, NULL, 16); + src += 2; + /* skip dash at positions after bytes 3, 5, 7, 9 (i == 3,5,7,9) */ + if (src[0] == '-' && (i == 3 || i == 5 || i == 7 || i == 9)) + src++; + } + return (*src == '\0'); +} + /* * This is copy-pasted from the backend readTimeLineHistory, modified to * return a malloc'd array and to work without backend functions. @@ -48,6 +79,7 @@ rewind_parseTimeLineHistory(char *buffer, TimeLineID targetTLI, int *nentries) uint32 switchpoint_hi; uint32 switchpoint_lo; int nfields; + char uuid_str[UUID_STR_LEN + 1] = {0}; fline = bufptr; while (*bufptr && *bufptr != '\n') @@ -66,7 +98,8 @@ rewind_parseTimeLineHistory(char *buffer, TimeLineID targetTLI, int *nentries) if (*ptr == '\0' || *ptr == '#') continue; - nfields = sscanf(fline, "%u\t%X/%08X", &tli, &switchpoint_hi, &switchpoint_lo); + nfields = sscanf(fline, "%u\t%X/%08X\t%36s", &tli, &switchpoint_hi, + &switchpoint_lo, uuid_str); if (nfields < 1) { @@ -75,7 +108,7 @@ rewind_parseTimeLineHistory(char *buffer, TimeLineID targetTLI, int *nentries) pg_log_error_detail("Expected a numeric timeline ID."); exit(1); } - if (nfields != 3) + if (nfields < 3) { pg_log_error("syntax error in history file: %s", fline); pg_log_error_detail("Expected a write-ahead log switchpoint location."); @@ -99,7 +132,14 @@ rewind_parseTimeLineHistory(char *buffer, TimeLineID targetTLI, int *nentries) entry->end = ((uint64) (switchpoint_hi)) << 32 | (uint64) switchpoint_lo; prevend = entry->end; - /* we ignore the remainder of each line */ + /* + * Parse the optional UUID field. Old history files have the reason + * string in field 4; its first word is much shorter than UUID_STR_LEN + * so the length check safely distinguishes old from new format. + */ + memset(&entry->tluuid, 0, sizeof(pg_uuid_t)); + if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN) + rewind_parse_uuid(uuid_str, &entry->tluuid); } if (entries && targetTLI <= lasttli) @@ -123,6 +163,7 @@ rewind_parseTimeLineHistory(char *buffer, TimeLineID targetTLI, int *nentries) entry->tli = targetTLI; entry->begin = prevend; entry->end = InvalidXLogRecPtr; + memset(&entry->tluuid, 0, sizeof(pg_uuid_t)); *nentries = nlines; return entries; diff --git a/src/include/access/timeline.h b/src/include/access/timeline.h index 97f1d619c35..cdd642c94f0 100644 --- a/src/include/access/timeline.h +++ b/src/include/access/timeline.h @@ -13,6 +13,7 @@ #include "access/xlogdefs.h" #include "nodes/pg_list.h" +#include "utils/uuid.h" /* * A list of these structs describes the timeline history of the server. Each @@ -22,9 +23,10 @@ * pointers of all the entries form a contiguous line from beginning of time * to infinity. */ -typedef struct +typedef struct TimeLineHistoryEntry { TimeLineID tli; + pg_uuid_t tluuid; /* from history file; zero if unknown */ XLogRecPtr begin; /* inclusive */ XLogRecPtr end; /* exclusive, InvalidXLogRecPtr means infinity */ } TimeLineHistoryEntry; @@ -33,6 +35,7 @@ extern List *readTimeLineHistory(TimeLineID targetTLI); extern bool existsTimeLineHistory(TimeLineID probeTLI); extern TimeLineID findNewestTimeLine(TimeLineID startTLI); extern void writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, + const pg_uuid_t *newTLUUID, XLogRecPtr switchpoint, char *reason); extern void writeTimeLineHistoryFile(TimeLineID tli, char *content, int size); extern void restoreTimeLineHistoryFiles(TimeLineID begin, TimeLineID end); diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index 55663e6f4af..20a2f345fd3 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -22,6 +22,7 @@ #include "access/xlogdefs.h" #include "access/xlogreader.h" #include "datatype/timestamp.h" +#include "utils/uuid.h" #include "lib/stringinfo.h" #include "pgtime.h" #include "storage/block.h" diff --git a/src/include/utils/uuid.h b/src/include/utils/uuid.h index 572d8cf4c36..6839de2e0b2 100644 --- a/src/include/utils/uuid.h +++ b/src/include/utils/uuid.h @@ -17,12 +17,16 @@ /* uuid size in bytes */ #define UUID_LEN 16 +/* length of a UUID string (without null terminator): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx */ +#define UUID_STR_LEN 36 + typedef struct pg_uuid_t { unsigned char data[UUID_LEN]; } pg_uuid_t; -/* fmgr interface macros */ +/* fmgr interface macros (backend only) */ +#ifndef FRONTEND static inline Datum UUIDPGetDatum(const pg_uuid_t *X) { @@ -38,5 +42,9 @@ DatumGetUUIDP(Datum X) } #define PG_GETARG_UUID_P(X) DatumGetUUIDP(PG_GETARG_DATUM(X)) +#endif /* !FRONTEND */ + +extern pg_uuid_t *generate_uuidv7(uint64 unix_ts_ms, uint32 sub_ms); +extern pg_uuid_t *generate_uuidv7_r(pg_uuid_t *uuid, uint64 unix_ts_ms, uint32 sub_ms); #endif /* UUID_H */ -- 2.43.0
