Re: pg_rewind does not rewind diverging timelines

Japin Li Mon, 25 May 2026 23:56:37 -0700


Hi, Mats


Thanks for updating the patch.

On Mon, 25 May 2026 at 20:59, Mats Kindahl <[email protected]> wrote:
> Hi Japin,
>
> On Mon, May 25, 2026 at 7:21 AM Japin Li <[email protected]> wrote:
>
>  Hi, Mats
>
>  On Sun, 24 May 2026 at 20:30, Mats Kindahl <[email protected]> wrote:
>  > On Fri, May 22, 2026 at 12:09 AM surya poondla <[email protected]> 
> wrote:
>  >
>  >  Hi Mats,
>  >
>  >  Thanks for picking this up -- the scenario is a real one and I think the 
> UUID-tagging approach is a clean way to
>  >  solve it. v2 applies and builds without trouble, and the core algorithm 
> reads well to me. 
>  >  I have a handful of observations that I'd love your thoughts.
>  >
>  > Hi Surya,
>  >
>  > Thank you for the review. It is a quite rare scenario, but it is real and 
> the fix is simple.
>  >  
>  >  Regarding Correctness I have the below thoughts
>  >
>  >  1. UUIDv7 timestamp epoch.
>  >       In StartupXLOG():
>  >           TimestampTz now = GetCurrentTimestamp();
>  >           generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
>  >                                        (uint32)(now % 1000) * 1000);
>  >
>  >  I think there might be a small mismatch here: GetCurrentTimestamp() 
> returns microseconds since the Postgres epoch
>  >  (2000-01-01), 
>  >  whereas generate_uuidv7_r describes its first argument as milliseconds 
> since the Unix epoch. 
>  >  As written that 30-year offset would land in the UUID's timestamp field, 
> so the resulting UUID wouldn't be a
>  >  conformant UUIDv7 and wouldn't
>  >  time-order against UUIDv7s generated through the SQL functions.
>  >
>  >  
>  >  
>  >  Uniqueness is preserved either way, so the rewind logic still works as 
> intended but it seemed worth flagging.
>  >
>  >  I see conversion that's used elsewhere as:
>  >  us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
>  >                     * SECS_PER_DAY * USECS_PER_SEC;
>  >
>  >  Or, since promotion isn't on a hot path, gettimeofday() / time(NULL) 
> directly would also be fine.
>  >
>  > Yes, the intention was to use a proper timestamp to allow debugging 
> servers if necessary. Switched to gettimeofday
>  () and
>  > used 0 for sub-ms since this is not going to be critical. (We could use ns 
> here as well, but that would only solve
>  a race
>  > if you have two servers being promoted in the same ms, which I find 
> unlikely, and there is a random number added
>  for that
>  > situation.)
>  >  
>  >  2. EOR-record path, the intent is unclear.
>  >
>  >  The comment above generate_uuidv7_r() at says:
>  >
>  >  "The same UUID is written into the history file and later into the 
> XLOG_END_OF_RECOVERY record so that pg_rewind
>  can
>  >  distinguish two servers..."
>  >
>  >  But from what I can see only the history-file part actually lands.
>  >  xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add 
> the UUID, and XLogCtl->ThisTimeLineUUID
>  is
>  >  written under info_lck without a
>  >  reader (I couldn't grep it). 
>  >
>  >  The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like 
> preparation for an EOR-struct extension
>  that
>  >  ended up not being part of the patch.
>  >
>  >  Was the EOR-record piece something you intended to keep for a follow-up, 
> or has it been superseded by the
>  >  history-file approach?
>  >
>  > No, the EOR changes are not needed for the promotion, contrary to what I 
> originally thought. Cleaned up the comment
>  and
>  > the code and removed all traces of changes to the EOR (I hope).
>  >  
>  >       
>  >
>  >  3. Malformed UUID handling in readTimeLineHistory().
>  >
>  >       The optional field-4 path is:
>  >
>  >           if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
>  >           {
>  >               Datum datum = DirectFunctionCall1(uuid_in,
>  >                                                 CStringGetDatum(uuid_str));
>  >               ...
>  >           }
>  >
>  >  uuid_in() raises ereport(ERROR) on a malformed input, while the 
> surrounding syntax-error paths in
>  readTimeLineHistory
>  >  () use FATAL deliberately. 
>  >  In practice an ERROR during startup ends up being fatal too, so this 
> isn't strictly a bug but it would be nicer to
>  >  stay consistent.
>  >
>  > Agree. I added code to capture the error and raise a FATAL instead (with 
> the error message from the uuid_in, in
>  case it
>  > is modified it makes sense to show this).
>  >  
>  >  Regarding the Tests I have the following thoughts
>  >
>  >  The two new cases are nice, a few extensions that I think would 
> strengthen them:
>  >  1. A mixed-version case where one side has a zero UUID. That's the path 
> we're claiming is graceful, but nothing
>  >  currently exercises it
>  >
>  > Yes, that should work regardless of whether the source or the target has 
> the zero UUID.
>  >
>  > I realized one thing: if two timelines have identical TLI but one has zero 
> UUID and one has not, it seems they
>  could not
>  > come from the same promotion (one promotion happened on an old server and 
> the other one on a new server), that is,
>  they
>  > should be treated as different. Does that make sense? I made the necessary 
> changes in the attached patches for
>  testing.
>  > Please have a look.
>  >  
>  >  2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that 
> findCommonAncestorTimeline's loop walks past
>  >  matching entries
>  >       before hitting the mismatch. The 0002 test puts the divergence at 
> depth 1.
>  >
>  > I was unsure if this test was necessary or interesting, hence a separate 
> commit. Since you thought it was useful,
>  it's
>  > now rolled into the patch and I extended the tests with the scenarios you 
> suggested.
>  >
>  > I also did some refactorings of the tests to avoid duplication. More below.
>  >  
>  >  3. A small assertion against the on-disk 00000002.history contents, to 
> pin down the file format.
>  >  4. On 0002 the dependency on restore_command pointing at node_x's pg_wal 
> is the kind of thing that tends to break
>  >  under
>  >       environment changes. A CHECKPOINT on node_x before the backup, or 
> wal_keep_size as in 0001, would let the
>  test
>  >  stand on its own.
>  >
>  > Good point.
>  >
>  > I refactored the code to avoid some duplication and make the test flow 
> self-explanatory and as part of that I set
>  the
>  > wal_keep_size for all nodes.
>  >
>  > In the process I noticed that many of the functions in RewindTest.pm do 
> the same job as the primitives I wrote, but
>  have
>  > hard-coded variable names. I could rewrite them to take parameters, but 
> that would be quite a big patch to add
>  additional
>  > changes to each call site, so I did not do that and rather added small 
> wrappers specific for the tests in
>  > 005_same_timeline.pl⚠️⚠️.
>  >  
>  > Attached a new version of the now single patch.
>  >
>  >  I'm happy to keep reviewing/contributing, thanks again for working on it.
>  >
>  > Thank you for reviewing it.
>
>  Thank you for your work.  I have one comment.
>
>  +       a = &tlh->source[tlh->sourceNentries - 2].tluuid;
>  +       b = &tlh->target[tlh->targetNentries - 2].tluuid;
>  +
>  +       if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero, UUID_LEN) == 
> 0)
>  +               return true;
>  +
>  +       return memcmp(a, b, UUID_LEN) == 0;
>
>  Since we already have matchingTimelineUUID(), the above code can be 
> simplified
>  using it.
>
> Thank you for the review. I switched to using the matchingTimelineUUID() for 
> this part of the code and made some other
> minor improvements as well.

Here are some comments on v4.

1.
+/*
+ * Timeline histories for both clusters, populated by timelines_match().
+ */

I don't see a timelines_match() function.  Does this refer to
matchAndFetchTimelines()?

2.
+typedef struct TimelineHistoriesData
+{
+       TimeLineHistoryEntry *source,
+                          *target;
+       int                     sourceNentries,
+                               targetNentries;
+}                      TimelineHistoriesData;

I'd prefer to use TimeLineHistoriesData to stay consistent with
TimeLineHistoryEntry.  Anyway I'm not instant on it.

3.
+typedef TimelineHistoriesData * TimelineHistories;

The space between * and TimelineHistories is unnecessary — see
StringInfoData and other typedefs.

4.
+# node_x and node_b both start from the same TLI 1 baseline.
+my ($node_x, $node_b2) =
+  setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2');

There appears to be a typo in the comment.  The node_b should be node_b2.


Everything else looks good.  Thank you again for updating the patch!

> --
> Best wishes,
> Mats Kindahl, Multigres Developer, Supabase

-- 
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

Re: pg_rewind does not rewind diverging timelines

Reply via email to