On 12/14/18 3:26 PM, Robert Haas wrote:
> On Thu, Dec 13, 2018 at 12:17 AM Michael Paquier <mich...@paquier.xyz> wrote:
>> On Wed, Dec 12, 2018 at 07:54:05AM -0500, David Steele wrote:
>>> The LSN switch point is often the same even when servers are going to
>>> different timelines. If the LSN is different enough then the problem
>>> solves itself since the .partial will be on an entirely different
>>> segment.
>>
>> That would mean that WAL forked exactly at the same record. You have
>> likely seen more cases where than can happen in real life than I do.
>
> Suppose that the original master fails during an idle period, and we
> promote a slave. But we accidentally promote a slave that can't serve
> as the new master, like because it's in a datacenter with an
> unreliable network connection or one which is about to be engulfed in
> lava.

Much more common than people think.

> So, we go to promote a different slave, and because we never
> got around to reconfiguring the standbys to follow the previous
> promotion, kaboom.

Exactly.

> Or, suppose we do PITR to recover from some user error, but then
> somebody screws up the contents of the recovered cluster and we have
> to do it over again. Naturally we'll recover to the same point.
>
> The new TLI is the only thing that is guaranteed to be unique with
> each new promotion, and I would guess that it is therefore the right
> thing to use to disambiguate them.

This is the conclusion we came to after a few months of diagnosing and
working on this problem.

The question in my mind: is it safe to back-patch?

-- 
-David
da...@pgmasters.net