On Sat, May 22, 2021 at 10:15 AM Dilip Kumar <dilipbal...@gmail.com> wrote: > > On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmh...@gmail.com> wrote: > > > > The attached test script, test.sh seems to reliably reproduce this. > > Put that file and the recalcitrant_cp script, also attached, into an > > I haven't tested this, but I will do that. But now we are on the same > page about the cause of the actual problem I reported.
Now, I have tested. I am able to reproduce the actual problem with your script. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com On Sat, May 22, 2021 at 10:15 AM Dilip Kumar <dilipbal...@gmail.com> wrote: > > On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmh...@gmail.com> wrote: > > > > On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmh...@gmail.com> wrote: > > > I had trouble following it completely, but I didn't really spot > > > anything that seemed definitely wrong. However, I don't understand > > > what it has to do with where we are now. What I want to understand is: > > > under exactly what circumstances does it matter that > > > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM, > > > will stream from receiveTLI rather than recoveryTargetTLI? > > > > Ah ha! I think I figured it out. To hit this bug, you need to meet the > > following conditions: > > > > 1. Both streaming and archiving have to be configured. > > 2. You have to promote a new primary. > > 3. After promoting the new primary you have to start a new standby > > that doesn't have local WAL and for which the backup was taken from > > the previous timeline. In Dilip's original scenario, this new standby > > is actually the old primary, but that's not required. > > No, in my original scenario also the new standby was not old primary, > I had 3 nodes > node1-> primary, node2 -> standby1, node3-> standby2 > node2 promoted as a new primary and node3's local WAL was removed (so > that it has to stream checkpoint record from new primary and then > remaining everything happens as you explain in remaining steps). > > > 4. The new standby has to be able to find the history file it needs in > > the archive but not the WAL files. > > 5. The new standby needs to have recovery_target_timeline='latest' > > (which is the default) > > > > When you start the new standby, it will fetch the current TLI from its > > control file. Then, since recovery_target_timeline=latest, the system > > will try to figure out the latest timeline, which only works because > > archiving is configured. There seems to be no provision for detecting > > the latest timeline via streaming. With archiving enabled, though, > > findNewestTimeLine() will be able to restore the history file created > > by the promotion of the new primary, which will cause > > validateRecoveryParameters() to change recoveryTargetTLI. > > Right > > Then we'll > > try to read the WAL segment containing the checkpoint record and fail > > because, by stipulation, only history files are available from the > > archive. Now, because streaming is also configured, we'll try > > streaming. That will work, so we'll be able to read the checkpoint > > record, but now, because WaitForWALToBecomeAvailable() initialized > > expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't > > switch to the correct timeline and it all goes wrong. > > exactly > > > The attached test script, test.sh seems to reliably reproduce this. > > Put that file and the recalcitrant_cp script, also attached, into an > > empty directory, cd to that directory, and run test.sh. Afterwards > > examine pgcascade.log. Basically, these scripts just set up the > > scenario described above. We set up primary and a standby that use > > recalcitrant_cp as the archive command, and because it's recalcitrant, > > it's only willing to copy history files, and always fails for WAL > > files.Then we create a cascading standby by taking a base backup from > > the standby, but before actually starting it, we promote the original > > standby. So now it meets all the conditions described above. I tried a > > couple variants of this test. If I switch the archive command from > > recalcitrant_cp to just regular cp, then there's no problem. And if I > > switch it to something that always fails, then there's also no > > problem. That's because, with either of those changes, condition (4) > > above is no longer met. In the first case, both files end up in the > > archive, and in the second case, neither file. > > I haven't tested this, but I will do that. But now we are on the same > page about the cause of the actual problem I reported. > > > What about hitting this in real life, with a real archive command? > > Well, you'd probably need the archive command to be kind of slow and > > get unlucky on the timing, but there's nothing to prevent it from > > happening. > > Right > > > But, it will be WAY more likely if you have Dilip's original scenario, > > where you try to repurpose an old primary as a standby. It would > > normally be unlikely that the backup used to create a new standby > > would have an older TLI, because you typically wouldn't switch masters > > in between taking a base backup and using it to create a new standby. > > But the old master always has an older TLI. So (3) is satisfied. For > > (4) to be satisfied, you need the old master to fail to archive all of > > its WAL when it shuts down. > > For my original case, both standby1 and standby2 are connected to the > primary. Now, standby1 is promoted and standby2 is shut down. And, > before restarting, all the local WAL of the standby2 is removed so > that it can follow the new primary. The primary info and restore > command for standby2 are changed as per the new primary(standby1). > > Now the scenario is that the standby1 has switched the timeline in the > middle of the segment which contains the checkpoint record, so the > segment with old TL is renamed to (.partial) and the same segment with > new TL is not yet archived but the history file for the new TL has > been archived. > > Now, when standby2 restart the remaining things happened as you > explained, basically it restores the history file and changes the > recoveryTargetTLI but it doesn't get the WAL file from the archive. > So try to stream checkpoint record from the primary using > "ControlFile->checkPointCopy.ThisTimeLineID", which is old timeline. > > Now, we may ask that if the WAL segment with old TL on standby1(new > primary) which contains the checkpoint is already renamed to > ".partial" then how can it stream using the old TL then the answer is > below code[1] in the walsender. Basically, the checkpoint record is > present in both new and old TL as TL switched in the middle of the > segment, it will send you the record from the new TL even if the > walreciever asks to stream with old TL. Now walrecievr is under > impression that it has read from the old TL. And, we know the rest of > the story that we will set the expectedTLEs based on the old history > file and never be able to go to the new TL. > > Anyways now we understand the issue and there are many ways we can > reproduce it. Still, I thought of explaining the exact steps how it > happed for me because now we understand it well so I think it is easy > to explain :) > > [1] > WalSndSegmentOpen() > { > /*------- > * When reading from a historic timeline, and there is a timeline switch > * within this segment, read from the WAL segment belonging to the new > * timeline. > } > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com