At Fri, 14 May 2021 14:12:31 +0900 (JST), Kyotaro Horiguchi <horikyota....@gmail.com> wrote in > At Thu, 13 May 2021 17:07:31 -0400, Robert Haas <robertmh...@gmail.com> wrote > in > > On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi > > <horikyota....@gmail.com> wrote: > > > It seems to me the issue here is not a race condition but > > > WaitForWALToBecomeAvailable initializing expectedTLEs with the history > > > of a improper timeline. So using recoveryTargetTLI instead of > > > receiveTLI for the case fixes this issue. > > > > I agree. > > > > > I believe the 004_timeline_switch.pl detects your issue. And the > > > attached change fixes it. > > > > So why does this use recoveryTargetTLI instead of receiveTLI only > > conditionally? Why not do it all the time? > > The commit ee994272ca apparently says that there's a case where primary
This is not an incomplete line but just a garbage. > > The hard thing about this code is that the assumptions are not very > > clear. If we don't know why something is a certain way, then we might > > break things if we change it. Worse yet, if nobody else knows why it's > > like that either, then who knows what assumptions they might be > > making? It's hard to be sure that any change is safe. > > Thanks for the comment. > > > But that being said, we have a clear definition from the comments for > > what expectedTLEs is supposed to contain, and it's only going to end > > up with those contents if we initialize it from recoveryTargetTLI. So > > I am inclined to think that we ought to do that always, and if it > > Yes, I also found it after that, and agreed. Desynchronization > between recoveryTargetTLI and expectedTLEs prevents > rescanLatestTimeline from working. > > > breaks something, then that's a sign that some other part of the code > > also needs fixing, because apparently that hypothetical other part of > > the code doesn't work if expctedTLEs contains what the comments say > > that it should. > > After some more inspection, I'm now also sure that it is a > typo/thinko. Other than while fetching the first checkpoint, > receivedTLI is always in the history of recoveryTargetTLI, otherwise > recoveryTargetTLI is equal to receiveTLI. > > I read that the commit message as "waiting for fetching possible > future history files to know if there's the future for the current > timeline. However now I read it as "don't bother expecting for > possiblly-unavailable history files when we're reading the first > checkpoint the timeline for which is already known to us.". If it is > correct we don't bother considering future history files coming from > primary there. > > > Now maybe that's the wrong idea. But if so, then we're saying that the > > definition of expectedTLEs needs to be changed, and we should update > > the comments with the new definition, whatever it is. A lot of the > > confusion here results from the fact that the code and comments are > > inconsistent and we can't tell whether that's intentional or > > inadvertent. Let's not leave the next person who looks at this code > > wondering the same thing about whatever changes we make. > > Ok. The reason why we haven't have a complain about that would be > that it is rare that pg_wal is wiped out before a standby connects to > a just-promoted primary. I'm not sure about the tool Dilip is using, > though.. > > So the result is the attached. This would be back-patcheable to 9.3 > (or 9.6?) but I doubt that we should do as we don't seem to have had a > complaint on this issue and we're not full faith on this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center