On Sun, Aug 07, 2022 at 03:18:52PM +0200, Tomas Vondra wrote: > On 8/7/22 02:36, Noah Misch wrote: > > On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote: > >> I've pushed a revert af all the commits related to this - decoding of > >> sequences and test_decoding / built-in replication changes. > > > > Two July buildfarm runs failed with PANIC during standby promotion: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13 > > > > The attached patch hacks things so an ordinary x86_64 GNU/Linux machine > > reproduces this consistently. "git bisect" then traced the regression to > > the > > above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5). The pg_ctl > > test suite passes under this hack in all supported branches, and it passed > > on > > v15 until that revert. Would you investigate? > > > > The buildfarm animal uses keep_error_builds. From kept data directories, I > > deduced these events: > > > > - After the base backup, auto-analyze ran on the primary and wrote WAL. > > - Standby streamed and wrote up to 0/301FFF. > > - Standby received the promote signal. Terminated streaming. WAL page at > > 0/302000 remained all-zeros. > > - Somehow, end-of-recovery became a PANIC. > > I think it'd be really bizarre if this was due to the revert, as that > simply undoes minor WAL changes (and none of this should affect what > happens at WAL page boundary etc.). It just restores WAL as it was > before 0da92dc, nothing particularly complicated. I did go through all > of the changes again and I haven't spotted anything particularly > suspicious, but I'll give it another try tomorrow. > > However, I did try bisecting this using the attached patch, and that > does not suggest the issue is in the revert commit. It actually fails > all the way back to 5dc0418fab2, and it starts working on 9553b4115f1. > > ... > 6392f2a0968 Try to silence "-Wmissing-braces" complaints in ... > => 5dc0418fab2 Prefetch data referenced by the WAL, take II. > 9553b4115f1 Fix warning introduced in 5c279a6d350. > ... > > This is merely 10 commits before the revert, and it seems way more > related to WAL. Also, adding this to the two nodes in 003_standby.pl > makes the issue go away, it seems: > > $node_standby->append_conf('postgresql.conf', > qq(recovery_prefetch = off)); > > I'd bet it's about WAL prefetching, not the revert, and the bisect was a > bit incorrect, because the commits are close and the failures happen to > be rare. (Presumably you first did the bisect and then wrote the patch > that reproduces this, right?)
No. I wrote the patch, then used the patch to drive the bisect. With ten iterations, commit 2c7ea57 passes 0/10, while 2c7ea57^ passes 10/10. I've now tried recovery_prefetch=off. With that, the test passes 10/10 at 2c7ea57. Given your observation of a failure at 5dc0418fab2, I agree with your conclusion. Whatever the role of 2c7ea57 in exposing the failure on my machine, a root cause in WAL prefetching looks more likely. > Adding Thomas Munro to the thread, he's the WAL prefetching expert ;-)