On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
> Problem: restart() kills the walreceiver (as it should), which writes
> that exact FATAL message to the log. The test then searches the log and
> finds it.

Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl

> The test has a comment claiming "a new log file is used on node
> restart". TAP tests use pg_ctl with a fixed filename that gets reused
> across restarts. No log rotation.

I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.

> The fix is obvious: check that the walreceiver PID stays constant.
> That's what we actually care about anyway.

Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow.

You have a point that these would unlikely happen in normal runs, so a
PID matching check would be relevant most of the time anyway, even if
the original PID has been fetched after the TLI jump has been
processed in standby_2. I'd rather keep the log check, TBH, bypassing
it with an extra rotate_logfile() before the restart of standby_2.

> This matters because changes to I/O behavior elsewhere in the code can
> make this test fail spuriously. I hit it while working on O_CLOEXEC
> handling for Windows.

Fun. And the WAL receiver never stops after the restart of standby_2
with the log entry present in the server logs generated before the
restart, right?

[1]: https://www.postgresql.org/message-id/[email protected]
--
Michael
