Hi,

On 2022-03-27 15:28:05 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > How about the attached variation, which retries (for 15s, with 100ms
> > sleeps) if there are multiple walsenders, printing the debugging info
> > each time?  It'll still fail the test if later iterations find just
> > one walsender, which seems the right behaviour for now.
>
> We have now four instances of failures with this version of the test,
> and in every case the second iteration succeeded.  Is that enough
> evidence yet?
I still feel like there's something off here.  But that's probably not
enough to keep causing failures.  I'm inclined to leave the debugging in
for a bit longer, but not fail the test anymore?

The way the temporary slot removal hangs for a while seems just off:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-03-27%2017%3A04%3A18

2022-03-27 13:30:56.993 EDT [2022-03-27 13:30:56 EDT 750695:20] 019_replslot_limit.pl DEBUG: replication slot drop: pg_basebackup_750695: removed on-disk
...
2022-03-27 13:30:57.456 EDT [2022-03-27 13:30:57 EDT 750759:3] [unknown] LOG: connection authorized: user=andrew database=postgres application_name=019_replslot_limit.pl
2022-03-27 13:30:57.466 EDT [2022-03-27 13:30:57 EDT 750759:4] 019_replslot_limit.pl LOG: statement: SELECT * FROM pg_stat_activity
.
2022-03-27 13:30:57.474 EDT [2022-03-27 13:30:56 EDT 750679:87] DEBUG: server process (PID 750759) exited with exit code 0
2022-03-27 13:30:57.507 EDT [2022-03-27 13:30:56 EDT 750695:21] 019_replslot_limit.pl DEBUG: replication slot drop: pg_basebackup_750695: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2009%3A00%3A09

2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: begin
2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: removed on-disk
...
2022-03-25 10:13:31.038 CET [4022841][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 10:13:31.039 CET [4022841][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 10:13:31.045 CET [4022807][postmaster][:0][] DEBUG: server process (PID 4022841) exited with exit code 0
2022-03-25 10:13:31.056 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-03-25%2008%3A02%3A05

2022-03-25 09:15:20.558 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_3730425: removed on-disk
...
2022-03-25 09:15:20.803 CET [3730461][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 09:15:20.804 CET [3730461][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 09:15:20.834 CET [3730381][postmaster][:0][] DEBUG: server process (PID 3730461) exited with exit code 0
2022-03-25 09:15:20.861 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_3730425: done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2003%3A00%3A18

2022-03-25 04:14:03.025 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_2674398: removed on-disk
...
2022-03-25 04:14:03.264 CET [2674463][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 04:14:03.265 CET [2674463][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 04:14:03.270 CET [2674384][postmaster][:0][] DEBUG: server process (PID 2674463) exited with exit code 0
...
2022-03-25 04:14:03.324 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_2674398: done

We are able to start / finish several new connections between the two
debug elog()s in ReplicationSlotDropPtr()?  I wonder if there's something
odd going on with ConditionVariableBroadcast().

Might be useful to add another debug message before/after
ConditionVariableBroadcast() and rmtree() (rough sketch of what I mean at
the end of this mail).  If the delay is due to rmtree() being slow under
concurrent tests I'd feel less concerned (although that machine has dual
NVMe drives...).

I really wish I could reproduce the failure.  I went through a few
hundred cycles of that test in a separate checkout on that machine.


> I'd like to silence this noise so that we can start tracking
> lower-probability failure modes, like say these:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2022-03-26%2002%3A59%3A03

That one was a missing compiler flag.  I see that failure reproducibly
locally when using asan with clang, unless I use
-fsanitize-address-use-after-return=never.  gcc has a different default
for the option, which is why I hadn't configured it.

If I understand correctly, the problem is that
-fsanitize-address-use-after-return uses an alternative stack: sometimes
our stack depth checking code gets called with the "proper" stack, and
sometimes with a "shadow" stack, which breaks the stack depth calculation
(small standalone demo at the end of this mail).


> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-26%2015%3A53%3A51

This one I have no idea about yet.  I assume it's just a race in a new
test...

Greetings,

Andres Freund
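
The bracketing sketch mentioned above.  Illustrative only, not the actual
ReplicationSlotDropPtr() and not a tested patch: the message wording, the
DEBUG level and the surrounding details are made up; only the
ConditionVariableBroadcast()/rmtree() calls correspond to real code:

	/* sketch: extra messages around waking up waiters */
	elog(DEBUG1, "replication slot drop: %s: before condvar broadcast",
		 NameStr(slot->data.name));
	ConditionVariableBroadcast(&slot->active_cv);
	elog(DEBUG1, "replication slot drop: %s: after condvar broadcast",
		 NameStr(slot->data.name));

	...

	/* sketch: extra messages around removing the slot directory */
	elog(DEBUG1, "replication slot drop: %s: before rmtree",
		 NameStr(slot->data.name));
	if (!rmtree(tmppath, true))
		ereport(WARNING,
				(errmsg("could not remove directory \"%s\"", tmppath)));
	elog(DEBUG1, "replication slot drop: %s: after rmtree",
		 NameStr(slot->data.name));

That would at least show whether the time goes into the broadcast, into
the directory removal, or somewhere else entirely.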
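
And the use-after-return demo.  This is explicitly not PostgreSQL code,
just a tiny standalone program using the same "distance from a saved base
address" technique the backend's stack depth check relies on; the build
commands and flag values are only what I'd use to show the effect:

/*
 * stack_depth_demo.c
 *
 * Suggested builds (clang):
 *   clang -g -fsanitize=address -fsanitize-address-use-after-return=never  stack_depth_demo.c
 *   clang -g -fsanitize=address -fsanitize-address-use-after-return=always stack_depth_demo.c
 */
#include <stdio.h>

static char *stack_base_ptr;	/* saved at the start of main() */

/* distance between the saved base and a local in the current frame */
static long
current_depth(void)
{
	char		stack_top_loc;
	long		depth = (long) (stack_base_ptr - &stack_top_loc);

	return depth < 0 ? -depth : depth;
}

static long
recurse(int n)
{
	volatile char pad[64];		/* give each level a non-trivial frame */

	pad[0] = 0;
	if (n == 0)
		return current_depth();
	return recurse(n - 1);
}

int
main(void)
{
	char		base;

	stack_base_ptr = &base;
	printf("measured depth after 1000 frames: %ld bytes\n", recurse(1000));
	return 0;
}

With =never the printed depth tracks the recursion depth, as one would
expect.  With =always the address-taken locals live on ASAN's fake,
heap-backed stack, so the subtraction compares addresses from two
unrelated allocations and the result no longer says anything about how
deep the real stack is.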