On Sat, Jul 19, 2025 at 10:49 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorot...@gmail.com> writes:
> > I went trough the patchset. Everything looks good to me. I only did
> > some improvements to comments and commit messages. I'm going to push
> > this if no objections.
>
> There's apparently something wrong in the v17 branch, as three
> separate buildfarm members have now hit timeout failures in
> 046_checkpoint_logical_slot.pl [1][2][3]. I tried to reproduce
> this locally, and didn't have much luck initially. However,
> if I build with a configuration similar to grassquit's, it
> will hang up maybe one time in ten:
>
> export ASAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0'
> export UBSAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1'
> ./configure ... usual flags plus ... CFLAGS='-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fsanitize=address -fno-sanitize-recover=all' --enable-injection-points
>
> The fact that 046_checkpoint_logical_slot.pl is skipped in
> non-injection-point builds is probably reducing the number
> of buildfarm failures, since only a minority of animals
> have that turned on yet.
>
> I don't see anything obviously wrong in the test changes, and the
> postmaster log from the failures looks pretty clearly like what is
> hanging up is the pg_logical_slot_get_changes call:
>
> 2025-07-19 16:10:07.276 CEST [3458309][client backend][0/2:0] LOG:  statement: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG:  starting logical decoding for slot "slot_logical"
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL:  Streaming transactions committing after 0/290000F8, reading WAL from 0/1540F40.
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT:  select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG:  logical decoding found consistent point at 0/1540F40
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL:  There are no running transactions.
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT:  select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:59:56.828 CEST [3458140][postmaster][:0] LOG:  received immediate shutdown request
> 2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] LOG:  could not send data to client: Broken pipe
> 2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] STATEMENT:  select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:59:56.851 CEST [3458140][postmaster][:0] LOG:  database system is shut down
>
> So my impression is that the bug is not reliably fixed in 17.
>
> One other interesting thing is that once it's hung, the test does
> not stop after PG_TEST_TIMEOUT_DEFAULT elapses.  You can see
> above that olingo took nearly 50 minutes to give up, and in
> manual testing it doesn't seem to stop either (though I've not
> got the patience to wait 50 minutes...)
Thank you for pointing this out!  Apparently I backpatched d3917d8f13e7
to all the other branches but missed REL_17_STABLE.  This will be fixed now.

------
Regards,
Alexander Korotkov
Supabase
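
P.S.  In case it is useful to anyone following along, here is a rough
sketch of how one might double-check which back branches already carry a
given fix.  It assumes the backpatched commits reuse the original
commit's subject line (usual for PostgreSQL backpatches, but not
guaranteed), and the branch list below is purely illustrative:

    # Backpatched commits get new hashes, so search each branch's history
    # by the subject line of the master commit rather than by its hash.
    subject=$(git log -1 --format=%s d3917d8f13e7)
    for branch in master REL_17_STABLE REL_16_STABLE; do
        echo "== $branch =="
        git log "origin/$branch" --oneline --fixed-strings --grep="$subject"
    done

A branch that prints no matching commit is one the fix still needs to be
pushed to.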