On Sat, Jul 19, 2025 at 10:49 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorot...@gmail.com> writes:
> > I went trough the patchset. Everything looks good to me. I only did
> > some improvements to comments and commit messages. I'm going to push
> > this if no objections.
>
> There's apparently something wrong in the v17 branch, as three
> separate buildfarm members have now hit timeout failures in
> 046_checkpoint_logical_slot.pl [1][2][3]. I tried to reproduce
> this locally, and didn't have much luck initially. However,
> if I build with a configuration similar to grassquit's, it
> will hang up maybe one time in ten:
>
> export ASAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0'
> export UBSAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1'
> ./configure ... usual flags plus ... CFLAGS='-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fsanitize=address -fno-sanitize-recover=all' --enable-injection-points
>
> The fact that 046_checkpoint_logical_slot.pl is skipped in
> non-injection-point builds is probably reducing the number
> of buildfarm failures, since only a minority of animals
> have that turned on yet.
>
> I don't see anything obviously wrong in the test changes, and the
> postmaster log from the failures looks pretty clearly like what is
> hanging up is the pg_logical_slot_get_changes call:
>
> 2025-07-19 16:10:07.276 CEST [3458309][client backend][0/2:0] LOG:  statement: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG:  starting logical decoding for slot "slot_logical"
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL:  Streaming transactions committing after 0/290000F8, reading WAL from 0/1540F40.
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT:  select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG:  logical decoding found consistent point at 0/1540F40
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL:  There are no running transactions.
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT:  select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:59:56.828 CEST [3458140][postmaster][:0] LOG:  received immediate shutdown request
> 2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] LOG:  could not send data to client: Broken pipe
> 2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] STATEMENT:  select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:59:56.851 CEST [3458140][postmaster][:0] LOG:  database system is shut down
>
> So my impression is that the bug is not reliably fixed in 17.
>
> One other interesting thing is that once it's hung, the test does
> not stop after PG_TEST_TIMEOUT_DEFAULT elapses.  You can see
> above that olingo took nearly 50 minutes to give up, and in
> manual testing it doesn't seem to stop either (though I've not
> got the patience to wait 50 minutes...)
Thank you for pointing this out!  Apparently I backpatched d3917d8f13e7
to all the other branches but missed REL_17_STABLE.  This will be fixed now.

------
Regards,
Alexander Korotkov
Supabase
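
P.S.  In case it is useful to anyone following along, here is a rough
sketch of how one might double-check which back branches already carry a
given fix.  It assumes the backpatched commits reuse the original
commit's subject line (usual for PostgreSQL backpatches, but not
guaranteed), and the branch list below is purely illustrative:

    # Backpatched commits get new hashes, so search each branch's history
    # by the subject line of the master commit rather than by its hash.
    subject=$(git log -1 --format=%s d3917d8f13e7)
    for branch in master REL_17_STABLE REL_16_STABLE; do
        echo "== $branch =="
        git log "origin/$branch" --oneline --fixed-strings --grep="$subject"
    done

A branch that prints no matching commit is one the fix still needs to be
pushed to.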