Hi,

Unfortunately it seems we're not out of the woods yet :-(

I started doing some more testing on the v8 patch. My plan was to do
some stress testing with physical replication, random restarts and stuff
like that. But I ran into issues before that.

Attached is a reproducer script, that does this:

1) initializes an instance with a small (scale 10) pgbench database

2) runs a pgbench in the background, and flips checksums

3) restarts the database with fast or immediate mode

4) watches for checksums state until it reaches expected value

5) restarts the instance

Of course, the restart interrupts the checksum enable, with this message
in the log:

WARNING:  data checksums are being enabled, but no worker is running
1731024482.102 2024-11-08 01:08:02.102 CET [267066] [startup:]
[672d5660.4133a:7] [2024-11-08 01:08:00 CET] [/0] HINT:  If checksums
were being enabled during shutdown then processing must be manually
restarted.

That's expected, of course. So I did

  SELECT pg_enable_data_checksums()

and "datachecksumsworker launcher" appeared in pg_stat_activity, but
nothing else was happening. It also says:

  Waiting for worker in database template0 (pid 258442)

But there are no workers with that PID. Not in the OS, not in the view,
not in the server log. Seems a bit weird. Maybe it already completed,
but then why is there a launcher waiting for it?

Ultimately I tried running CHECKPOINT, And that apparently did the
trick, and the instance restarted. But then on start it hits an assert that:

  (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)

But this only happens in the final stop is -m immediate. If I change it
to "-m fast" it works.

I haven't looked into the details, but I guess it's related to the issue
with controlfile update we dealt with about a month ago.

Attached is the test.sh file (make sure to tweak the paths), and an
example of the backtraces. I've seen various processes hitting that.


Two more comments:

* It's a bit surprising that pg_disable_data_checksums() flips the state
right away, while pg_enable_data_checksums() waits for a checkpoint. I
guess it's correct, but maybe the docs should mention this difference?

* The docs currently say:

<para>
 If the cluster is stopped while in <literal>inprogress-on</literal> mode,
 for any reason, then this process must be restarted manually. To do this,
 re-execute the function <function>pg_enable_data_checksums()</function>
 once the cluster has been restarted. The background worker will attempt
 to resume the work from where it was interrupted.
</para>

I believe that's incorrect/misleading. There's no attempt to resume work
from where it was interrupted.


regards

-- 
Tomas Vondra

Attachment: test.sh
Description: application/shellscript

Core was generated by `postgres: startup recovering 000000010000000000000036    
            '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, 
signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, 
signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f9df42abe9f in __pthread_kill_internal (signo=6, threadid=<optimized 
out>) at ./nptl/pthread_kill.c:78
#2  0x00007f9df425cfb2 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
#3  0x00007f9df4247472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000563597dfedc6 in ExceptionalCondition (conditionName=0x563597eb1e28 
"LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION", 
    fileName=0x563597eb0d6c "xlog.c", lineNumber=4890) at assert.c:66
#5  0x0000563597774f78 in AbsorbChecksumsOnBarrier () at xlog.c:4890
#6  0x0000563597be8dd7 in ProcessProcSignalBarrier () at procsignal.c:581
#7  0x0000563597c1cfcf in ProcessInterrupts () at postgres.c:3495
#8  0x0000563597bf1b62 in ConditionVariableTimedSleep (cv=0x7f9df1b380dc, 
timeout=5000, wait_event_info=134217772) at condition_variable.c:196
#9  0x0000563597be8b69 in WaitForProcSignalBarrier (generation=1) at 
procsignal.c:445
#10 0x000056359777baaf in xlog_redo (record=0x56359d89bc68) at xlog.c:9006
#11 0x000056359778aef0 in ApplyWalRecord (xlogreader=0x56359d89bc68, 
record=0x7f9df217f070, replayTLI=0x7ffc9d15c154) at xlogrecovery.c:1992
#12 0x000056359778aa1b in PerformWalRecovery () at xlogrecovery.c:1822
#13 0x0000563597777135 in StartupXLOG () at xlog.c:6193
#14 0x0000563597b381d4 in StartupProcessMain (startup_data=0x0, 
startup_data_len=0) at startup.c:258
#15 0x0000563597b313b8 in postmaster_child_launch (child_type=B_STARTUP, 
startup_data=0x0, startup_data_len=0, client_sock=0x0) at launch_backend.c:266
#16 0x0000563597b374cf in StartChildProcess (type=B_STARTUP) at 
postmaster.c:3733
#17 0x0000563597b366ad in PostmasterStateMachine () at postmaster.c:3117
#18 0x0000563597b359ab in process_pm_child_exit () at postmaster.c:2541
#19 0x0000563597b3465f in ServerLoop () at postmaster.c:1641
#20 0x0000563597b34044 in PostmasterMain (argc=3, argv=0x56359d84c940) at 
postmaster.c:1348
#21 0x00005635979e57b0 in main (argc=3, argv=0x56359d84c940) at main.c:196


Core was generated by `postgres: background writer                              
            '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, 
signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, 
signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f9df42abe9f in __pthread_kill_internal (signo=6, threadid=<optimized 
out>) at ./nptl/pthread_kill.c:78
#2  0x00007f9df425cfb2 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
#3  0x00007f9df4247472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000563597dfedc6 in ExceptionalCondition (conditionName=0x563597eb1e28 
"LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION", 
    fileName=0x563597eb0d6c "xlog.c", lineNumber=4890) at assert.c:66
#5  0x0000563597774f78 in AbsorbChecksumsOnBarrier () at xlog.c:4890
#6  0x0000563597be8dd7 in ProcessProcSignalBarrier () at procsignal.c:581
#7  0x0000563597b311b1 in HandleMainLoopInterrupts () at interrupt.c:37
#8  0x0000563597b2d927 in BackgroundWriterMain (startup_data=0x0, 
startup_data_len=0) at bgwriter.c:229
#9  0x0000563597b313b8 in postmaster_child_launch (child_type=B_BG_WRITER, 
startup_data=0x0, startup_data_len=0, client_sock=0x0) at launch_backend.c:266
#10 0x0000563597b374cf in StartChildProcess (type=B_BG_WRITER) at 
postmaster.c:3733
#11 0x0000563597b36785 in LaunchMissingBackgroundProcesses () at 
postmaster.c:3158
#12 0x0000563597b3474f in ServerLoop () at postmaster.c:1665
#13 0x0000563597b34044 in PostmasterMain (argc=3, argv=0x56359d84c940) at 
postmaster.c:1348
#14 0x00005635979e57b0 in main (argc=3, argv=0x56359d84c940) at main.c:196

Reply via email to