On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <to...@vondra.me> wrote:
>
> On 7/4/25 20:12, Tomas Vondra wrote:
> > On 7/4/25 13:05, Jakub Wartak wrote:
> >> ...
> >>
> >> 8. v1-0005 2x + /* if (numa_procs_interleave) */
> >>
> >> Ha! it's a TRAP! I've uncommented it because I wanted to try it out
> >> without it (just by setting GUC off) , but "MyProc->sema" is NULL :
> >>
> >> 2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
> >> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
> >> [..]
> >> 2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
> >> was terminated by signal 11: Segmentation fault
> >> 2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
> >> active server processes
> >> 2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
> >> "restart_after_crash" is off
> >> 2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
> >>
> >> [New LWP 28755]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library
> >> "/lib/x86_64-linux-gnu/libthread_db.so.1".
> >> Core was generated by `postgres: io worker '.
> >> Program terminated with signal SIGSEGV, Segmentation fault.
> >> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
> >> at ./nptl/sem_waitcommon.c:136
> >> 136 ./nptl/sem_waitcommon.c: No such file or directory.
> >> (gdb) where
> >> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
> >> at ./nptl/sem_waitcommon.c:136
> >> #1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
> >> #2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
> >> ../src/backend/port/posix_sema.c:302
> >> #3 0x0000556191970553 in InitAuxiliaryProcess () at
> >> ../src/backend/storage/lmgr/proc.c:992
> >> #4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
> >> ../src/backend/postmaster/auxprocess.c:65
> >> #5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
> >> out>, startup_data_len=<optimized out>) at
> >> ../src/backend/storage/aio/method_worker.c:393
> >> #6 0x00005561918e8163 in postmaster_child_launch
> >> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
> >> startup_data=startup_data@entry=0x0,
> >> startup_data_len=startup_data_len@entry=0,
> >> client_sock=client_sock@entry=0x0) at
> >> ../src/backend/postmaster/launch_backend.c:290
> >> #7 0x00005561918ea09a in StartChildProcess
> >> (type=type@entry=B_IO_WORKER) at
> >> ../src/backend/postmaster/postmaster.c:3973
> >> #8 0x00005561918ea308 in maybe_adjust_io_workers () at
> >> ../src/backend/postmaster/postmaster.c:4404
> >> [..]
> >> (gdb) print *MyProc->sem
> >> Cannot access memory at address 0x0
> >>
> >
> > Yeah, good catch. I'll look into that next week.
> >
>
> I've been unable to reproduce this issue, but I'm not sure what settings
> you actually used for this instance. Can you give me more details how to
> reproduce this?

Better late than never. Feel free to partially ignore me: I missed that
this is a known issue, as per the FIXME there. I would just rip out the
commented-out `if (numa_procs_interleave)` from FastPathLockShmemSize()
and PGProcShmemSize(), unless of course you want to save those memory
pages in the no-NUMA case. If you do want to save those pages, I think
we have a problem.

For the complete picture, the steps:

1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch

2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch

   BTW the accidental pgbench indent is still there (part of the
   v2-0001 patch):

   14 out of 14 hunks FAILED -- saving rejects to file
   src/bin/pgbench/pgbench.c.rej

3. As I'm applying just 0001 and 0006, I got two simple rejects (due to
   not applying the missing numa_ freelist patches), but fixed them.
   That's intentional on my part, because I wanted to play with just
   those two.

4. Then I uncommented those two "if (numa_procs_interleave)" checks that
   make the extra shmem sizing optional (add_size() and so on; the ones
   with the XXX comment above saying that it causes bootstrap issues).

5. initdb with numa_procs_interleave=on, huge_pages = on (!), start:
   it is OK.

6. Restart with numa_procs_interleave=off, which gets me every bg worker
   crashing, e.g.:

(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
   at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x0000563e2d6e4d5c in PGSemaphoreReset (sema=0x0)
   at ../src/backend/port/posix_sema.c:302
#3 0x0000563e2d774d93 in InitAuxiliaryProcess ()
   at ../src/backend/storage/lmgr/proc.c:995
#4 0x0000563e2d6e9252 in AuxiliaryProcessMainCommon ()
   at ../src/backend/postmaster/auxprocess.c:65
#5 0x0000563e2d6eb683 in CheckpointerMain (startup_data=<optimized out>,
   startup_data_len=<optimized out>)
   at ../src/backend/postmaster/checkpointer.c:190
#6 0x0000563e2d6ec363 in postmaster_child_launch
   (child_type=child_type@entry=B_CHECKPOINTER, child_slot=249,
   startup_data=startup_data@entry=0x0,
   startup_data_len=startup_data_len@entry=0,
   client_sock=client_sock@entry=0x0)
   at ../src/backend/postmaster/launch_backend.c:290
#7 0x0000563e2d6ee29a in StartChildProcess
   (type=type@entry=B_CHECKPOINTER)
   at ../src/backend/postmaster/postmaster.c:3973
#8 0x0000563e2d6f17a6 in PostmasterMain (argc=argc@entry=3,
   argv=argv@entry=0x563e377cc0e0)
   at ../src/backend/postmaster/postmaster.c:1386
#9 0x0000563e2d4948fc in main (argc=3, argv=0x563e377cc0e0)
   at ../src/backend/main/main.c:231

Notice sema=0x0, because:

#3 0x000056050928cd93 in InitAuxiliaryProcess ()
   at ../src/backend/storage/lmgr/proc.c:995
995 PGSemaphoreReset(MyProc->sem);
(gdb) print MyProc
$1 = (PGPROC *) 0x7f09a0c013b0
(gdb) print MyProc->sem
$2 = (PGSemaphore) 0x0

or with printfs:

2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal PGPROC=0x7f9de827b880 requestSize=148770
// after proc && ptr manipulation:
2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal PGPROC=0x7f9de827bdf0 requestSize=148770 procs=0x7f9de827b880 ptr=0x7f9de827bdf0
[..initialization of aux PGPROCs i=0.., still from InitProcGlobal(), each gets a proper sem allocated as one would expect:]
[..for i loop:]
2025-07-25 11:17:23.689 CEST [21772] LOG: i=136 , proc=0x7f9de8600000, proc->sem=0x7f9da4e04438
2025-07-25 11:17:23.689 CEST [21772] LOG: i=137 , proc=0x7f9de8600348, proc->sem=0x7f9da4e044b8
2025-07-25 11:17:23.689 CEST [21772] LOG: i=138 , proc=0x7f9de8600690, proc->sem=0x7f9da4e04538
[..but then in the children codepaths, out of the blue, in InitAuxiliaryProcess() the whole MyProc looks as if it had been memset to zeros:]
2025-07-25 11:17:23.693 CEST [21784] LOG: auxiliary process using MyProc=0x7f9de8600000 auxproc=0x7f9de8600000 proctype=0 MyProcPid=21784 MyProc->sem=(nil)

So PGPROC slot i=136 got address 0x7f9de8600000, and by the time that
auxiliary process is launched, something has NULLified ->sem there
(according to gdb, everything in that struct is zero).

7. The original patch v2-0006 (with both "if (numa_procs_interleave)"
   checks commented out) behaves OK. In my case here, with 1x NUMA node,
   that gives add_size(.., (1+1) * 2MB) = 4MB:

2025-07-25 11:38:54.131 CEST [23939] LOG: in InitProcGlobal PGPROC=0x7f25cbe7b880 requestSize=4343074
2025-07-25 11:38:54.132 CEST [23939] LOG: in InitProcGlobal PGPROC=0x7f25cbe7bdf0 requestSize=4343074 procs=0x7f25cbe7b880 ptr=0x7f25cbe7bdf0

So something is apparently zeroing out all those MyProc structures on
startup (maybe due to some wrong alignment somewhere?). I was thinking
about using mprotect() to trap writes to this single i=136 PGPROC at
0x7f9de8600000 to see what is resetting it, but oh well, mprotect()
works only on whole pages...

-J.