Hi,

recently I got two postgres crashes on an installation that has been running for years, without significant changes recently.

Postgres is 15.15, OS is FreeBSD 14.3.

The crashes are SIGBUS, happening on different db-clusters running on the same node from the same binary:

col:
Feb 2 03:52:57 LOG: background worker "parallel worker" (PID 79324) was terminated by signal 10: Bus error

int:
Feb 12 03:38:03 LOG: background worker "parallel worker" (PID 26340) was terminated by signal 10: Bus error

On the second occurrence I looked into the coredump (which is sparse because this is a production build):

* thread #1, name = 'postgres', stop reason = signal SIGBUS
  * frame #0: 0x0000000829930ac3 libc.so.7`___lldb_unnamed_symbol5890 + 131
    frame #1: 0x000000082992da28 libc.so.7`___lldb_unnamed_symbol5865 + 504
    frame #2: 0x000000082992e889 libc.so.7`___lldb_unnamed_symbol5871 + 2617
    frame #3: 0x000000082990ca84 libc.so.7`___lldb_unnamed_symbol5446 + 644
    frame #4: 0x000000082990c6b7 libc.so.7`___lldb_unnamed_symbol5445 + 839
    frame #5: 0x0000000829952945 libc.so.7`___lldb_unnamed_symbol6064 + 21
    frame #6: 0x0000000829900013 libc.so.7`___lldb_unnamed_symbol5410 + 755
    frame #7: 0x00000000009c0577 postgres`AllocSetContextCreateInternal + 199
    frame #8: 0x00000000006d588c postgres`ExecAssignExprContext + 108
    frame #9: 0x00000000006faab9 postgres`ExecInitSeqScan + 73
    frame #10: 0x00000000006cf188 postgres`ExecInitNode + 248
    frame #11: 0x00000000006c8440 postgres`standard_ExecutorStart + 1056
    frame #12: 0x00000000006cca12 postgres`ParallelQueryMain + 402
    frame #13: 0x0000000000585f79 postgres`ParallelWorkerMain + 985
    frame #14: 0x00000000007bc606 postgres`StartBackgroundWorker + 310
    frame #15: 0x00000000007c1f00 postgres`maybe_start_bgworkers + 1104
    frame #16: 0x00000000007c0a43 postgres`sigusr1_handler + 307
    frame #17: 0x00000008228aa606 libthr.so.3`___lldb_unnamed_symbol688 + 214
    frame #18: 0x00000008228a9b0a libthr.so.3`___lldb_unnamed_symbol669 + 314
    frame #19: 0x0000000821a402d3
    frame #20: 0x00000000007c2545 postgres`ServerLoop + 1605
    frame #21: 0x00000000007bffa3 postgres`PostmasterMain + 3251
    frame #22: 0x0000000000720601 postgres`main + 801
    frame #23: 0x0000000829803190 libc.so.7`__libc_start1 + 304
    frame #24: 0x00000000004ff4e4 postgres`_start + 36

I'm not sure what to make of this. A single crash might be due to a cosmic ray or whatever; a second occurrence usually means there is something wrong.

The function AllocSetContextCreateInternal() seems to do some memory allocation, and everything above it in the trace (frames #0 to #6) is inside libc. That somehow explains the SIGBUS event, and shifts the balance more towards a software issue than a hardware issue.

Forensics from the logfiles tell me that in both cases the only running task that might use parallel workers was a routine data collection job that runs at least every night. It was a different job in each case, with no common parts, no special plugins used or whatever, just plain SQL.

In between the two crashes, the postgres binaries were updated and the system was subsequently rebooted.

The system does not report any hardware issues, nor any failures in other applications running. Memory ECC exists, and does actually work - I've seen that in the past. The two clusters use different physical disks.

Postgres configuration is mostly as recommended. I was surprised to find that we now use *three* different shared-memory allocation tools, but the manual is clear about that:

* 4096 bytes from SysV shm (visible with ipcs)
* shared buffers, apparently from anonymous mmap() - nowhere visible in the system
* dynamic shared memory from Posix shm - these are visible with posixshmcontrol

Some sources say postgres would access the shared memory via handles under /dev/shm. But this is not possible because /dev/shm does not exist (by default on FreeBSD jails).

Furthermore, the manual says postgres uses "a significant number" of semaphores, and that these are *not* SysV sem. They also are not Posix, because these do not exist - one would need to build a custom kernel to get them (according to "man 4 sem").

So far, this does not shed much light on the issue, except insofar as the "dynamic shared memory" seems historically intended specifically for parallel workers. One could assume a kind of coincidence, but looking closer, there are always some of these Posix shm present, on every cluster, and right from the start, parallel workers or not:

# posixshmcontrol list
MODE      OWNER    GROUP    SIZE     PATH
rw------- postgres postgres 30976    /PostgreSQL.1991522144
rw------- postgres postgres 2097152  /PostgreSQL.45072524
rw------- postgres postgres 1048576  /PostgreSQL.1450298

Here are my config adjustments so far as they might somehow relate to memory allocation:

max_connections = 60                    # (change requires restart)
shared_buffers = 40MB                   # min 128kB
temp_buffers = 20MB                     # min 800kB
work_mem = 50MB                         # min 64kB
maintenance_work_mem = 50MB             # min 1MB
max_stack_depth = 40MB                  # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
max_files_per_process = 200             # min 25
effective_io_concurrency = 5            # 1-1000; 0 disables prefetching
synchronous_commit = off                # synchronization level;

-- PMc
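
P.S. In case somebody wants to check their own logs for the same pattern: the log forensics can be roughly reproduced with a small script like the one below. This is only a sketch. The line layout (timestamp, then "LOG:", then the message) is assumed from the two lines quoted in this mail and will differ with another log_line_prefix or with syslog; the function name and the sample filler line are made up for illustration.

```python
import re

# Assumption: lines look like the two quoted in this mail, i.e.
#   Feb 2 03:52:57 LOG: background worker "..." (PID n) was terminated by signal n: text
# A different log_line_prefix needs a different pattern.
WORKER_SIGNAL_RE = re.compile(
    r'^(?P<when>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+LOG:\s+'
    r'background worker "(?P<worker>[^"]+)" \(PID (?P<pid>\d+)\) '
    r'was terminated by signal (?P<signo>\d+): (?P<signame>.+)$'
)


def find_signal_terminations(lines):
    """Return one dict per line that reports a background worker killed by a signal."""
    hits = []
    for line in lines:
        match = WORKER_SIGNAL_RE.match(line.strip())
        if match:
            hit = match.groupdict()
            hit["pid"] = int(hit["pid"])
            hit["signo"] = int(hit["signo"])
            hits.append(hit)
    return hits


if __name__ == "__main__":
    sample = [
        'Feb 2 03:52:57 LOG: background worker "parallel worker" (PID 79324) '
        'was terminated by signal 10: Bus error',
        # Hypothetical filler line, only here to show that non-matching lines are skipped.
        'Feb 2 03:50:00 LOG: some unrelated message',
    ]
    for hit in find_signal_terminations(sample):
        print(hit["when"], hit["worker"], hit["pid"], hit["signo"], hit["signame"])
```

To run it over a real logfile, pass the open file object to find_signal_terminations() instead of the sample list. The timestamps can then be compared against the schedule of the nightly jobs.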
