After being up since around the second week of October, our Slurm controller (slurmctld) started segfaulting yesterday. It was compiled and run on Ubuntu 16.04.1.
Nov 12 14:31:48 nas-11-1 kernel: [2838306.311552] srvcn[9111]: segfault at 58 ip 00000000004b51fa sp 00007fbe270efb70 error 4 in slurmctld[400000+eb000]
Nov 12 14:32:48 nas-11-1 kernel: [2838366.586784] srvcn[11217]: segfault at 58 ip 00000000004b51fa sp 00007f8f7cc41b70 error 4 in slurmctld[400000+eb000]
Nov 12 14:33:48 nas-11-1 kernel: [2838426.761784] srvcn[13231]: segfault at 58 ip 00000000004b51fa sp 00007fb78a7e6b70 error 4 in slurmctld[400000+eb000]
Nov 12 14:34:48 nas-11-1 kernel: [2838486.976987] srvcn[15228]: segfault at 58 ip 00000000004b51fa sp 00007ffb8e9e8b70 error 4 in slurmctld[400000+eb000]

I compiled 18.08.3 on 18.04 and it hits the same problem. Now slurmctld segfaults shortly after boot:

slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
Segmentation fault (core dumped)

If I look at the core dump:

# gdb ./slurmctld
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Reading symbols from ./slurmctld...done.
(gdb) core ./core
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./slurmctld -D -v -v -v'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0x555787af0f70) at step_mgr.c:2092
2092        i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
[Current thread is 1 (Thread 0x7f06a93d3700 (LWP 25825))]
(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x555787af0f70) at step_mgr.c:2092
#1  post_job_step (step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:4720
#2  0x000055578571d1d8 in _post_job_step (step_ptr=0x555787af0f70) at step_mgr.c:270
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x555787af04a0, step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:311
#4  0x000055578571d35c in job_step_complete (job_id=7035546, step_id=4294967295, uid=uid@entry=0, requeue=requeue@entry=false, job_return_code=<optimized out>) at step_mgr.c:878
#5  0x00005557856f0522 in _slurm_rpc_step_complete (msg=0x7f06a93d2e20, running_composite=<optimized out>) at proc_req.c:3863
#6  0x00005557856fde0b in slurmctld_req (msg=0x7f06a93d2e20, arg=0x7f067c001370) at proc_req.c:512
#7  0x00005557856897e2 in _service_connection (arg=<optimized out>) at controller.c:1274
#8  0x00007f06be41a6db in start_thread (arg=0x7f06a93d3700) at pthread_create.c:463
#9  0x00007f06be14388f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

Has anyone seen anything like this before?
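
In case it helps anyone reading the kernel lines at the top, here is a rough Python sketch that pulls the fields out of an x86 "segfault at ..." message. The function name decode_segfault and the "address below 4096 looks like a NULL-plus-offset dereference" rule are my own assumptions for illustration, not anything from Slurm or the kernel. The error value is read as hex with the usual x86 page-fault bits (bit 0 = page was present, bit 1 = write, bit 2 = user mode).

```python
import re

# Pattern for the x86 kernel message:
#   segfault at <addr> ip <ip> sp <sp> error <code> in <object>[<base>+<size>]
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the decoded fields of a kernel segfault line, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"), 16)
    return {
        "object": m.group("obj"),
        "fault_addr": addr,
        "ip": ip,
        # Offset of the faulting instruction inside the mapped object.
        "offset_in_object": ip - base,
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
        # Heuristic (assumption): a fault inside the first page usually
        # means a field was read through a NULL pointer.
        "likely_null_deref": addr < 4096,
    }


if __name__ == "__main__":
    sample = (
        "Nov 12 14:31:48 nas-11-1 kernel: [2838306.311552] srvcn[9111]: "
        "segfault at 58 ip 00000000004b51fa sp 00007fbe270efb70 "
        "error 4 in slurmctld[400000+eb000]"
    )
    for key, value in decode_segfault(sample).items():
        print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

Read this way, all four kernel lines are user-mode reads of address 0x58 at the same instruction pointer, which would at least be consistent with the job_resrcs_ptr->node_bitmap line gdb shows if job_resrcs_ptr were NULL, though I have not confirmed that from the core.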