[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state
I finally had downtime on our cluster running 20.11.3 and decided to upgrade SLURM. All daemons were stopped on the nodes and the master. The Rocky 8 Linux OS was updated but not changed configuration-wise in any way.

On the master, when I first installed 23.11.1 and tried to run slurmdbd -D -vvv at the command line, it balked at the jump of more than two major versions. I then installed 22.05.6 and ran slurmdbd -D -vvv, which took a long time but completed fine. I reinstalled 23.11.1 and ran slurmdbd -D -vvv again, which went quickly. I stopped it, then started slurmdbd normally with systemctl. When I ran slurmctld it complained about incompatible state files. I didn't care about losing state, so I ran it with the -i option to ignore them. It seemed happy.

I then started up slurmd on my various GPU and non-GPU nodes. I ran a test interactive job just to see if I would get a shell with the expected SLURM environment on the nodes, and it worked, except that on the GPU nodes I got errors about not being able to get/set GPU frequencies. I removed GpuFreqDef=medium from slurm.conf and that went away.

I had my people start running jobs last night. This morning I came in and saw a lot of nodes stuck in the completing state, all of them GPU nodes. For example, on node 'sevilla' squeue reports:

   JOBID PARTITION     NAME     USER ST   TIME NODES NODELIST(REASON)
 3679433   lcna100 sjob_123   sg1526 CG   8:29     1 sevilla
 3679432   lcna100 sjob_123   sg1526 CG   8:35     1 sevilla
 3679431   lcna100 sjob_123   sg1526 CG   8:42     1 sevilla

In the slurmd.log on sevilla I see, every minute, the lines:

[2024-01-28T09:37:52.003] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679432
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679431
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug: _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T09:37:52.004] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB

Back at the time 3679433 was cancelled by the user I see:

[2024-01-28T00:27:48.658] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-01-28T00:27:48.658] debug: _rpc_terminate_job: uid = 1150 JobId=3679433
[2024-01-28T00:27:48.658] debug: credential for job 3679433 revoked
[2024-01-28T00:27:48.658] debug: _rpc_terminate_job: sent SUCCESS for 3679433, waiting for prolog to finish
[2024-01-28T00:27:48.658] debug: Waiting for job 3679433's prolog to complete

On sevilla itself there are no processes currently running for user sg1526, so there is no sign of these jobs.
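For reference, the database conversion sequence on the master was roughly the following (reconstructed from memory, so treat it as a sketch of the steps rather than an exact transcript; "master" is just a placeholder hostname):

[root@master ~]# systemctl stop slurmctld slurmdbd
    ... install the 22.05.6 packages ...
[root@master ~]# slurmdbd -D -vvv      <- foreground run converts the 20.11 database; slow, Ctrl-C once it goes idle
    ... install the 23.11.1 packages ...
[root@master ~]# slurmdbd -D -vvv      <- second conversion for 23.11; quick this time, Ctrl-C again
[root@master ~]# systemctl start slurmdbd
[root@master ~]# slurmctld -i          <- -i ignores the old, now-incompatible state files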
There ARE processes running for user mu40, though:

[root@sevilla ~]# pgrep -f slurm_script | xargs -n1 pstree -p | grep ^slurm
slurm_script(30691)---starter-suid(30764)-+-python3(30784)-+-{python3}(30817)
slurm_script(30861)---starter-suid(30873)-+-python3(30888)-+-{python3}(30915)
slurm_script(30963)---starter-suid(30975)-+-python3(30992)-+-{python3}(31020)

[root@sevilla ~]# strings /proc/30691/cmdline
/bin/sh
/var/slurm/spool/d/job3679220/slurm_script

[root@sevilla ~]# /usr/bin/sacct -p --jobs=3679220 --format=JobID,User,ReqTRES,NodeList,Start,End,Elapsed,CPUTime,State,ExitCode
JobID|User|ReqTRES|NodeList|Start|End|Elapsed|CPUTime|State|ExitCode|
3679220|mu40|billing=6,cpu=3,gres/gpu=1,mem=32G,node=1|sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|NODE_FAIL|1:0|
3679220.batch|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||
3679220.extern|||sevilla|2024-01-27T22:33:27|2024-01-27T23:26:39|00:53:12|02:39:36|CANCELLED||

The slurmd.log has, at the time that job started, lines like:

[2024-01-27T22:33:28.679] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.092] [3679220.extern] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30650, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.166] [3679220.batch] debug2: profile signaling type Task
[2024-01-27T22:33:58.168] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30691, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.170] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30764, probably hasn't started yet or has already finished
[2024-01-27T22:33:58.172] [3679220.batch] debug2: gpu/nvml: _get_gpuutil: Couldn't find pid 30784, probably hasn't started yet or has already finished
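In case it is useful, this is roughly the loop I have been running on the nodes to map leftover slurm_script processes back to their job IDs and check whether squeue still knows about them (nothing clever, it just reads SLURM_JOB_ID out of /proc, so it should be safe to run on a node with healthy jobs):

for pid in $(pgrep -f slurm_script); do
    # pull the job ID out of the process environment
    jobid=$(tr '\0' '\n' < /proc/$pid/environ | sed -n 's/^SLURM_JOB_ID=//p')
    printf '%-8s %-10s ' "$pid" "${jobid:-?}"
    # print the job state, or note that the controller has forgotten it
    squeue -h -j "$jobid" -o '%T' 2>/dev/null || echo "gone from squeue"
done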
Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state
Some more info on what I am seeing after the 23.11.3 upgrade. Here is a case where a job is cancelled but seems permanently stuck in the 'CG' state in squeue. From the slurmctld log:

[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06 #CPUs=4 Partition=rtx8000
[2024-01-28T17:34:11.002] debug3: create_mmap_buf: loaded file `/var/slurm/spool/ctld/hash.3/job.3679903/script` as buf_t
[2024-01-28T17:42:27.724] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=3679903 uid 5875902
[2024-01-28T17:42:27.725] debug: email msg to sg1526: Slurm Job_id=3679903 Name=sjob_1246 Ended, Run time 00:08:17, CANCELLED, ExitCode 0
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: JobId=3679903 action:normal
[2024-01-28T17:42:27.725] debug3: select/cons_tres: job_res_rm_job: removed JobId=3679903 from part rtx8000 row 0
[2024-01-28T17:42:27.726] job_signal: 9 of running JobId=3679903 successful 0x8004
[2024-01-28T17:43:19.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:44:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:45:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:46:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06
[2024-01-28T17:47:20.000] Resending TERMINATE_JOB request JobId=3679903 Nodelist=rtx-06

So at 17:42 the user must have done an scancel. In the slurmd log on the node I see:

[2024-01-28T17:42:27.727] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:42:27.728] debug: credential for job 3679903 revoked
[2024-01-28T17:42:27.728] debug: _rpc_terminate_job: sent SUCCESS for 3679903, waiting for prolog to finish
[2024-01-28T17:42:27.728] debug: Waiting for job 3679903's prolog to complete
[2024-01-28T17:43:19.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:44:20.001] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:45:20.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:46:20.001] debug: _rpc_terminate_job: uid = 1150 JobId=3679903
[2024-01-28T17:47:20.002] debug: _rpc_terminate_job: uid = 1150 JobId=3679903

It is strange that a prolog is being called on a job cancel. slurmd seems to be getting the repeated calls from slurmctld to terminate the job, but the termination never happens. Also, the process table has:

[root@rtx-06 ~]# ps auxw | grep slurmd
root      161784  0.0  0.0 436748 21720 ?        Ssl  13:44   0:00 /usr/sbin/slurmd --systemd
root      190494  0.0  0.0      0     0 ?        Zs   17:34   0:00 [slurmd]

where there is now a zombie slurmd process I cannot kill, even with kill -9. If I do a 'systemctl stop slurmd' it takes a long time, but it eventually stops slurmd and gets rid of the zombie process; unfortunately it also kills the "good" running jobs with NODE_FAIL.
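Since slurmd claims it is waiting on a prolog, the only extra checks I have thought of so far are to confirm what prolog/epilog scripts are actually configured and whether any are still running, or have left debris, on the node; roughly (the grep pattern is just my guess at what is worth looking at):

[root@rtx-06 ~]# scontrol show config | grep -iE 'prolog|epilog'
[root@rtx-06 ~]# pgrep -af prolog          <- any prolog/epilog script still alive?
[root@rtx-06 ~]# ls /var/slurm/spool/d/    <- leftover job dirs / stray step sockets

I am mostly curious whether PrologFlags is set to something that makes the terminate path wait on the prolog, but I have not confirmed that is what is going on here.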
Another case is where a job is cancelled and SLURM acts like it was cancelled, with the job no longer showing up in squeue, but the processes keep running on the box:

# pstree -u sg1526 -p | grep ^slurm
slurm_script(185763)---python(185796)-+-{python}(185797)
# strings /proc/185763/environ | grep JOB_ID
SLURM_JOB_ID=3679888
# squeue -j 3679888
slurm_load_jobs error: Invalid job id specified

sacct shows that job being cancelled. In the slurmd log we see:

[2024-01-28T17:33:58.757] debug: _rpc_terminate_job: uid = 1150 JobId=3679888
[2024-01-28T17:33:58.757] debug: credential for job 3679888 revoked
[2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967292: Connection refused
[2024-01-28T17:33:58.757] debug: Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967292
[2024-01-28T17:33:58.757] debug: signal for nonexistent StepId=3679888.extern stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug: _step_connect: connect() failed for /var/slurm/spool/d/rtx-06_3679888.4294967291: Connection refused
[2024-01-28T17:33:58.757] debug: Cleaned up stray socket /var/slurm/spool/d/rtx-06_3679888.4294967291
[2024-01-28T17:33:58.757] _handle_stray_script: Purging vestigial job script /var/slurm/spool/d/job3679888/slurm_script
[2024-01-28T17:33:58.757] debug: signal for nonexistent StepId=3679888.batch stepd_connect failed: Connection refused
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 were able to be signaled with 18
[2024-01-28T17:33:58.757] debug2: No steps in jobid 3679888 to send signal 15
[2024-01-28T17:33:58.757] debug2: set revoke expiration for jobid 3679888 to 1706481358 UTS
[2024-01-28T17:33:58.757] debug: Waiting for job 3679888's prolog to complete
[2024-01-28T17:33:58.757] debug: Finished wait for job 3679888's prolog to complete
[2024-01-28T17:33:58.771] debug: completed epilog for jobid 3679888
[2024-01-28T17:33:58.774] debug: JobId=3679888: sent epilog complete msg: rc = 0

--
Paul Raines (http://help.nmr.mgh.harvard.edu)