Update: the problem is finally solved. In case someone runs into something similar, the issue was that there were leftover files in /var/spool/slurmd/ from the previous Ubuntu version (22.04); for some reason those made slurmd behave like that. Deleting all the files inside that directory solved the problem!
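For anyone who wants the concrete steps, the cleanup looks roughly like this. This is only a sketch: the one step confirmed to matter is deleting the leftover files under /var/spool/slurmd/ (or whatever SlurmdSpoolDir points to in your slurm.conf); stopping/restarting slurmd around it and resuming the node afterwards are just my usual precautions.

➜ ~ sudo systemctl stop slurmd                              # on the affected node (nodeGPU02 here)
➜ ~ sudo rm -rf /var/spool/slurmd/*                         # delete the stale state/socket files left by the old install
➜ ~ sudo systemctl start slurmd
➜ ~ sudo scontrol update NodeName=nodeGPU02 State=RESUME    # from the master, if the node was left drained/down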
On Sat, Sep 28, 2024, 2:13 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:

> Dear community,
> I am having a strange issue and have been unable to find the cause. Last week I
> did a full update on the cluster, which is composed of the master node and two
> compute nodes (nodeGPU01 -> DGX A100 and nodeGPU02 -> custom GPU server).
> After the update:
>
> - the master node ended up with Ubuntu 24.04,
> - nodeGPU01 with the latest DGX OS (still Ubuntu 22.04),
> - nodeGPU02 with Ubuntu 24.04 LTS,
> - launching jobs from the master on the partitions of nodeGPU01 works perfectly,
> - launching jobs from the master on the partition of nodeGPU02 stopped working (it hangs).
>
> nodeGPU02 (Ubuntu 24.04) is no longer processing jobs successfully, while
> nodeGPU01 works perfectly even though the master is on Ubuntu 24.04.
> Any help is welcome; I have tried many things and had no success in finding the
> cause of this. Please let me know if you need more information.
> Many thanks in advance.
>
> This is the initial `slurmd` log of the problematic node (nodeGPU02); notice the
> messages in yellow.
>
> ➜ ~ sudo systemctl status slurmd.service
> ● slurmd.service - Slurm node daemon
>      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; preset: enabled)
>      Active: active (running) since Sat 2024-09-28 14:00:22 -03; 4s ago
>    Main PID: 4821 (slurmd)
>       Tasks: 1
>      Memory: 17.0M (peak: 29.7M)
>         CPU: 174ms
>      CGroup: /system.slice/slurmd.service
>              └─4821 /usr/sbin/slurmd -D -s
>
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: MPI: Loading all types
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: No mpi.conf file (/etc/slurm/mpi.conf)
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: slurmd started on Sat, 28 Sep 2024 14:00:25 -0300
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: health_check success rc:0 output:
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: CPUs=128 Boards=1 Sockets=2 Cores=64 Threads=1 Memory=773744 TmpDisk=899181 Uptime=2829 CPUSpecList=(null) FeaturesAvail=(nu>
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _handle_node_reg_resp: slurmctld sent back 11 TRES
>
> This is the verbose output of the srun command; notice the messages in yellow.
>
> ➜ ~ srun -vvvp rtx hostname
> srun: defined options
> srun: -------------------- --------------------
> srun: partition : rtx
> srun: verbose : 3
> srun: -------------------- --------------------
> srun: end of defined options
> srun: debug: propagating RLIMIT_CPU=18446744073709551615
> srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
> srun: debug: propagating RLIMIT_DATA=18446744073709551615
> srun: debug: propagating RLIMIT_STACK=8388608
> srun: debug: propagating RLIMIT_CORE=0
> srun: debug: propagating RLIMIT_RSS=18446744073709551615
> srun: debug: propagating RLIMIT_NPROC=3090276
> srun: debug: propagating RLIMIT_NOFILE=1024
> srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
> srun: debug: propagating RLIMIT_AS=18446744073709551615
> srun: debug: propagating SLURM_PRIO_PROCESS=0
> srun: debug: propagating UMASK=0002
> srun: debug: Entering slurm_allocation_msg_thr_create()
> srun: debug: port from net_stream_listen is 34081
> srun: debug: Entering _msg_thr_internal
> srun: Waiting for resource configuration
> srun: Nodes nodeGPU02 are ready for job
> srun: jobid 57463: nodes(1):`nodeGPU02', cpu counts: 1(x1)
> srun: debug2: creating job with 1 tasks
> srun: debug2: cpu:1 is not a gres:
> srun: debug: requesting job 57463, user 99, nodes 1 including ((null))
> srun: debug: cpus 1, tasks 1, name hostname, relative 65534
> srun: CpuBindType=(null type)
> srun: debug: Entering slurm_step_launch
> srun: debug: mpi/pmix_v4: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 41393
> srun: debug: mpi/pmix_v4: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:285: setup process mapping in srun
> srun: debug: Entering _msg_thr_create()
> srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
> srun: debug: initialized stdio listening socket, port 33223
> srun: debug: Started IO server thread (140079189182144)
> srun: debug: Entering _launch_tasks
> srun: launching StepId=57463.0 on host nodeGPU02, 1 tasks: 0
> srun: debug2: Called _file_readable
> srun: debug2: Called _file_writable
> srun: route/default: init: route default plugin loaded
> srun: debug2: Called _file_writable
> srun: topology/none: init: topology NONE plugin loaded
> srun: debug2: Tree head got back 0 looking for 1
> srun: debug: slurm_recv_timeout at 0 of 4, timeout
> srun: error: slurm_receive_msgs: [[nodeGPU02]:6818] failed: Socket timed out on send/recv operation
> srun: debug2: Tree head got back 1
> srun: debug: launch returned msg_rc=1001 err=5004 type=9001
> srun: debug2: marking task 0 done on failed node 0
> srun: error: Task launch for StepId=57463.0 failed on node nodeGPU02: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted
> srun: debug2: false, shutdown
> srun: debug2: false, shutdown
> srun: debug2: Called _file_readable
> srun: debug2: Called _file_writable
> srun: debug2: Called _file_writable
> srun: debug2: false, shutdown
> srun: debug: IO thread exiting
> srun: debug: mpi/pmix_v4: _conn_readable: (null) [0]: pmixp_agent.c:105: false, shutdown
> srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:355: Abort thread exit
> srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
> srun: debug2: false, shutdown
> srun: debug: Leaving _msg_thr_internal
> srun: debug2: spank: spank_pyxis.so: exit = 0
>
> This is the `tail -f` log of slurmctld when launching a simple `srun hostname`
> [2024-09-28T14:08:10.264] ====================
> [2024-09-28T14:08:10.264] JobId=57463 nhosts:1 ncpus:1 node_req:1 nodes=nodeGPU02
> [2024-09-28T14:08:10.264] Node[0]:
> [2024-09-28T14:08:10.264] Mem(MB):65536:0 Sockets:2 Cores:64 CPUs:1:0
> [2024-09-28T14:08:10.264] Socket[0] Core[0] is allocated
> [2024-09-28T14:08:10.264] --------------------
> [2024-09-28T14:08:10.264] cpu_array_value[0]:1 reps:1
> [2024-09-28T14:08:10.264] ====================
> [2024-09-28T14:08:10.264] gres/gpu: state for nodeGPU02
> [2024-09-28T14:08:10.264] gres_cnt found:3 configured:3 avail:3 alloc:0
> [2024-09-28T14:08:10.264] gres_bit_alloc: of 3
> [2024-09-28T14:08:10.264] gres_used:(null)
> [2024-09-28T14:08:10.264] topo[0]:(null)(0)
> [2024-09-28T14:08:10.264] topo_core_bitmap[0]:0-63 of 128
> [2024-09-28T14:08:10.264] topo_gres_bitmap[0]:0 of 3
> [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[0]:0
> [2024-09-28T14:08:10.264] topo_gres_cnt_avail[0]:1
> [2024-09-28T14:08:10.264] topo[1]:(null)(0)
> [2024-09-28T14:08:10.264] topo_core_bitmap[1]:0-63 of 128
> [2024-09-28T14:08:10.264] topo_gres_bitmap[1]:1 of 3
> [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[1]:0
> [2024-09-28T14:08:10.264] topo_gres_cnt_avail[1]:1
> [2024-09-28T14:08:10.264] topo[2]:(null)(0)
> [2024-09-28T14:08:10.264] topo_core_bitmap[2]:0-63 of 128
> [2024-09-28T14:08:10.264] topo_gres_bitmap[2]:2 of 3
> [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[2]:0
> [2024-09-28T14:08:10.264] topo_gres_cnt_avail[2]:1
> [2024-09-28T14:08:10.265] sched: _slurm_rpc_allocate_resources JobId=57463 NodeList=nodeGPU02 usec=1339
> [2024-09-28T14:08:10.368] ====================
> [2024-09-28T14:08:10.368] JobId=57463 StepId=0
> [2024-09-28T14:08:10.368] JobNode[0] Socket[0] Core[0] is allocated
> [2024-09-28T14:08:10.368] ====================
> [2024-09-28T14:08:30.409] _job_complete: JobId=57463 WTERMSIG 12
> [2024-09-28T14:08:30.410] gres/gpu: state for nodeGPU02
> [2024-09-28T14:08:30.410] gres_cnt found:3 configured:3 avail:3 alloc:0
> [2024-09-28T14:08:30.410] gres_bit_alloc: of 3
> [2024-09-28T14:08:30.410] gres_used:(null)
> [2024-09-28T14:08:30.410] topo[0]:(null)(0)
> [2024-09-28T14:08:30.410] topo_core_bitmap[0]:0-63 of 128
> [2024-09-28T14:08:30.410] topo_gres_bitmap[0]:0 of 3
> [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[0]:0
> [2024-09-28T14:08:30.410] topo_gres_cnt_avail[0]:1
> [2024-09-28T14:08:30.410] topo[1]:(null)(0)
> [2024-09-28T14:08:30.410] topo_core_bitmap[1]:0-63 of 128
> [2024-09-28T14:08:30.410] topo_gres_bitmap[1]:1 of 3
> [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[1]:0
> [2024-09-28T14:08:30.410] topo_gres_cnt_avail[1]:1
> [2024-09-28T14:08:30.410] topo[2]:(null)(0)
> [2024-09-28T14:08:30.410] topo_core_bitmap[2]:0-63 of 128
> [2024-09-28T14:08:30.410] topo_gres_bitmap[2]:2 of 3
> [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[2]:0
> [2024-09-28T14:08:30.410] topo_gres_cnt_avail[2]:1
> [2024-09-28T14:08:30.410] _job_complete: JobId=57463 done
> [2024-09-28T14:08:58.687] gres/gpu: state for nodeGPU01
> [2024-09-28T14:08:58.687] gres_cnt found:8 configured:8 avail:8 alloc:0
> [2024-09-28T14:08:58.687] gres_bit_alloc: of 8
> [2024-09-28T14:08:58.687] gres_used:(null)
> [2024-09-28T14:08:58.687] topo[0]:A100(808464705)
> [2024-09-28T14:08:58.687] topo_core_bitmap[0]:48-63 of 128
> [2024-09-28T14:08:58.687] topo_gres_bitmap[0]:0 of 8
> [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[0]:0
> [2024-09-28T14:08:58.687] topo_gres_cnt_avail[0]:1
> [2024-09-28T14:08:58.687] topo[1]:A100(808464705)
> [2024-09-28T14:08:58.687] topo_core_bitmap[1]:48-63 of 128
> [2024-09-28T14:08:58.687] topo_gres_bitmap[1]:1 of 8
> [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[1]:0
> [2024-09-28T14:08:58.687] topo_gres_cnt_avail[1]:1
> [2024-09-28T14:08:58.687] topo[2]:A100(808464705)
> [2024-09-28T14:08:58.687] topo_core_bitmap[2]:16-31 of 128
> [2024-09-28T14:08:58.687] topo_gres_bitmap[2]:2 of 8
> [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[2]:0
> [2024-09-28T14:08:58.687] topo_gres_cnt_avail[2]:1
> [2024-09-28T14:08:58.687] topo[3]:A100(808464705)
> [2024-09-28T14:08:58.687] topo_core_bitmap[3]:16-31 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[3]:3 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[3]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[3]:1
> [2024-09-28T14:08:58.688] topo[4]:A100(808464705)
> [2024-09-28T14:08:58.688] topo_core_bitmap[4]:112-127 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[4]:4 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[4]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[4]:1
> [2024-09-28T14:08:58.688] topo[5]:A100(808464705)
> [2024-09-28T14:08:58.688] topo_core_bitmap[5]:112-127 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[5]:5 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[5]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[5]:1
> [2024-09-28T14:08:58.688] topo[6]:A100(808464705)
> [2024-09-28T14:08:58.688] topo_core_bitmap[6]:80-95 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[6]:6 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[6]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[6]:1
> [2024-09-28T14:08:58.688] topo[7]:A100(808464705)
> [2024-09-28T14:08:58.688] topo_core_bitmap[7]:80-95 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[7]:7 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[7]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[7]:1
> [2024-09-28T14:08:58.688] type[0]:A100(808464705)
> [2024-09-28T14:08:58.688] type_cnt_alloc[0]:0
> [2024-09-28T14:08:58.688] type_cnt_avail[0]:8
> [2024-09-28T14:08:58.690] gres/gpu: state for nodeGPU02
> [2024-09-28T14:08:58.690] gres_cnt found:3 configured:3 avail:3 alloc:0
> [2024-09-28T14:08:58.690] gres_bit_alloc: of 3
> [2024-09-28T14:08:58.690] gres_used:(null)
> [2024-09-28T14:08:58.690] topo[0]:(null)(0)
> [2024-09-28T14:08:58.690] topo_core_bitmap[0]:0-63 of 128
> [2024-09-28T14:08:58.690] topo_gres_bitmap[0]:0 of 3
> [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[0]:0
> [2024-09-28T14:08:58.690] topo_gres_cnt_avail[0]:1
> [2024-09-28T14:08:58.690] topo[1]:(null)(0)
> [2024-09-28T14:08:58.690] topo_core_bitmap[1]:0-63 of 128
> [2024-09-28T14:08:58.690] topo_gres_bitmap[1]:1 of 3
> [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[1]:0
> [2024-09-28T14:08:58.690] topo_gres_cnt_avail[1]:1
> [2024-09-28T14:08:58.690] topo[2]:(null)(0)
> [2024-09-28T14:08:58.690] topo_core_bitmap[2]:0-63 of 128
> [2024-09-28T14:08:58.690] topo_gres_bitmap[2]:2 of 3
> [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[2]:0
> [2024-09-28T14:08:58.690] topo_gres_cnt_avail[2]:1
> [2024-09-28T14:09:49.763] Resending TERMINATE_JOB request JobId=57463 Nodelist=nodeGPU02
>
> This is the `tail -f` log of slurmd when launching the job from master, notice the messages in yellow
> [2024-09-28T14:08:10.270] debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
> [2024-09-28T14:08:10.321] debug2: prep/script: _run_subpath_command: prolog success rc:0 output:
> [2024-09-28T14:08:10.323] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
> [2024-09-28T14:08:10.377] debug: Checking credential with 720 bytes of sig data
> [2024-09-28T14:08:10.377] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
> [2024-09-28T14:08:10.377] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
> [2024-09-28T14:08:10.377] launch task StepId=57463.0 request from UID:10082 GID:10088 HOST:10.10.0.1 PORT:36478
> [2024-09-28T14:08:10.377] CPU_BIND: JobNode[0] CPU[0] Step alloc
> [2024-09-28T14:08:10.377] CPU_BIND: ====================
> [2024-09-28T14:08:10.377] CPU_BIND: Memory extracted from credential for StepId=57463.0 job_mem_limit=65536 step_mem_limit=65536
> [2024-09-28T14:08:10.377] debug: Waiting for job 57463's prolog to complete
> [2024-09-28T14:08:10.377] debug: Finished wait for job 57463's prolog to complete
> [2024-09-28T14:08:10.378] error: _send_slurmstepd_init failed
> [2024-09-28T14:08:10.384] debug2: debug level read from slurmd is 'debug2'.
> [2024-09-28T14:08:10.385] debug2: _read_slurmd_conf_lite: slurmd sent 11 TRES.
> [2024-09-28T14:08:10.385] debug2: Received CPU frequency information for 128 CPUs
> [2024-09-28T14:08:10.385] select/cons_tres: common_init: select/cons_tres loaded
> [2024-09-28T14:08:10.385] debug: switch/none: init: switch NONE plugin loaded
> [2024-09-28T14:08:10.385] [57463.0] debug: auth/munge: init: loaded
> [2024-09-28T14:08:10.385] [57463.0] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
> [2024-09-28T14:08:10.395] [57463.0] debug: cgroup/v2: init: Cgroup v2 plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug2: Reading acct_gather.conf file /etc/slurm/acct_gather.conf
> [2024-09-28T14:08:10.396] [57463.0] debug2: hwloc_topology_init
> [2024-09-28T14:08:10.399] [57463.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
> [2024-09-28T14:08:10.400] [57463.0] debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: core enforcement enabled
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: TotCfgRealMem:773744M allowed:100%(enforced), swap:0%(enforced), max:100%(773744M) max+swap:0%(773744M) min:30M kmem:100%(773744M permissive) min:30M
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: memory enforcement enabled
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: device enforcement enabled
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
> [2024-09-28T14:08:10.401] [57463.0] debug: jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
> [2024-09-28T14:08:10.401] [57463.0] cred/munge: init: Munge credential signature plugin loaded
> [2024-09-28T14:08:10.401] [57463.0] debug: job_container/none: init: job_container none plugin loaded
> [2024-09-28T14:08:10.401] [57463.0] debug: gres/gpu: init: loaded
> [2024-09-28T14:08:10.401] [57463.0] debug: gpu/generic: init: init: GPU Generic plugin loaded
> [2024-09-28T14:08:30.415] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
> [2024-09-28T14:08:30.415] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [2024-09-28T14:08:30.415] debug: _rpc_terminate_job: uid = 777 JobId=57463
> [2024-09-28T14:08:30.415] debug: credential for job 57463 revoked
> [2024-09-28T14:08:30.415] debug: sent SUCCESS, waiting for step to start
> [2024-09-28T14:08:30.415] debug: Blocked waiting for JobId=57463, all steps
> [2024-09-28T14:08:58.688] debug2: Start processing RPC: REQUEST_NODE_REGISTRATION_STATUS
> [2024-09-28T14:08:58.689] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
> [2024-09-28T14:08:58.689] debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
> [2024-09-28T14:08:58.692] debug: _handle_node_reg_resp: slurmctld sent back 11 TRES.
> [2024-09-28T14:08:58.692] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
>
> --
> Cristóbal A. Navarro
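In hindsight, the repeated `_step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused` lines in the slurmd log above seem to have been the clue: that step socket was left behind in the slurmd spool directory, which fits the leftover-files explanation at the top. A quick way to check for this kind of stale state (a sketch, assuming SlurmdSpoolDir is /var/spool/slurmd as in these logs):

➜ ~ sudo ls -la /var/spool/slurmd/ /var/spool/slurmd/slurmd/
# look for step sockets such as nodeGPU02_<jobid>.<stepid> and other state files whose
# timestamps predate the OS upgrade; those are the candidates to remove with slurmd stopped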