Update: the problem is finally solved. In case someone runs into something similar, the issue was that there were leftover files in /var/spool/slurmd/ from the previous Ubuntu version (22.04); for some reason those made slurmd behave like that. Deleting all the files inside that directory solved the problem!
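For anyone who wants the concrete steps, the cleanup looks roughly like this. This is only a sketch: the one step confirmed to matter is deleting the leftover files under /var/spool/slurmd/ (or whatever SlurmdSpoolDir points to in your slurm.conf); stopping/restarting slurmd around it and resuming the node afterwards are just my usual precautions.

➜ ~ sudo systemctl stop slurmd                              # on the affected node (nodeGPU02 here)
➜ ~ sudo rm -rf /var/spool/slurmd/*                         # delete the stale state/socket files left by the old install
➜ ~ sudo systemctl start slurmd
➜ ~ sudo scontrol update NodeName=nodeGPU02 State=RESUME    # from the master, if the node was left drained/down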
On Sat, Sep 28, 2024, 2:13 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:

> Dear community,
> I am having a strange issue and have been unable to find the cause. Last week I
> did a full update on the cluster, which is composed of the master node and two
> compute nodes (nodeGPU01 -> DGX A100 and nodeGPU02 -> custom GPU server).
> After the update:
>
> - the master node ended up with Ubuntu 24.04,
> - nodeGPU01 with the latest DGX OS (still Ubuntu 22.04),
> - nodeGPU02 with Ubuntu 24.04 LTS,
> - launching jobs from the master on the partitions of nodeGPU01 works perfectly,
> - launching jobs from the master on the partition of nodeGPU02 stopped working (it hangs).
>
> nodeGPU02 (Ubuntu 24.04) is no longer processing jobs successfully, while
> nodeGPU01 works perfectly even though the master is on Ubuntu 24.04.
> Any help is welcome; I have tried many things and had no success in finding the
> cause of this. Please let me know if you need more information.
> Many thanks in advance.
>
> This is the initial `slurmd` log of the problematic node (nodeGPU02); notice the
> messages in yellow.
>
> ➜ ~ sudo systemctl status slurmd.service
> ● slurmd.service - Slurm node daemon
>      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; preset: enabled)
>      Active: active (running) since Sat 2024-09-28 14:00:22 -03; 4s ago
>    Main PID: 4821 (slurmd)
>       Tasks: 1
>      Memory: 17.0M (peak: 29.7M)
>         CPU: 174ms
>      CGroup: /system.slice/slurmd.service
>              └─4821 /usr/sbin/slurmd -D -s
>
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: MPI: Loading all types
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: No mpi.conf file (/etc/slurm/mpi.conf)
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: slurmd started on Sat, 28 Sep 2024 14:00:25 -0300
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: health_check success rc:0 output:
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: CPUs=128 Boards=1 Sockets=2 Cores=64 Threads=1 Memory=773744 TmpDisk=899181 Uptime=2829 CPUSpecList=(null) FeaturesAvail=(nu>
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _handle_node_reg_resp: slurmctld sent back 11 TRES
>
> This is the verbose output of the srun command; notice the messages in yellow.
>
> ➜ ~ srun -vvvp rtx hostname
> srun: defined options
> srun: -------------------- --------------------
> srun: partition : rtx
> srun: verbose : 3
> srun: -------------------- --------------------
> srun: end of defined options
> srun: debug: propagating RLIMIT_CPU=18446744073709551615
> srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
> srun: debug: propagating RLIMIT_DATA=18446744073709551615
> srun: debug: propagating RLIMIT_STACK=8388608
> srun: debug: propagating RLIMIT_CORE=0
> srun: debug: propagating RLIMIT_RSS=18446744073709551615
> srun: debug: propagating RLIMIT_NPROC=3090276
> srun: debug: propagating RLIMIT_NOFILE=1024
> srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
> srun: debug: propagating RLIMIT_AS=18446744073709551615
> srun: debug: propagating SLURM_PRIO_PROCESS=0
> srun: debug: propagating UMASK=0002
> srun: debug: Entering slurm_allocation_msg_thr_create()
> srun: debug: port from net_stream_listen is 34081
> srun: debug: Entering _msg_thr_internal
> srun: Waiting for resource configuration
> srun: Nodes nodeGPU02 are ready for job
> srun: jobid 57463: nodes(1):`nodeGPU02', cpu counts: 1(x1)
> srun: debug2: creating job with 1 tasks
> srun: debug2: cpu:1 is not a gres:
> srun: debug: requesting job 57463, user 99, nodes 1 including ((null))
> srun: debug: cpus 1, tasks 1, name hostname, relative 65534
> srun: CpuBindType=(null type)
> srun: debug: Entering slurm_step_launch
> srun: debug: mpi/pmix_v4: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 41393
> srun: debug: mpi/pmix_v4: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:285: setup process mapping in srun
> srun: debug: Entering _msg_thr_create()
> srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread
> srun: debug: initialized stdio listening socket, port 33223
> srun: debug: Started IO server thread (140079189182144)
> srun: debug: Entering _launch_tasks
> srun: launching StepId=57463.0 on host nodeGPU02, 1 tasks: 0
> srun: debug2: Called _file_readable
> srun: debug2: Called _file_writable
> srun: route/default: init: route default plugin loaded
> srun: debug2: Called _file_writable
> srun: topology/none: init: topology NONE plugin loaded
> srun: debug2: Tree head got back 0 looking for 1
> srun: debug: slurm_recv_timeout at 0 of 4, timeout
> srun: error: slurm_receive_msgs: [[nodeGPU02]:6818] failed: Socket timed out on send/recv operation
> srun: debug2: Tree head got back 1
> srun: debug: launch returned msg_rc=1001 err=5004 type=9001
> srun: debug2: marking task 0 done on failed node 0
> srun: error: Task launch for StepId=57463.0 failed on node nodeGPU02: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted
> srun: debug2: false, shutdown
> srun: debug2: false, shutdown
> srun: debug2: Called _file_readable
> srun: debug2: Called _file_writable
> srun: debug2: Called _file_writable
> srun: debug2: false, shutdown
> srun: debug: IO thread exiting
> srun: debug: mpi/pmix_v4: _conn_readable: (null) [0]: pmixp_agent.c:105: false, shutdown
> srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:355: Abort thread exit
> srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
> srun: debug2: false, shutdown
> srun: debug: Leaving _msg_thr_internal
> srun: debug2: spank: spank_pyxis.so: exit = 0
>
> This is the `tail -f` log of slurmctld when launching a simple `srun hostname`
> [2024-09-28T14:08:10.264] ====================
> [2024-09-28T14:08:10.264] JobId=57463 nhosts:1 ncpus:1 node_req:1 nodes=nodeGPU02
> [2024-09-28T14:08:10.264] Node[0]:
> [2024-09-28T14:08:10.264] Mem(MB):65536:0 Sockets:2 Cores:64 CPUs:1:0
> [2024-09-28T14:08:10.264] Socket[0] Core[0] is allocated
> [2024-09-28T14:08:10.264] --------------------
> [2024-09-28T14:08:10.264] cpu_array_value[0]:1 reps:1
> [2024-09-28T14:08:10.264] ====================
> [2024-09-28T14:08:10.264] gres/gpu: state for nodeGPU02
> [2024-09-28T14:08:10.264] gres_cnt found:3 configured:3 avail:3 alloc:0
> [2024-09-28T14:08:10.264] gres_bit_alloc: of 3
> [2024-09-28T14:08:10.264] gres_used:(null)
> [2024-09-28T14:08:10.264] topo[0]:(null)(0)
> [2024-09-28T14:08:10.264] topo_core_bitmap[0]:0-63 of 128
> [2024-09-28T14:08:10.264] topo_gres_bitmap[0]:0 of 3
> [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[0]:0
> [2024-09-28T14:08:10.264] topo_gres_cnt_avail[0]:1
> [2024-09-28T14:08:10.264] topo[1]:(null)(0)
> [2024-09-28T14:08:10.264] topo_core_bitmap[1]:0-63 of 128
> [2024-09-28T14:08:10.264] topo_gres_bitmap[1]:1 of 3
> [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[1]:0
> [2024-09-28T14:08:10.264] topo_gres_cnt_avail[1]:1
> [2024-09-28T14:08:10.264] topo[2]:(null)(0)
> [2024-09-28T14:08:10.264] topo_core_bitmap[2]:0-63 of 128
> [2024-09-28T14:08:10.264] topo_gres_bitmap[2]:2 of 3
> [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[2]:0
> [2024-09-28T14:08:10.264] topo_gres_cnt_avail[2]:1
> [2024-09-28T14:08:10.265] sched: _slurm_rpc_allocate_resources JobId=57463 NodeList=nodeGPU02 usec=1339
> [2024-09-28T14:08:10.368] ====================
> [2024-09-28T14:08:10.368] JobId=57463 StepId=0
> [2024-09-28T14:08:10.368] JobNode[0] Socket[0] Core[0] is allocated
> [2024-09-28T14:08:10.368] ====================
> [2024-09-28T14:08:30.409] _job_complete: JobId=57463 WTERMSIG 12
> [2024-09-28T14:08:30.410] gres/gpu: state for nodeGPU02
> [2024-09-28T14:08:30.410] gres_cnt found:3 configured:3 avail:3 alloc:0
> [2024-09-28T14:08:30.410] gres_bit_alloc: of 3
> [2024-09-28T14:08:30.410] gres_used:(null)
> [2024-09-28T14:08:30.410] topo[0]:(null)(0)
> [2024-09-28T14:08:30.410] topo_core_bitmap[0]:0-63 of 128
> [2024-09-28T14:08:30.410] topo_gres_bitmap[0]:0 of 3
> [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[0]:0
> [2024-09-28T14:08:30.410] topo_gres_cnt_avail[0]:1
> [2024-09-28T14:08:30.410] topo[1]:(null)(0)
> [2024-09-28T14:08:30.410] topo_core_bitmap[1]:0-63 of 128
> [2024-09-28T14:08:30.410] topo_gres_bitmap[1]:1 of 3
> [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[1]:0
> [2024-09-28T14:08:30.410] topo_gres_cnt_avail[1]:1
> [2024-09-28T14:08:30.410] topo[2]:(null)(0)
> [2024-09-28T14:08:30.410] topo_core_bitmap[2]:0-63 of 128
> [2024-09-28T14:08:30.410] topo_gres_bitmap[2]:2 of 3
> [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[2]:0
> [2024-09-28T14:08:30.410] topo_gres_cnt_avail[2]:1
> [2024-09-28T14:08:30.410] _job_complete: JobId=57463 done
> [2024-09-28T14:08:58.687] gres/gpu: state for nodeGPU01
> [2024-09-28T14:08:58.687] gres_cnt found:8 configured:8 avail:8 alloc:0
> [2024-09-28T14:08:58.687] gres_bit_alloc: of 8
> [2024-09-28T14:08:58.687] gres_used:(null)
> [2024-09-28T14:08:58.687] topo[0]:A100(808464705)
> [2024-09-28T14:08:58.687] topo_core_bitmap[0]:48-63 of 128
> [2024-09-28T14:08:58.687] topo_gres_bitmap[0]:0 of 8
> [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[0]:0
> [2024-09-28T14:08:58.687] topo_gres_cnt_avail[0]:1
> [2024-09-28T14:08:58.687] topo[1]:A100(808464705)
> [2024-09-28T14:08:58.687] topo_core_bitmap[1]:48-63 of 128
> [2024-09-28T14:08:58.687] topo_gres_bitmap[1]:1 of 8
> [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[1]:0
> [2024-09-28T14:08:58.687] topo_gres_cnt_avail[1]:1
> [2024-09-28T14:08:58.687] topo[2]:A100(808464705)
> [2024-09-28T14:08:58.687] topo_core_bitmap[2]:16-31 of 128
> [2024-09-28T14:08:58.687] topo_gres_bitmap[2]:2 of 8
> [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[2]:0
> [2024-09-28T14:08:58.687] topo_gres_cnt_avail[2]:1
> [2024-09-28T14:08:58.687] topo[3]:A100(808464705)
> [2024-09-28T14:08:58.687] topo_core_bitmap[3]:16-31 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[3]:3 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[3]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[3]:1
> [2024-09-28T14:08:58.688] topo[4]:A100(808464705)
> [2024-09-28T14:08:58.688] topo_core_bitmap[4]:112-127 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[4]:4 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[4]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[4]:1
> [2024-09-28T14:08:58.688] topo[5]:A100(808464705)
> [2024-09-28T14:08:58.688] topo_core_bitmap[5]:112-127 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[5]:5 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[5]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[5]:1
> [2024-09-28T14:08:58.688] topo[6]:A100(808464705)
> [2024-09-28T14:08:58.688] topo_core_bitmap[6]:80-95 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[6]:6 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[6]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[6]:1
> [2024-09-28T14:08:58.688] topo[7]:A100(808464705)
> [2024-09-28T14:08:58.688] topo_core_bitmap[7]:80-95 of 128
> [2024-09-28T14:08:58.688] topo_gres_bitmap[7]:7 of 8
> [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[7]:0
> [2024-09-28T14:08:58.688] topo_gres_cnt_avail[7]:1
> [2024-09-28T14:08:58.688] type[0]:A100(808464705)
> [2024-09-28T14:08:58.688] type_cnt_alloc[0]:0
> [2024-09-28T14:08:58.688] type_cnt_avail[0]:8
> [2024-09-28T14:08:58.690] gres/gpu: state for nodeGPU02
> [2024-09-28T14:08:58.690] gres_cnt found:3 configured:3 avail:3 alloc:0
> [2024-09-28T14:08:58.690] gres_bit_alloc: of 3
> [2024-09-28T14:08:58.690] gres_used:(null)
> [2024-09-28T14:08:58.690] topo[0]:(null)(0)
> [2024-09-28T14:08:58.690] topo_core_bitmap[0]:0-63 of 128
> [2024-09-28T14:08:58.690] topo_gres_bitmap[0]:0 of 3
> [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[0]:0
> [2024-09-28T14:08:58.690] topo_gres_cnt_avail[0]:1
> [2024-09-28T14:08:58.690] topo[1]:(null)(0)
> [2024-09-28T14:08:58.690] topo_core_bitmap[1]:0-63 of 128
> [2024-09-28T14:08:58.690] topo_gres_bitmap[1]:1 of 3
> [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[1]:0
> [2024-09-28T14:08:58.690] topo_gres_cnt_avail[1]:1
> [2024-09-28T14:08:58.690] topo[2]:(null)(0)
> [2024-09-28T14:08:58.690] topo_core_bitmap[2]:0-63 of 128
> [2024-09-28T14:08:58.690] topo_gres_bitmap[2]:2 of 3
> [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[2]:0
> [2024-09-28T14:08:58.690] topo_gres_cnt_avail[2]:1
> [2024-09-28T14:09:49.763] Resending TERMINATE_JOB request JobId=57463 Nodelist=nodeGPU02
>
> This is the `tail -f` log of slurmd when launching the job from master, notice the messages in yellow
> [2024-09-28T14:08:10.270] debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
> [2024-09-28T14:08:10.321] debug2: prep/script: _run_subpath_command: prolog success rc:0 output:
> [2024-09-28T14:08:10.323] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
> [2024-09-28T14:08:10.377] debug: Checking credential with 720 bytes of sig data
> [2024-09-28T14:08:10.377] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
> [2024-09-28T14:08:10.377] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
> [2024-09-28T14:08:10.377] launch task StepId=57463.0 request from UID:10082 GID:10088 HOST:10.10.0.1 PORT:36478
> [2024-09-28T14:08:10.377] CPU_BIND: JobNode[0] CPU[0] Step alloc
> [2024-09-28T14:08:10.377] CPU_BIND: ====================
> [2024-09-28T14:08:10.377] CPU_BIND: Memory extracted from credential for StepId=57463.0 job_mem_limit=65536 step_mem_limit=65536
> [2024-09-28T14:08:10.377] debug: Waiting for job 57463's prolog to complete
> [2024-09-28T14:08:10.377] debug: Finished wait for job 57463's prolog to complete
> [2024-09-28T14:08:10.378] error: _send_slurmstepd_init failed
> [2024-09-28T14:08:10.384] debug2: debug level read from slurmd is 'debug2'.
> [2024-09-28T14:08:10.385] debug2: _read_slurmd_conf_lite: slurmd sent 11 TRES.
> [2024-09-28T14:08:10.385] debug2: Received CPU frequency information for 128 CPUs
> [2024-09-28T14:08:10.385] select/cons_tres: common_init: select/cons_tres loaded
> [2024-09-28T14:08:10.385] debug: switch/none: init: switch NONE plugin loaded
> [2024-09-28T14:08:10.385] [57463.0] debug: auth/munge: init: loaded
> [2024-09-28T14:08:10.385] [57463.0] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
> [2024-09-28T14:08:10.395] [57463.0] debug: cgroup/v2: init: Cgroup v2 plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
> [2024-09-28T14:08:10.396] [57463.0] debug2: Reading acct_gather.conf file /etc/slurm/acct_gather.conf
> [2024-09-28T14:08:10.396] [57463.0] debug2: hwloc_topology_init
> [2024-09-28T14:08:10.399] [57463.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
> [2024-09-28T14:08:10.400] [57463.0] debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: core enforcement enabled
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: TotCfgRealMem:773744M allowed:100%(enforced), swap:0%(enforced), max:100%(773744M) max+swap:0%(773744M) min:30M kmem:100%(773744M permissive) min:30M
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: memory enforcement enabled
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: device enforcement enabled
> [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
> [2024-09-28T14:08:10.401] [57463.0] debug: jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
> [2024-09-28T14:08:10.401] [57463.0] cred/munge: init: Munge credential signature plugin loaded
> [2024-09-28T14:08:10.401] [57463.0] debug: job_container/none: init: job_container none plugin loaded
> [2024-09-28T14:08:10.401] [57463.0] debug: gres/gpu: init: loaded
> [2024-09-28T14:08:10.401] [57463.0] debug: gpu/generic: init: init: GPU Generic plugin loaded
> [2024-09-28T14:08:30.415] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
> [2024-09-28T14:08:30.415] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [2024-09-28T14:08:30.415] debug: _rpc_terminate_job: uid = 777 JobId=57463
> [2024-09-28T14:08:30.415] debug: credential for job 57463 revoked
> [2024-09-28T14:08:30.415] debug: sent SUCCESS, waiting for step to start
> [2024-09-28T14:08:30.415] debug: Blocked waiting for JobId=57463, all steps
> [2024-09-28T14:08:58.688] debug2: Start processing RPC: REQUEST_NODE_REGISTRATION_STATUS
> [2024-09-28T14:08:58.689] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
> [2024-09-28T14:08:58.689] debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused
> [2024-09-28T14:08:58.692] debug: _handle_node_reg_resp: slurmctld sent back 11 TRES.
> [2024-09-28T14:08:58.692] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
>
> --
> Cristóbal A. Navarro
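In hindsight, the repeated `_step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused` lines in the slurmd log above seem to have been the clue: that step socket was left behind in the slurmd spool directory, which fits the leftover-files explanation at the top. A quick way to check for this kind of stale state (a sketch, assuming SlurmdSpoolDir is /var/spool/slurmd as in these logs):

➜ ~ sudo ls -la /var/spool/slurmd/ /var/spool/slurmd/slurmd/
# look for step sockets such as nodeGPU02_<jobid>.<stepid> and other state files whose
# timestamps predate the OS upgrade; those are the candidates to remove with slurmd stopped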