Hello,

Thanks again for your documentation. I deployed 24.05.2 last week, but this
weekend slurmctld crashed with only the following in the logs:

"Aug 25 15:33:02 slurmadmin slurmctld[79950]: free(): invalid next size (fast)"

I also regularly see the messages below in my logs, even though these two
machines are VMs on the same subnet, and slurmadmin is the very machine that
runs both slurmctld and slurmd, so it should not be able to lose contact with
itself. Meanwhile, my compute nodes never disconnect.
/var/log/slurm/slurmctld.log:[2024-08-25T14:12:02.009] agent/is_node_resp: node:slurmadmin RPC:REQUEST_PING : Communication connection failure
/var/log/slurm/slurmctld.log:[2024-08-25T14:12:02.009] agent/is_node_resp: node:vmjupyter RPC:REQUEST_PING : Communication connection failure
/var/log/slurm/slurmctld.log:[2024-08-25T14:12:02.009] agent/is_node_resp: node:vmdev RPC:REQUEST_PING : Communication connection failure

Should I open a new topic for this?

Thank you in advance.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
