> [root@n6 /]# si
>
> PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK  TIMELIMIT  AVAIL_FEATURES  NODELIST
> debug*     6      0/6/0/6         1:4:2  7785    113264    infinite   (null)          c[1-6]
>
> (for a moment)
>
> [root@n6 /]# si
>
> PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK  TIMELIMIT  AVAIL_FEATURES  NODELIST
> debug*     6      0/0/6/6         1:4:2  7785    113264    infinite   (null)          c[1-6]



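0/0/6/6 means your nodes are dying. The NODES(A/I/O/T) column counts
allocated/idle/other/total nodes, so all six have gone from idle to
"other" (down, drained, or not responding).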
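
To make the A/I/O/T field easier to read at a glance, here is a minimal
sketch that splits it into named counts. The function name decode_aiot is
made up for illustration; it is not a Slurm command.

```shell
# Hypothetical helper (not part of Slurm): label the four counts in an
# sinfo NODES(A/I/O/T) field, i.e. allocated/idle/other/total.
decode_aiot() {
    # Split the argument on "/" into four fields.
    IFS=/ read -r alloc idle other total <<EOF
$1
EOF
    printf 'allocated=%s idle=%s other=%s total=%s\n' \
        "$alloc" "$idle" "$other" "$total"
}

decode_aiot 0/6/0/6   # the healthy state seen "for a moment"
decode_aiot 0/0/6/6   # the failing state: every node counted as "other"
```

A non-zero "other" count is the cue to go and read the slurmd logs.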

You need to look at /var/log/slurm/slurmd.log (or wherever you put the
slurmd logs on the machine, as dictated by SlurmdLogFile= in slurm.conf)
on each of the nodes.
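
If you are not sure where the log ended up, something like the sketch
below can pull the SlurmdLogFile value out of a slurm.conf-style file.
The helper name slurmd_log_path and the /etc/slurm/slurm.conf path in the
usage comment are assumptions; adjust them to your install.

```shell
# Hypothetical helper: print the SlurmdLogFile value from the config file
# given as $1. Commented-out lines are skipped, the key is matched
# case-insensitively, and the last setting in the file wins.
slurmd_log_path() {
    awk -F= '
        { key = $1; gsub(/[ \t]/, "", key) }
        tolower(key) == "slurmdlogfile" {
            val = $2
            sub(/[ \t]*#.*$/, "", val)        # drop a trailing comment
            gsub(/^[ \t]+|[ \t]+$/, "", val)  # trim surrounding blanks
            print val
        }
    ' "$1" | tail -n 1
}

# Usage on a node (path is an assumption):
#   tail -n 50 "$(slurmd_log_path /etc/slurm/slurm.conf)"
```

It prints nothing when the setting is absent, in which case the log is
wherever your Slurm build sends it by default.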

I would predict that there is something wrong with your cgroup.conf.

Try:

 - confirming that the /etc/slurm/cgroup directory exists on all nodes
   (as per your cgroup.conf)
 - commenting out everything in cgroup.conf except CgroupAutomount=yes
   and ConstrainCores=yes
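
With everything else commented out, the stripped-down cgroup.conf for the
test would look roughly like this. Only the two settings named above are
kept; the comment lines are just annotation.

```
# cgroup.conf, reduced to the minimum for troubleshooting
CgroupAutomount=yes
ConstrainCores=yes
# all other lines commented out for now
```

If the nodes stay up with this file, re-enable the other settings one at
a time to find the one that breaks slurmd.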

Cheers
L.


