Sorry, here is my previous slurm.conf:

ProctrackType=proctrack/pgid
#TaskPlugin=task/cgroup
JobacctGatherType=jobacct_gather/linux
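(A quick sanity-check sketch; a heredoc copy stands in for the real /etc/slurm/slurm.conf.) Only uncommented lines are active settings, so in this config TaskPlugin=task/cgroup is disabled while the other two plugins are in effect:

```shell
# Sketch: list the active (uncommented) plugin settings.
# A heredoc copy stands in for /etc/slurm/slurm.conf.
cat > /tmp/slurm.conf.sample <<'EOF'
ProctrackType=proctrack/pgid
#TaskPlugin=task/cgroup
JobacctGatherType=jobacct_gather/linux
EOF
# The line starting with "#" is a comment, so it does not match:
grep -E '^(ProctrackType|TaskPlugin|JobacctGatherType)=' /tmp/slurm.conf.sample
```

This prints only the ProctrackType and JobacctGatherType lines; slurmctld likewise ignores the commented TaskPlugin line.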

And these are the results when I change each line in turn:

1) ProctrackType=proctrack/pgid --> ProctrackType=proctrack/cgroup
[root@n6 /]# srun -N5 hostname
srun: error: Task launch for 2.0 failed on node c1: Communication
connection failure
srun: error: Task launch for 2.0 failed on node c4: Communication
connection failure
srun: error: Task launch for 2.0 failed on node c2: Communication
connection failure
srun: error: Task launch for 2.0 failed on node c5: Communication
connection failure
srun: error: Task launch for 2.0 failed on node c3: Communication
connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
[root@n6 /]# si
PARTITION            NODES NODES(A/I/O/T) S:C:T    MEMORY     TMP_DISK
TIMELIMIT   AVAIL_FEATURES   NODELIST
debug*               6     5/1/0/6        1:4:2    7785       113264
infinite    (null)           c[1-6]
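As a reading aid for the si output above: the NODES(A/I/O/T) column packs four counts into one field, Allocated / Idle / Other (down, drained, etc.) / Total. A throwaway parse of the sample value:

```shell
# Reading aid: split sinfo's A/I/O/T field into labeled counts.
# "5/1/0/6" is the value shown in the output above.
echo "5/1/0/6" | awk -F/ '{ printf "allocated=%s idle=%s other=%s total=%s\n", $1, $2, $3, $4 }'
```

So here five nodes are still marked allocated to the stuck step, while in test 3 the later 0/0/6/6 value means all six nodes ended up in the "other" (down) bucket.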


2) #TaskPlugin=task/cgroup --> TaskPlugin=task/cgroup (uncommented)
Works fine:
[root@n6 /]# srun -N5 hostname
n4
n5
n3
n1
n2

3) JobacctGatherType=jobacct_gather/linux -->
JobacctGatherType=jobacct_gather/cgroup
The nodes go down automatically:

[root@n6 /]# si

PARTITION            NODES NODES(A/I/O/T) S:C:T    MEMORY     TMP_DISK
TIMELIMIT   AVAIL_FEATURES   NODELIST

debug*               6     0/6/0/6        1:4:2    7785       113264
infinite    (null)           c[1-6]

(and a moment later)

[root@n6 /]# si

PARTITION            NODES NODES(A/I/O/T) S:C:T    MEMORY     TMP_DISK
TIMELIMIT   AVAIL_FEATURES   NODELIST

debug*               6     0/0/6/6        1:4:2    7785       113264
infinite    (null)           c[1-6]


Which point should I check first?





Sumin Han
Undergraduate '13, School of Computing
Korea Advanced Institute of Science and Technology
Daehak-ro 291
Yuseong-gu, Daejeon
Republic of Korea 305-701
Tel. +82-10-2075-6911

2017-08-02 12:01 GMT+09:00 Lachlan Musicman <[email protected]>:

> Sumin,
>
> The error message is saying that the node is down.
>
> When you say "works with sinfo", you need to show us what that means -
> sinfo is a command that interrogates the state of nodes, whereas srun sends
> commands *to* nodes. So sinfo is meant to work even if the nodes are
> down. It is the software that will tell you the state the nodes are in.
>
> What is the output of sinfo?
>
>
> cheers
> L.
>
> ------
> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
> civics is the insistence that we cannot ignore the truth, nor should we
> panic about it. It is a shared consciousness that our institutions have
> failed and our ecosystem is collapsing, yet we are still here — and we are
> creative agents who can shape our destinies. Apocalyptic civics is the
> conviction that the only way out is through, and the only way through is
> together. "
>
> *Greg Bloom* @greggish
> https://twitter.com/greggish/status/873177525903609857
>
> On 2 August 2017 at 12:19, 한수민 <[email protected]> wrote:
>
>> I succeeded in setting up the basic Slurm environment. I've also added
>> these lines to each file.
>>
>> /etc/slurm/slurm.conf:
>>
>> ProctrackType=proctrack/cgroup
>> TaskPlugin=task/cgroup
>> JobacctGatherType=jobacct_gather/cgroup
>>
>> /etc/slurm/cgroup.conf
>>
>> ###
>> # Slurm cgroup support configuration file
>> ###
>> CgroupAutomount=yes
>> CgroupReleaseAgentDir="/etc/slurm/cgroup"
>> ConstrainCores=yes
>> TaskAffinity=yes
>> #
>>
>> It also works with "sinfo" but not with "srun".
>> i.e.:
>> [root@n6 /]# srun hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 3 queued and waiting for resources
>>
>> Could you give me any advice?
>>
>> Sumin Han
>> Undergraduate '13, School of Computing
>> Korea Advanced Institute of Science and Technology
>> Daehak-ro 291
>> Yuseong-gu, Daejeon
>> Republic of Korea 305-701
>> Tel. +82-10-2075-6911
>>
>
>
