Sorry, here are the relevant lines from my previous slurm.conf:

ProctrackType=proctrack/pgid
#TaskPlugin=task/cgroup
JobacctGatherType=jobacct_gather/linux
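(As far as I understand, these three lines are normally switched to the cgroup plugins together, with cgroup.conf present on every compute node, i.e. roughly:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobacctGatherType=jobacct_gather/cgroup

so below I changed them one line at a time to see which change breaks things.)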
And these are the results when I change each line:

1) ProctrackType=proctrack/pgid --> ProctrackType=proctrack/cgroup

[root@n6 /]# srun -N5 hostname
srun: error: Task launch for 2.0 failed on node c1: Communication connection failure
srun: error: Task launch for 2.0 failed on node c4: Communication connection failure
srun: error: Task launch for 2.0 failed on node c2: Communication connection failure
srun: error: Task launch for 2.0 failed on node c5: Communication connection failure
srun: error: Task launch for 2.0 failed on node c3: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
[root@n6 /]# si
PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK  TIMELIMIT  AVAIL_FEATURES  NODELIST
debug*     6      5/1/0/6         1:4:2  7785    113264    infinite   (null)          c[1-6]

2) #TaskPlugin=task/cgroup --> TaskPlugin=task/cgroup (comment removed)

Works fine:

[root@n6 /]# srun -N5 hostname
n4
n5
n3
n1
n2

3) JobacctGatherType=jobacct_gather/linux --> JobacctGatherType=jobacct_gather/cgroup

The nodes go down by themselves:

[root@n6 /]# si
PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK  TIMELIMIT  AVAIL_FEATURES  NODELIST
debug*     6      0/6/0/6         1:4:2  7785    113264    infinite   (null)          c[1-6]

and a moment later:

[root@n6 /]# si
PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK  TIMELIMIT  AVAIL_FEATURES  NODELIST
debug*     6      0/0/6/6         1:4:2  7785    113264    infinite   (null)          c[1-6]

Which point should I check first? (I have added the checks I plan to run myself as a P.S. below the quoted mails.)

Sumin Han
Undergraduate '13, School of Computing
Korea Advanced Institute of Science and Technology
Daehak-ro 291
Yuseong-gu, Daejeon
Republic of Korea 305-701
Tel. +82-10-2075-6911

2017-08-02 12:01 GMT+09:00 Lachlan Musicman <[email protected]>:

> Sumin,
>
> The error message is saying that the node is down.
>
> When you say "works with sinfo", you need to show us what that means -
> sinfo is a command that interrogates the state of nodes, whereas srun
> sends commands *to* nodes. So sinfo is meant to work even if the nodes
> are down - it is the software that will tell you what state the nodes
> are in.
>
> What is the output of sinfo?
>
> cheers
> L.
>
> ------
> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
> civics is the insistence that we cannot ignore the truth, nor should we
> panic about it. It is a shared consciousness that our institutions have
> failed and our ecosystem is collapsing, yet we are still here — and we
> are creative agents who can shape our destinies. Apocalyptic civics is
> the conviction that the only way out is through, and the only way
> through is together."
>
> *Greg Bloom* @greggish
> https://twitter.com/greggish/status/873177525903609857
>
> On 2 August 2017 at 12:19, 한수민 <[email protected]> wrote:
>
>> I succeeded in setting up the basic environment to use Slurm, and I
>> have added these lines to each file.
>>
>> /etc/slurm/slurm.conf:
>>
>> ProctrackType=proctrack/cgroup
>> TaskPlugin=task/cgroup
>> JobacctGatherType=jobacct_gather/cgroup
>>
>> /etc/slurm/cgroup.conf:
>>
>> ###
>> # Slurm cgroup support configuration file
>> ###
>> CgroupAutomount=yes
>> CgroupReleaseAgentDir="/etc/slurm/cgroup"
>> ConstrainCores=yes
>> TaskAffinity=yes
>> #
>>
>> It also works with "sinfo" but not with "srun", i.e.:
>>
>> [root@n6 /]# srun hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 3 queued and waiting for resources
>>
>> Could you give me any advice?
>>
>> Sumin Han
>> Undergraduate '13, School of Computing
>> Korea Advanced Institute of Science and Technology
>> Daehak-ro 291
>> Yuseong-gu, Daejeon
>> Republic of Korea 305-701
>> Tel. +82-10-2075-6911
>>
>
>
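P.S. Unless someone suggests a better starting point, my own plan is to look at the failing compute nodes first, roughly along these lines (the exact log path depends on SlurmdLogFile in our slurm.conf):

[root@n6 /]# scontrol show node c1 | grep -i reason   # why slurmctld set the node down
[root@c1 ~]# tail -n 50 /var/log/slurmd.log           # slurmd's own log on the compute node
[root@c1 ~]# slurmd -D -vvv                           # or run slurmd in the foreground with verbose logging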
