This looks like a network problem, possibly DNS. Can you ping the host
involved, and does its name resolve correctly?
What does slurmctld.log say? And what about slurmd.log on the node in question?
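
In case it helps, here is a rough sketch of those two checks, to be run from the login node. It is only an illustration, not a Slurm tool: the helper names `resolve` and `can_connect` are made up, it does a TCP connect rather than an ICMP ping, and it assumes slurmd listens on Slurm's default SlurmdPort of 6818 (adjust if your slurm.conf sets something else).

```python
import socket

# Slurm's default SlurmdPort. Assumption: slurm.conf does not override it.
SLURMD_PORT = 6818


def resolve(host):
    """Return the sorted unique addresses the local resolver gives for host ([] on failure)."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(host, port=SLURMD_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    node = "slurm-cpu-hm-7"  # node name taken from the srun error in this thread
    addrs = resolve(node)
    print(f"{node} resolves to: {addrs or 'nothing (DNS or /etc/hosts problem?)'}")
    if addrs:
        print(f"slurmd port {SLURMD_PORT} reachable: {can_connect(node)}")
```

If the name resolves but the port is not reachable, that would point toward a firewall or a slurmd that is not running, rather than DNS. Note that srun also needs the compute node to connect back to the login node, which this sketch does not test.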

Mike

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Durai 
Arasan <arasan.du...@gmail.com>
Date: Thursday, January 20, 2022 at 08:08
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [External] Re: [slurm-users] srun : Communication connection failure

Hello slurm users,

I forgot to mention that an identical interactive job works successfully on the 
gpu partitions (in the same cluster). So this is really puzzling.

Best,
Durai Arasan
MPI Tuebingen

On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.du...@gmail.com> wrote:
Hello Slurm users,

We are suddenly encountering strange errors while trying to launch interactive 
jobs on our cpu partitions. Have you encountered this problem before? Kindly 
let us know.

[darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Best regards,
Durai Arasan
MPI Tuebingen
