Hello Mike,

I am able to ping the nodes from the Slurm master without any problem. There is nothing interesting in slurmctld.log or slurmd.log; you can trust me on this. That is why I posted here.

Best,
Durai Arasan
MPI Tuebingen

On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert <mrobb...@mines.edu> wrote:

> It looks like it could be some kind of network problem but could be DNS.
> Can you ping and do DNS resolution for the host involved?
>
> What does slurmctld.log say? How about slurmd.log on the node in question?
>
> Mike
>
> *From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Durai Arasan <arasan.du...@gmail.com>
> *Date: *Thursday, January 20, 2022 at 08:08
> *To: *Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject: *[External] Re: [slurm-users] srun : Communication connection failure
>
> Hello slurm users,
>
> I forgot to mention that an identical interactive job works successfully
> on the gpu partitions (in the same cluster). So this is really puzzling.
>
> Best,
> Durai Arasan
> MPI Tuebingen
>
> On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.du...@gmail.com> wrote:
>
> Hello Slurm users,
>
> We are suddenly encountering strange errors while trying to launch
> interactive jobs on our cpu partitions. Have you encountered this problem
> before? Kindly let us know.
>
> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
> srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> Best regards,
> Durai Arasan
> MPI Tuebingen
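
For anyone following the thread, the DNS and connectivity checks Mike asks about can be roughly sketched in Python. This is only an illustration, not something from the original posts. The helper name `check_host` is hypothetical, and the port 6818 in the usage line assumes the cluster keeps Slurm's default `SlurmdPort`; substitute whatever your slurm.conf sets. A successful ping does not prove that a TCP port is reachable, so it may be worth testing both, and from both the login node and the compute node.

```python
import socket


def check_host(hostname, port, timeout=3.0):
    """Resolve `hostname`, then try to open a TCP connection to `port`.

    Returns a dict with the resolved addresses, a reachable flag,
    and an error message if either step failed.
    """
    result = {
        "hostname": hostname,
        "addresses": [],
        "reachable": False,
        "error": None,
    }

    # Step 1: name resolution (uses the system resolver, like getent hosts).
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result
    result["addresses"] = sorted({info[4][0] for info in infos})

    # Step 2: TCP connectivity to the given port.
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            result["reachable"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result


if __name__ == "__main__":
    # Node name taken from the srun command in the thread;
    # 6818 is an assumption (Slurm's default SlurmdPort).
    print(check_host("slurm-cpu-hm-7", 6818))
```

If the name resolves and the port answers from the controller but the interactive job still fails, running the same check in the other direction (from the compute node towards the login node) would be a reasonable next step.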