Hello Mike, Doug,

The issue has been resolved, although we are not entirely sure how. My colleague says the addresses in slurm.conf on the login nodes were incorrect. It could also have been a temporary network issue.
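For the record, node addresses in slurm.conf are declared with NodeName/NodeAddr, and a mismatch between the login nodes' copy and the controller's copy can break step launch. The fragment below is only an illustration; the hostnames, addresses, and hardware values are placeholders, not our real configuration:

```
# Hypothetical slurm.conf fragment -- all names and addresses are placeholders.
# SlurmctldHost takes an optional address in parentheses after the hostname.
SlurmctldHost=bg-slurmb-master(10.0.0.10)
# NodeAddr must point at the address the node is actually reachable on;
# a stale entry here on the login nodes matches the symptom we saw.
NodeName=slurm-cpu-hm-7 NodeAddr=10.0.1.7 CPUs=64 RealMemory=256000 State=UNKNOWN
```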
Best,
Durai Arasan
MPI Tübingen

On Fri, Jan 21, 2022 at 2:15 PM Doug Meyer <dameye...@gmail.com> wrote:

> Hi,
>
> Did you recently add nodes? We have seen that when we add nodes past the
> TreeWidth count, the most recently added nodes lose communication
> (asterisk next to the node name in sinfo). We have to ensure that the
> TreeWidth declaration in slurm.conf matches or exceeds the number of
> nodes.
>
> Doug
>
> On Fri, Jan 21, 2022 at 4:33 AM Durai Arasan <arasan.du...@gmail.com> wrote:
>
>> Hello Mike,
>>
>> I am able to ping the nodes from the slurm master without any problem.
>> There is actually nothing of interest in slurmctld.log or slurmd.log
>> (you can trust me on this); that is why I posted here.
>>
>> Best,
>> Durai Arasan
>> MPI Tuebingen
>>
>> On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert <mrobb...@mines.edu> wrote:
>>
>>> It looks like it could be some kind of network problem, but it could
>>> also be DNS. Can you ping and do DNS resolution for the hosts involved?
>>>
>>> What does slurmctld.log say? How about slurmd.log on the node in
>>> question?
>>>
>>> Mike
>>>
>>> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Durai Arasan <arasan.du...@gmail.com>
>>> *Date:* Thursday, January 20, 2022 at 08:08
>>> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
>>> *Subject:* [External] Re: [slurm-users] srun: Communication connection failure
>>>
>>> Hello slurm users,
>>>
>>> I forgot to mention that an identical interactive job works successfully
>>> on the gpu partitions in the same cluster, so this is really puzzling.
>>>
>>> Best,
>>> Durai Arasan
>>> MPI Tuebingen
>>>
>>> On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.du...@gmail.com> wrote:
>>>
>>> Hello Slurm users,
>>>
>>> We are suddenly encountering strange errors while trying to launch
>>> interactive jobs on our cpu partitions. Have you encountered this
>>> problem before? Kindly let us know.
>>>
>>> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" \
>>>     --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short \
>>>     --mem=1G --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
>>> srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
>>> srun: error: Application launch failed: Communication connection failure
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> srun: error: Timed out waiting for job step to complete
>>>
>>> Best regards,
>>> Durai Arasan
>>> MPI Tuebingen
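Mike's suggestion in the thread, confirming that node names resolve before digging through daemon logs, can be sketched in a few lines. This is only an illustration; the cluster node name used in the usage comment is a placeholder and will only resolve inside the cluster:

```python
import socket

def check_resolution(hostnames):
    """Return the hostnames that fail DNS/hosts resolution.

    A minimal sketch of the suggested sanity check: run it on the
    controller or a login node with the node names from sinfo.
    """
    failed = []
    for name in hostnames:
        try:
            socket.gethostbyname(name)
        except socket.gaierror:
            failed.append(name)
    return failed

# "localhost" should resolve anywhere; a name under the reserved
# .invalid TLD is guaranteed not to. On the cluster you would pass
# e.g. ["slurm-cpu-hm-7"] instead.
print(check_resolution(["localhost", "no-such-host.invalid"]))
```

Any node name the function returns is one the host cannot resolve, which would line up with the "Communication connection failure" seen from srun.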