Always hate those odd problems. Glad you are up!

Doug

On Tue, Jan 25, 2022, 6:43 AM Durai Arasan <arasan.du...@gmail.com> wrote:
> Hello Mike, Doug,
>
> The issue was resolved somehow. My colleague says the addresses in
> slurm.conf on the login nodes were incorrect. It could also have been a
> temporary network issue.
>
> Best,
> Durai Arasan
> MPI Tübingen
>
> On Fri, Jan 21, 2022 at 2:15 PM Doug Meyer <dameye...@gmail.com> wrote:
>
>> Hi,
>>
>> Did you recently add nodes? We have seen that when we add nodes past the
>> TreeWidth count, the most recently added nodes lose communication
>> (asterisk next to the node name in sinfo). We have to ensure the TreeWidth
>> declaration in slurm.conf matches or exceeds the number of nodes.
>>
>> Doug
>>
>> On Fri, Jan 21, 2022 at 4:33 AM Durai Arasan <arasan.du...@gmail.com>
>> wrote:
>>
>>> Hello Mike,
>>>
>>> I am able to ping the nodes from the Slurm master without any problem.
>>> There is actually nothing interesting in slurmctld.log or slurmd.log. You
>>> can trust me on this. That is why I posted here.
>>>
>>> Best,
>>> Durai Arasan
>>> MPI Tuebingen
>>>
>>> On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert <mrobb...@mines.edu>
>>> wrote:
>>>
>>>> It looks like it could be some kind of network problem, but it could
>>>> also be DNS. Can you ping and do DNS resolution for the host involved?
>>>>
>>>> What does slurmctld.log say? How about slurmd.log on the node in
>>>> question?
>>>>
>>>> Mike
>>>>
>>>> *From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf
>>>> of Durai Arasan <arasan.du...@gmail.com>
>>>> *Date: *Thursday, January 20, 2022 at 08:08
>>>> *To: *Slurm User Community List <slurm-users@lists.schedmd.com>
>>>> *Subject: *[External] Re: [slurm-users] srun : Communication
>>>> connection failure
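[Editor's note: Doug's TreeWidth suggestion above can be verified with standard Slurm commands; a minimal sketch, where the TreeWidth value shown is only an example and must be adapted to the actual node count:]

```
# Show the current fanout of the controller-to-slurmd message tree
scontrol show config | grep -i TreeWidth

# Count the configured nodes to compare against TreeWidth
sinfo -h -o "%n" | sort -u | wc -l

# In slurm.conf, set TreeWidth to at least the node count
# (the value below is an example, not a recommendation):
#   TreeWidth=120

# Then push the change out without a restart
scontrol reconfigure
```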
>>>> Hello slurm users,
>>>>
>>>> I forgot to mention that an identical interactive job works
>>>> successfully on the gpu partitions (in the same cluster). So this is
>>>> really puzzling.
>>>>
>>>> Best,
>>>> Durai Arasan
>>>> MPI Tuebingen
>>>>
>>>> On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.du...@gmail.com>
>>>> wrote:
>>>>
>>>> Hello Slurm users,
>>>>
>>>> We are suddenly encountering strange errors while trying to launch
>>>> interactive jobs on our cpu partitions. Have you encountered this
>>>> problem before? Kindly let us know.
>>>>
>>>> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231"
>>>> --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G
>>>> --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
>>>> srun: error: Task launch for StepId=1137134.0 failed on node
>>>> slurm-cpu-hm-7: Communication connection failure
>>>> srun: error: Application launch failed: Communication connection failure
>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>> srun: error: Timed out waiting for job step to complete
>>>>
>>>> Best regards,
>>>>
>>>> Durai Arasan
>>>>
>>>> MPI Tuebingen
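[Editor's note: Mike's ping/DNS suggestion from the thread can be scripted so it is easy to run from both the login node and the controller; a minimal sketch in Python, where the node name checked mirrors the failing node from the srun call and will only resolve inside that cluster:]

```python
import socket

def node_resolves(hostname: str) -> bool:
    """Return True if hostname resolves via DNS or /etc/hosts."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    # The node name from the failing srun call; it resolves only
    # inside the cluster in question.
    for node in ["slurm-cpu-hm-7"]:
        status = "resolves" if node_resolves(node) else "DOES NOT resolve"
        print(f"{node} {status}")
```

Running this from the controller and from the compute node itself helps distinguish a one-sided resolver problem (e.g. a stale /etc/hosts on the login nodes, as turned out to be the case here) from a general network failure.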