Always hate those odd problems. Glad you are up!

Doug

On Tue, Jan 25, 2022, 6:43 AM Durai Arasan <arasan.du...@gmail.com> wrote:
> Hello Mike, Doug,
>
> The issue was resolved somehow. My colleague says the addresses in
> slurm.conf on the login nodes were incorrect. It could also have been a
> temporary network issue.
>
> Best,
> Durai Arasan
> MPI Tübingen
>
> On Fri, Jan 21, 2022 at 2:15 PM Doug Meyer <dameye...@gmail.com> wrote:
>
>> Hi,
>>
>> Did you recently add nodes? We have seen that when we add nodes past the
>> TreeWidth count, the most recently added nodes lose communication
>> (asterisk next to the node name in sinfo). We have to ensure the TreeWidth
>> declaration in slurm.conf matches or exceeds the number of nodes.
>>
>> Doug
>>
>> On Fri, Jan 21, 2022 at 4:33 AM Durai Arasan <arasan.du...@gmail.com>
>> wrote:
>>
>>> Hello Mike,
>>>
>>> I am able to ping the nodes from the Slurm master without any problem.
>>> There is actually nothing interesting in slurmctld.log or slurmd.log. You
>>> can trust me on this. That is why I posted here.
>>>
>>> Best,
>>> Durai Arasan
>>> MPI Tuebingen
>>>
>>> On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert <mrobb...@mines.edu>
>>> wrote:
>>>
>>>> It looks like it could be some kind of network problem, but it could
>>>> also be DNS. Can you ping and do DNS resolution for the host involved?
>>>>
>>>> What does slurmctld.log say? How about slurmd.log on the node in
>>>> question?
>>>>
>>>> Mike
>>>>
>>>> *From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf
>>>> of Durai Arasan <arasan.du...@gmail.com>
>>>> *Date: *Thursday, January 20, 2022 at 08:08
>>>> *To: *Slurm User Community List <slurm-users@lists.schedmd.com>
>>>> *Subject: *[External] Re: [slurm-users] srun : Communication
>>>> connection failure
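[Editor's note: Doug's TreeWidth suggestion above can be verified with standard Slurm commands; a minimal sketch, where the TreeWidth value shown is only an example and must be adapted to the actual node count:]

```
# Show the current fanout of the controller-to-slurmd message tree
scontrol show config | grep -i TreeWidth

# Count the configured nodes to compare against TreeWidth
sinfo -h -o "%n" | sort -u | wc -l

# In slurm.conf, set TreeWidth to at least the node count
# (the value below is an example, not a recommendation):
#   TreeWidth=120

# Then push the change out without a restart
scontrol reconfigure
```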
>>>> Hello slurm users,
>>>>
>>>> I forgot to mention that an identical interactive job works
>>>> successfully on the gpu partitions (in the same cluster). So this is
>>>> really puzzling.
>>>>
>>>> Best,
>>>> Durai Arasan
>>>> MPI Tuebingen
>>>>
>>>> On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <arasan.du...@gmail.com>
>>>> wrote:
>>>>
>>>> Hello Slurm users,
>>>>
>>>> We are suddenly encountering strange errors while trying to launch
>>>> interactive jobs on our cpu partitions. Have you encountered this
>>>> problem before? Kindly let us know.
>>>>
>>>> [darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231"
>>>> --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G
>>>> --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
>>>> srun: error: Task launch for StepId=1137134.0 failed on node
>>>> slurm-cpu-hm-7: Communication connection failure
>>>> srun: error: Application launch failed: Communication connection failure
>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>> srun: error: Timed out waiting for job step to complete
>>>>
>>>> Best regards,
>>>>
>>>> Durai Arasan
>>>>
>>>> MPI Tuebingen
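[Editor's note: Mike's ping/DNS suggestion from the thread can be scripted so it is easy to run from both the login node and the controller; a minimal sketch in Python, where the node name checked mirrors the failing node from the srun call and will only resolve inside that cluster:]

```python
import socket

def node_resolves(hostname: str) -> bool:
    """Return True if hostname resolves via DNS or /etc/hosts."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    # The node name from the failing srun call; it resolves only
    # inside the cluster in question.
    for node in ["slurm-cpu-hm-7"]:
        status = "resolves" if node_resolves(node) else "DOES NOT resolve"
        print(f"{node} {status}")
```

Running this from the controller and from the compute node itself helps distinguish a one-sided resolver problem (e.g. a stale /etc/hosts on the login nodes, as turned out to be the case here) from a general network failure.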