Re: [slurm-users] Submit job using srun fails but sbatch works

Alexander Åhman Wed, 29 May 2019 07:47:37 -0700

I have tried to find a network error but can't see anything. Every node I've tested has the same (and correct) view of things.


_On node cn7:_ (the problematic one)
em1: link/ether 50:9a:4c:79:31:4d inet 10.28.3.137/24


_On login machine:_
[alex@li1 ~]$ host cn7
cn7.ydesign.se has address 10.28.3.137
[alex@li1 ~]$ arp cn7
Address                  HWtype  HWaddress           Flags Mask            Iface
cn7.ydesign.se           ether   50:9a:4c:79:31:4d   C                     em1

_On slurmctld machine:_
[alex@cmgr1 ~]$ host cn7
cn7.ydesign.se has address 10.28.3.137
[alex@cmgr1 ~]$ arp cn7
Address                  HWtype  HWaddress           Flags Mask            Iface
cn7.ydesign.se           ether   50:9a:4c:79:31:4d   C                     em1
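
The name-resolution half of this comparison can be repeated on each host with a few lines of Python. This is only a minimal sketch: the helper name `resolve_nodes` and its injectable `lookup` parameter are made up for illustration and are not part of Slurm, and it only exercises the system resolver, not Slurm's own address lookup from slurm.conf.

```python
import socket


def resolve_nodes(names, lookup=socket.gethostbyname):
    """Map each node name to its resolved IPv4 address, or None on failure.

    `lookup` is injectable (hypothetical parameter) so the function can be
    exercised without touching real DNS.
    """
    results = {}
    for name in names:
        try:
            results[name] = lookup(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results
```

Running `resolve_nodes(["cn6", "cn7"])` on li1 and on cmgr1 and comparing the two dictionaries would show whether both machines resolve the nodes to the same addresses.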

Yes, I have seen your pages and must say that they have been pure gold on many occasions, thanks a lot Ole! But our cluster is still tiny and the whole cluster is located in its own network segment. The number of ARP entries is far from 512 (actually, more like ~30).
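
For reference, the entry count can be read from /proc/net/arp on Linux (IPv4 neighbours only). The sketch below is an illustration, not a diagnostic tool: the function names and the 80% `margin` are arbitrary assumptions, and 512 is simply the figure mentioned in this thread.

```python
def count_arp_entries(proc_net_arp_text):
    """Count neighbour entries in text formatted like /proc/net/arp."""
    lines = proc_net_arp_text.strip().splitlines()
    # The first line is the column header; each later non-empty line is one entry.
    return sum(1 for line in lines[1:] if line.strip())


def near_threshold(entries, threshold=512, margin=0.8):
    """Return True if the entry count reaches `margin` of the threshold.

    `margin` is a hypothetical safety factor chosen for this example.
    """
    return entries >= threshold * margin
```

On a node, `count_arp_entries(open("/proc/net/arp").read())` gives the current count, which can then be passed to `near_threshold`.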


I just don't understand why sbatch works but not srun?

Could this be some error in the state files perhaps? Something that maybe got corrupted when the node (cn7) unexpectedly died?


Regards,
Alexander



On 2019-05-29 at 15:12, Ole Holm Nielsen wrote:

Hi Alexander,
The error "can't find address for host cn7" would indicate a DNS problem. What is the output of "host cn7" from the srun host li1?
How many network devices are in your subnet? It may be that the Linux kernel is doing "ARP cache thrashing" if the number of devices approaches 512. What is the result of "arp cn7"?
To fix ARP cache thrashing, look at my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
Best regards,
Ole

On 5/29/19 3:00 PM, Alexander Åhman wrote:
Hi,
I have a very strange problem. The cluster had been working just fine until one node died, and now I can't submit jobs to 2 of the nodes using srun from the login machine. Using sbatch works just fine, and so does srun from the same host as slurmctld. All the other nodes work just fine as they always have; only 2 nodes are experiencing this problem. Very strange...
I have checked network connectivity and DNS, and both are OK. I can ping and ssh to all nodes just fine. All nodes are identical and use Slurm 18.08.
I also tried rebooting the 2 nodes and slurmctld, but the problem remains.

[alex@li1 ~]$ srun -w cn7 hostname
srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
srun: error: Task launch for 1088816.0 failed on node cn7: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

[alex@li1 ~]$ srun -w cn6 hostname
cn6.ydesign.se
What is this error "can't find address for host" about? I have searched the web but can't find any good information about what the problem is or what to do to resolve it.
Any kind soul out there who knows what to do next?
