That was my first thought too, but... no. Both /etc/hosts (not used) and slurm.conf are identical on all nodes, both working and non-working nodes.
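
One way to double-check the "identical on all nodes" claim is to compare checksums instead of reading the files by eye. Below is a minimal sketch, assuming each node's slurm.conf has first been copied to a local file. The function name `all_same_checksum` and the file names in the usage line are hypothetical, and `sha256sum` from GNU coreutils is assumed to be available.

```shell
# Succeeds (exit 0) only if every file given as an argument has the same
# SHA-256 checksum. Returns 2 if any file is missing or unreadable.
all_same_checksum() {
    for f in "$@"; do
        [ -r "$f" ] || return 2
    done
    # sha256sum prints "<hash>  <file>"; keep the hash and count distinct values
    distinct=$(sha256sum -- "$@" | awk '{print $1}' | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}
```

Usage, with hypothetical file names: `all_same_checksum slurm.conf.li1 slurm.conf.cn6 slurm.conf.cn7 && echo identical`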

_From login machine:_
[alex@li1 ~]$ srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118071 queued and waiting for resources
srun: job 1118071 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
srun: error: Task launch for 1118071.0 failed on node cn7: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

_From slurmctld machine:_
[root@cmgr1 ~]# srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118076 queued and waiting for resources
srun: job 1118076 has been allocated resources
PING cn7.ydesign.se (10.28.3.137) 56(84) bytes of data.
64 bytes from cn7.ydesign.se (10.28.3.137): icmp_seq=1 ttl=64 time=0.012 ms

--- cn7.ydesign.se ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.012/0.012/0.012/0.000 ms
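
Since the same command works from cmgr1 but fails from li1, it may help to compare what each machine believes cn7's address is. A rough sketch, not a definitive diagnosis: `scontrol show node` prints `NodeAddr=` and `NodeHostName=` fields, and `getent hosts` shows what the local resolver returns. The helper name `node_addr` is made up for illustration.

```shell
# Print the NodeAddr value found in `scontrol show node <name>` output
# read from stdin. Prints nothing if no NodeAddr= field is present.
node_addr() {
    sed -n 's/.*NodeAddr=\([^ ]*\).*/\1/p' | head -n 1
}

# Usage on a machine with Slurm installed (run on both li1 and cmgr1):
#   scontrol show node cn7 | node_addr
#   getent hosts cn7
```

If the two machines report different values, that would point at name resolution or configuration on the login node rather than at cn7 itself.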


I guess that some state file somewhere got corrupted. I think the next step is to find and reset the relevant state file and try again, or, if that fails, clean it with fire! ;-)

Regards,
Alexander Åhman



On 2019-05-29 at 19:23, Alex Chekholko wrote:
I think this error usually means that node cn7 has either the wrong /etc/hosts or the wrong /etc/slurm/slurm.conf.

E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'

On Wed, May 29, 2019 at 6:00 AM Alexander Åhman <alexan...@ydesign.se> wrote:

    Hi,
    I have a very strange problem. The cluster had been working just fine
    until one node died, and now I can't submit jobs to 2 of the nodes
    using srun from the login machine. Using sbatch works just fine, and so
    does srun from the same host as slurmctld.
    All the other nodes work just fine, as they always have; only 2 nodes
    are experiencing this problem. Very strange...

    I have checked network connectivity and DNS, and both are OK. I can ping
    and ssh to all nodes just fine. All nodes are identical and run
    Slurm 18.08.
    I also tried rebooting the 2 nodes and slurmctld, but the problem
    remains.

    [alex@li1 ~]$ srun -w cn7 hostname
    srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
    srun: error: Task launch for 1088816.0 failed on node cn7: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    srun: error: Timed out waiting for job step to complete

    [alex@li1 ~]$ srun -w cn6 hostname
    cn6.ydesign.se

    What is this error "can't find address for host" about? I have searched
    the web but can't find any good information about what the problem is
    or what to do to resolve it.

    Any kind soul out there who knows what to do next?

    Regards,
    Alexander Åhman


