Hi,

Following the info requested at https://www.open-mpi.org/community/help/ :

Version: v4.1
config.log: n/a
ompi_info --all: https://pastebin.com/69QKAbi5

$ bsub -q normal -o bsub.o -e bsub.e -R"span[ptile=1]" -n16 mpirun --bynode \
    --hostfile my_hostfile.txt --tag-output ompi_info -v ompi full --parsable
bsub.e: https://pastebin.com/TfWCVMbj
bsub.o: https://pastebin.com/F5aExH7T

[Since that didn’t work, I created a hosts file listing the hosts that the
above job happened to tile across, and ran ompi_info on each directly:]

for h in `cat my_hostfile.txt`
> do
> echo === $h
> ssh $h ompi_info -v ompi full --parsable
> done

[All of these fail with ompi_info: Error: unknown option "-v", and removing
just the -v doesn’t help either.]

$ ifconfig
on the node I run bsub on: https://pastebin.com/MXUw7pEu

$ ip a
on the node that the above mpirun ran on, where ifconfig isn’t available:
https://pastebin.com/JnyBfUAv


# description of failure

We had an mpi app running under LSF that worked fine tiled across 64 hosts.

Since moving to a new platform (LSF, but inside an OpenStack cluster*), the app 
is unreliable when tiled across more than 2 hosts.
The likelihood of failure increases with host count: tiled across 16 hosts it
almost never works (though occasionally it still does). It always works when
using 16 cores of a single host.

The symptoms of failure are that our app doesn’t really start up (it logs
nothing), the -output-filename output directories and files don’t get created,
and it kills itself after 5 minutes of apparently doing nothing. (When it
works, the -output-filename files are created almost immediately and the app
produces output.)

To rule out an issue with our app itself, I reproduced the same symptoms by
running:

$ bsub -q normal -o bsub.o -e bsub.e -R"span[ptile=1]" -n16 mpirun true
bsub.o: https://pastebin.com/RS42CJru
bsub.e: empty

It works fine (turnaround time 4 seconds) with -n2, but not when the -n2 job
is submitted while the -n16 job is running and the two end up sharing hosts.

It’s worth noting that I get the exact same symptoms with mpich-3.4.1 as
well, so this is not an Open MPI-specific issue. mpich’s bsub.e file, and
some strace output from the stalled processes, suggest socket write issues.
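Given the socket write hint, one thing I can try from the launch node is a
basic per-host TCP reachability check; a minimal sketch is below (the host
and port are placeholders; in practice I would loop over my_hostfile.txt and
whatever port(s) the MPI runtime is expected to use):

```shell
# Probe whether a host accepts TCP connections on a given port.
# Host and port below are placeholders; substitute hosts from
# my_hostfile.txt and the relevant port range.
check_tcp() {
    host=$1
    port=$2
    # /dev/tcp is a bash feature, hence the explicit bash -c
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port reachable"
    else
        echo "$host:$port UNREACHABLE"
    fi
}

check_tcp localhost 22
```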

What can I do to investigate further or try to resolve this, so it works 
reliably with 16 or ideally 64 hosts again?


[*] I know very little about networking, but I’m told we have new nodes in the 
LSF cluster that use a software-defined network, but also old nodes that use a 
hardware network like our old system; limiting to the old nodes doesn’t help. 
But maybe there are some subtleties here I’ve overlooked. The old nodes have 
names bc*, like the ones I happened to run on in some of the above pastebins.

Cheers,
Sendu



-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.
