Hi,

Following is the requested info from https://www.open-mpi.org/community/help/ :
Version: v4.1
config.log: n/a
ompi_info --all: https://pastebin.com/69QKAbi5

$ bsub -q normal -o bsub.o -e bsub.e -R"span[ptile=1]" -n16 mpirun --bynode --hostfile my_hostfile.txt --tag-output ompi_info -v ompi full --parsable

bsub.e: https://pastebin.com/TfWCVMbj
bsub.o: https://pastebin.com/F5aExH7T

[Since that didn't work, I created a hosts file with the hosts that the above job happened to tile across, and ran:]

for h in `cat my_hostfile.txt`
> do
> echo === $h
> ssh $h ompi_info -v ompi full --parsable
> done

[These all fail with: ompi_info: Error: unknown option "-v". It doesn't work with just the -v removed either.]

$ ifconfig on the node I run bsub on: https://pastebin.com/MXUw7pEu

On the node that the above mpirun ran on, ifconfig isn't available, so I ran $ ip a instead: https://pastebin.com/JnyBfUAv

# description of failure

We had an MPI app running under LSF that worked fine tiled across 64 hosts. Since moving to a new platform (still LSF, but inside an OpenStack cluster*), the app is unreliable when tiled across more than 2 hosts. The likelihood of failure increases with host count, until at 16 hosts it almost never works (though it still can). It always works when using 16 cores of a single host.

The symptoms of failure are that our app doesn't really start up (it logs nothing), the -output-filename output directories and files don't get created, and it kills itself after 5 minutes of apparently doing nothing. (When it works, the -output-filename files are created almost immediately and the app produces output.)

To rule out this being an issue with our app, I get these same symptoms running:

$ bsub -q normal -o bsub.o -e bsub.e -R"span[ptile=1]" -n16 mpirun true

bsub.o: https://pastebin.com/RS42CJru
bsub.e: empty

It works fine (turnaround time 4 secs) with -n2, but not when submitted while the -n16 job is running and the two jobs end up sharing hosts. It's worth noting that I get the exact same symptoms using mpich-3.4.1 as well, so this is not an Open MPI-specific issue.
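As a further crude check from the submission node, I sketched the helper below (probe_hosts is just my own function; it assumes bash's /dev/tcp redirection is available and that sshd on port 22 is a reasonable proxy for basic reachability, so it can't prove the ephemeral ports MPI needs are actually open):

```shell
# Probe TCP port 22 on every host listed (one per line) in a hostfile,
# as a crude sanity check that the nodes can reach each other at all.
# Uses bash's /dev/tcp pseudo-device; requires bash and coreutils timeout.
probe_hosts() {
  local hostfile=$1 h
  while read -r h; do
    # Attempt a TCP connection to port 22 with a 5-second cap.
    if timeout 5 bash -c "exec 3<>/dev/tcp/$h/22" 2>/dev/null; then
      echo "$h: tcp/22 reachable"
    else
      echo "$h: tcp/22 NOT reachable"
    fi
  done < "$hostfile"
}
```

e.g. probe_hosts my_hostfile.txt. I also wondered about restricting Open MPI to a single interface (e.g. mpirun --mca btl_tcp_if_include <iface>) in case the SDN interfaces confuse interface selection, but I'm not sure which interface names would apply here.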
mpich's bsub.e file, and some strace output from the stuck processes, suggest socket write issues. What can I do to investigate further or try to resolve this, so that it works reliably across 16 (and ideally 64) hosts again?

[*] I know very little about networking, but I'm told the LSF cluster has new nodes that use a software-defined network, as well as old nodes that use a hardware network like our old system's; limiting the job to the old nodes doesn't help. But maybe there are some subtleties here I've overlooked. The old nodes have names bc*, like the ones I happened to run on in some of the pastebins above.

Cheers,
Sendu

--
The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.