On Thu, Nov 30, 2017 at 6:32 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Ah, I was misled by the subject.
>
> Can you provide more information about "hangs", and your environment?
>
> You previously cited:
>
> - E5-2697A v4 CPUs and Mellanox ConnectX-3 FDR Infiniband
> - SLURM
> - Open MPI v3.0.0
> - IMB-MPI1
>
> Can you send the information listed here:
>
> https://www.open-mpi.org/community/help/
>
> BTW, the fact that you fixed the last error by growing the tmpdir size
> (admittedly: we should probably have a better error message here, and
> shouldn't just segv like you were seeing -- I'll open a bug on that), you can
> probably remove "--mca btl ^vader" or other similar CLI options. vader and
> sm were [probably?] failing due to the memory-mapped files on the filesystem
> running out of space and Open MPI not handling it well. Meaning: in general,
> you don't want to turn off shared memory support, because that will likely
> always be the fastest for on-node communication.

Hi Jeff,
yes, it was wrong to simply close the issue with openmpi 1.10. But now about the current problem:

I am using the packages provided by OpenHPC, so I didn't build openmpi myself and don't have config.log. The package version is openmpi3-gnu7-ohpc-3.0.0-35.1.x86_64. Attached is the output of ompi_info --all.

The FAQ entry must be outdated, as this happened:

  % ompi_info -v ompi full --parsable
  ompi_info: Error: unknown option "-v"
  Type 'ompi_info --help' for usage.

I have attached my slurm job script; it simply runs mpirun IMB-MPI1 with 1024 processes. I haven't set any MCA parameters, so vader, for instance, is enabled.

The symptom is that the program produces standard output for over 30 minutes; after that, all processes keep running at 100% CPU until they are killed by the slurm job limit (2 hours in the example).

The Infiniband network seems to be working fine. I'm using Red Hat's OFED from RHEL 7.4 (strictly speaking, the system is Scientific Linux 7.4). I am running opensm on one of the nodes.

Regards, Götz
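For a hang like the one described above, where output stops but the ranks keep spinning at 100% CPU, a small output watchdog could end the run long before the 2-hour job limit. The following is only a minimal sketch under stated assumptions: the helper name `run_with_stall_timeout` and its parameters are hypothetical, it is not part of Open MPI or Slurm, and it assumes that "no new output for a while" is a good enough proxy for a hang.

```python
import subprocess
import sys
import threading
import time


def run_with_stall_timeout(cmd, stall_seconds, poll_seconds=1.0, grace_seconds=10.0):
    """Run cmd, echo its output, and stop it if it prints nothing for stall_seconds.

    Hypothetical helper for illustration only.
    Returns (return_code, stalled); stalled is True if the watchdog fired.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_output = [time.monotonic()]  # one-element list so the thread can update it

    def pump():
        # Forward every line and remember when the last one arrived.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            sys.stdout.write(line)
            sys.stdout.flush()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    stalled = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > stall_seconds:
            stalled = True
            proc.terminate()  # polite stop first, so a launcher can clean up
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()  # force the issue if it ignores the request
            break
        time.sleep(poll_seconds)

    proc.wait()
    reader.join(timeout=5)
    return proc.returncode, stalled
```

Inside a job script this could be called as, for example, `run_with_stall_timeout(["mpirun", "IMB-MPI1"], stall_seconds=600)`; the 10-minute threshold is an arbitrary assumption and would need to exceed the longest quiet phase of a healthy run. Watching output rather than CPU load is deliberate here, since the hung processes stay at 100% CPU and CPU usage alone would not distinguish a hang from real work.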
ompi_info.txt.bz2
Description: BZip2 compressed data
slurm-mpitest-openmpi3.job
Description: Binary data
slurm-2715.out.bz2
Description: BZip2 compressed data