On Thu, Nov 30, 2017 at 6:32 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> Ah, I was misled by the subject.
>
> Can you provide more information about "hangs", and your environment?
>
> You previously cited:
>
> - E5-2697A v4 CPUs and Mellanox ConnectX-3 FDR Infiniband
> - SLURM
> - Open MPI v3.0.0
> - IMB-MPI1
>
> Can you send the information listed here:
>
>     https://www.open-mpi.org/community/help/
>
> BTW, since you fixed the last error by growing the tmpdir size
> (admittedly: we should probably have a better error message here, and
> shouldn't just segv like you were seeing -- I'll open a bug on that),
> you can probably remove "--mca btl ^vader" and similar CLI options.
> vader and sm were [probably?] failing because the memory-mapped files
> on the filesystem ran out of space and Open MPI didn't handle that
> well.  Meaning: in general, you don't want to turn off shared memory
> support, because it will likely always be the fastest transport for
> on-node communication.
Hi Jeff,

Yes, it was wrong of me to simply close the earlier issue against Open
MPI 1.10. Now on to the current problem:

I am using the packages provided by OpenHPC, so I didn't build Open
MPI myself and don't have a config.log. The package version is
openmpi3-gnu7-ohpc-3.0.0-35.1.x86_64.
The output of ompi_info --all is attached.
The FAQ entry for gathering version information must be outdated,
because this happens:
% ompi_info -v ompi full --parsable
ompi_info: Error: unknown option "-v"
Type 'ompi_info --help' for usage.
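Combining the two long options that the tool does accept gives me
parsable output; whether this is the intended replacement for the FAQ
command is my guess:

% ompi_info --all --parsable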

I have attached my SLURM job script; it simply runs mpirun IMB-MPI1
with 1024 processes. I haven't set any MCA parameters, so, for
instance, vader is enabled.
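For readers who don't open the attachment, the script boils down to
something like this sketch (the SBATCH headers and module names are my
paraphrase, not a verbatim copy of the attached file):

#!/bin/bash
#SBATCH --job-name=mpitest-openmpi3
#SBATCH --ntasks=1024        # 1024 MPI ranks, as in the real run
#SBATCH --time=02:00:00      # the 2-hour limit mentioned below

# Module names follow the OpenHPC naming; treat them as approximate.
module load gnu7 openmpi3 imb

# No MCA parameters are set, so vader is enabled by default.
mpirun IMB-MPI1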

The effect of the bug is that the program produces standard output for
over 30 minutes, then stops making progress: all processes keep
spinning at 100% CPU until they are killed by the SLURM time limit
(2 hours in this example).
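If it helps, I can grab stack traces from the spinning ranks the next
time it hangs, roughly like this (a sketch; it assumes gdb is
available on the compute nodes, and <pid> stands for one of the
IMB-MPI1 PIDs):

% pgrep IMB-MPI1                                  # find the rank PIDs
% gdb -batch -p <pid> -ex 'thread apply all bt'   # dump all stacks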

The InfiniBand network itself seems to be working fine. I'm using Red
Hat's OFED from RHEL 7.4 (strictly speaking, Scientific Linux 7.4),
and I am running opensm on one of the nodes.
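To be clear, "working fine" is based on quick checks with the standard
OFED tools, along these lines (output omitted):

% ibstat          # port State: Active; Rate: 56 for FDR links
% ibv_devinfo     # ConnectX-3 visible; port state PORT_ACTIVE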


Regards, Götz

Attachments:
- ompi_info.txt.bz2 (bzip2-compressed output of ompi_info --all)
- slurm-mpitest-openmpi3.job (SLURM job script)
- slurm-2715.out.bz2 (bzip2-compressed job output)
