Howard,

This fixed the issue with OpenMPI 3.1.0. Do you want me to try the same with 
3.1.1 as well?

S.

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA


From: users <users-boun...@lists.open-mpi.org> on behalf of Howard Pritchard 
<hpprit...@gmail.com>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Monday, July 2, 2018 at 1:34 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 3.1.0 Lock Up on POWER9 w/ 
CUDA9.2

Hi Si,

Could you add --disable-builtin-atomics

to the configure options and see if the hang goes away?
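
For reference, a minimal sketch of what that could look like, reusing the configure line quoted further down in this thread. The wrapper name configure_without_builtin_atomics and its prefix argument are hypothetical; it assumes it is run from the top of the Open MPI source tree with CUDA_ROOT set, and that the LSF paths match the original report. The only option added relative to that report is --disable-builtin-atomics.

```shell
# Hypothetical wrapper: re-run configure with the options from the original
# report plus --disable-builtin-atomics.
# $1 = install prefix (site-specific, pass your own).
# Assumes CUDA_ROOT is set in the environment.
configure_without_builtin_atomics() {
    ./configure \
        --prefix="$1" \
        --with-cuda="$CUDA_ROOT" \
        --enable-mpi-java --enable-java \
        --with-lsf=/opt/lsf/10.1 \
        --with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib \
        --with-verbs \
        --disable-builtin-atomics
}
```

After calling it with an install prefix of your choice, rebuilding and re-running "make check" should show whether opal_fifo still hangs.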

Howard


2018-07-02 8:48 GMT-06:00 Jeff Squyres (jsquyres) via users 
<users@lists.open-mpi.org>:
Simon --

You don't currently have another Open MPI installation in your PATH / 
LD_LIBRARY_PATH, do you?

I have seen dependency library loads cause "make check" to get confused, and 
instead of loading the libraries from the build tree, actually load some -- but 
not all -- of the required OMPI/ORTE/OPAL/etc. libraries from an installation 
tree.  Hilarity ensues (to include symptoms such as running forever).

Can you double check that you have no Open MPI libraries in your 
LD_LIBRARY_PATH before running "make check" on the build tree?
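
One way to run that check, as a minimal sketch. The function name ompi_dirs_in_path is hypothetical, and it assumes the Open MPI 3.x library names libopen-pal and libopen-rte. It prints every directory in a colon-separated list (LD_LIBRARY_PATH by default) that contains one of those libraries; no output means none were found.

```shell
# Hypothetical helper: print each directory in a colon-separated list
# ($1, defaulting to $LD_LIBRARY_PATH) that holds Open MPI support libraries.
# The body runs in a subshell so the IFS and "set" changes do not leak out.
ompi_dirs_in_path() (
    list=${1-${LD_LIBRARY_PATH-}}
    set -f              # no globbing while the list is being split
    IFS=:
    set -- $list        # split on ':' into positional parameters
    set +f
    for dir in "$@"; do
        [ -n "$dir" ] || continue
        for lib in "$dir"/libopen-pal.so* "$dir"/libopen-rte.so*; do
            if [ -e "$lib" ]; then
                printf '%s\n' "$dir"
                break
            fi
        done
    done
)
```

Calling ompi_dirs_in_path with no argument before "make check" should print nothing in a clean environment; any directory it does print is a candidate for the mixed-library problem described above.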



> On Jun 30, 2018, at 3:18 PM, Hammond, Simon David via users 
> <users@lists.open-mpi.org> wrote:
>
> Nathan,
>
> Same issue with OpenMPI 3.1.1 on POWER9 with GCC 7.2.0 and CUDA9.2.
>
> S.
>
> --
> Si Hammond
> Scalable Computer Architectures
> Sandia National Laboratories, NM, USA
> [Sent from remote connection, excuse typos]
>
>
> On 6/16/18, 10:10 PM, "Nathan Hjelm" <hje...@me.com> 
> wrote:
>
>    Try the latest nightly tarball for v3.1.x. Should be fixed.
>
>> On Jun 16, 2018, at 5:48 PM, Hammond, Simon David via users 
>> <users@lists.open-mpi.org> wrote:
>>
>> The output from the test in question is:
>>
>> Single thread test. Time: 0 s 10182 us 10 nsec/poppush
>> Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush
>> <then runs forever>
>>
>> S.
>>
>> --
>> Si Hammond
>> Scalable Computer Architectures
>> Sandia National Laboratories, NM, USA
>> [Sent from remote connection, excuse typos]
>>
>>
>> On 6/16/18, 5:45 PM, "Hammond, Simon David" 
>> <sdha...@sandia.gov> wrote:
>>
>>   Hi OpenMPI Team,
>>
>>   We have recently updated an install of OpenMPI on a POWER9 system 
>> (configuration details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. 
>> We seem to have a symptom where code that ran before is now locking up and 
>> making no progress, getting stuck in wait-all operations. While I think it's 
>> prudent for us to root cause this a little more, I have gone back and 
>> rebuilt MPI and re-run the "make check" tests. The opal_fifo test appears to 
>> hang forever. I am not sure if this is the cause of our issue but wanted to 
>> report that we are seeing this on our system.
>>
>>   OpenMPI 3.1.0 Configuration:
>>
>>   ./configure 
>> --prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88
>>  --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java 
>> --with-lsf=/opt/lsf/10.1 
>> --with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib --with-verbs
>>
>>   GCC versions are 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA for 
>> POWER9 (standard download from their website). We enable IBM's JDK 8.0.0.
>>   RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo)
>>
>>   Output:
>>
>>   make[3]: Entering directory 
>> `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
>>   make[4]: Entering directory 
>> `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
>>   PASS: ompi_rb_tree
>>   PASS: opal_bitmap
>>   PASS: opal_hash_table
>>   PASS: opal_proc_table
>>   PASS: opal_tree
>>   PASS: opal_list
>>   PASS: opal_value_array
>>   PASS: opal_pointer_array
>>   PASS: opal_lifo
>>   <runs forever>
>>
>>   Output from Top:
>>
>>   20   0   73280   4224   2560 S 800.0  0.0  17:22.94 lt-opal_fifo
>>
>>   --
>>   Si Hammond
>>   Scalable Computer Architectures
>>   Sandia National Laboratories, NM, USA
>>   [Sent from remote connection, excuse typos]
>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>
>

--
Jeff Squyres
jsquy...@cisco.com

