[OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2

2018-06-16 Thread Hammond, Simon David via users
Hi OpenMPI Team,

We have recently updated an install of OpenMPI on POWER9 system (configuration 
details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. We seem to have a 
symptom where code that ran before is now locking up and making no progress, 
getting stuck in wait-all operations. While I think it's prudent for us to root 
cause this a little more, I have gone back and rebuilt MPI and re-run the "make 
check" tests. The opal_fifo test appears to hang forever. I am not sure if this 
is the cause of our issue but wanted to report that we are seeing this on our 
system.

OpenMPI 3.1.0 Configuration:

./configure 
--prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88
 --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java 
--with-lsf=/opt/lsf/10.1 
--with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib --with-verbs

GCC version is 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA for 
POWER9 (standard download from their website). Java support is built against IBM's JDK 8.0.0.
OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)

Output:

make[3]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
make[4]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo


Output from top:

 PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 20   0   73280   4224   2560 S 800.0  0.0  17:22.94 lt-opal_fifo
 
-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2

2018-06-16 Thread Hammond, Simon David via users
The output from the test in question is:

Single thread test. Time: 0 s 10182 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush
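For reference, the "nsec/poppush" figure is just elapsed time divided by the number of push/pop pairs; both reported lines are consistent with roughly one million pairs (an inference from the numbers above, not from the opal_fifo source):

```python
# Reproduce the reported ns/poppush figures from the elapsed times.
# Assumption: the test performs ~1e6 push/pop pairs (inferred from
# the output above, not read from opal_fifo.c).

def ns_per_poppush(total_us: int, n_pairs: int) -> float:
    """Average nanoseconds per push/pop pair."""
    return total_us * 1000.0 / n_pairs

print(round(ns_per_poppush(10182, 1_000_000)))   # single-thread line
print(round(ns_per_poppush(169028, 1_000_000)))  # atomics line
```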


S.
 
-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 



Re: [OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2

2018-06-16 Thread Nathan Hjelm
Try the latest nightly tarball for v3.1.x. Should be fixed. 

> On Jun 16, 2018, at 5:48 PM, Hammond, Simon David via users 
>  wrote:
> 
> The output from the test in question is:
> 
> Single thread test. Time: 0 s 10182 us 10 nsec/poppush
> Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush
> 
> 
> S.