[OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2
Hi OpenMPI Team,

We have recently updated an install of OpenMPI on a POWER9 system (configuration details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. We are seeing a symptom where code that ran before is now locking up and making no progress, getting stuck in wait-all operations. While I think it's prudent for us to root-cause this a little more, I have gone back, rebuilt MPI, and re-run the "make check" tests. The opal_fifo test appears to hang forever. I am not sure whether this is the cause of our issue, but I wanted to report that we are seeing it on our system.

OpenMPI 3.1.0 configuration:

./configure --prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88 \
    --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java \
    --with-lsf=/opt/lsf/10.1 \
    --with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib \
    --with-verbs

GCC is 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA for POWER9 (standard download from their website). We enable IBM's JDK 8.0.0.

RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo)

Output:

make[3]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
make[4]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo

Output from top:

20 0 73280 4224 2560 S 800.0 0.0 17:22.94 lt-opal_fifo

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
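[Editor's note: to poke at the hang in isolation, the failing unit test can be re-run on its own from the build tree. The sketch below uses the build path from the report; the exact commands are assumptions based on standard automake/libtool layout, not taken from the thread.]

```shell
# Re-run only the hanging unit test from the OpenMPI 3.1.0 build tree
# (path taken from the make output in the report).
cd /home/sdhammo/openmpi/openmpi-3.1.0/test/class

# Automake's TESTS variable limits "make check" to the named test.
make check TESTS=opal_fifo

# Or attach a debugger while it spins to see where the threads are stuck;
# libtool runs the real binary behind an lt- wrapper (hence "lt-opal_fifo"
# in the top output above).
gdb --args ./opal_fifo
```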
Re: [OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2
The output from the test in question is:

Single thread test. Time: 0 s 10182 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush

S.

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]

On 6/16/18, 5:45 PM, "Hammond, Simon David" wrote:
> [...]
Re: [OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2
Try the latest nightly tarball for v3.1.x. Should be fixed.

> On Jun 16, 2018, at 5:48 PM, Hammond, Simon David via users wrote:
>
> The output from the test in question is:
>
> Single thread test. Time: 0 s 10182 us 10 nsec/poppush
> Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush
>
> [...]
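[Editor's note: fetching and checking the suggested nightly might look like the sketch below. The tarball name changes nightly and the "-latest" filename is an assumption; check the download page for the current snapshot. Configure flags are trimmed from the original report.]

```shell
# Hypothetical example: build a v3.1.x nightly snapshot and re-run the tests.
# The exact tarball name is an assumption -- see
# https://www.open-mpi.org/nightly/v3.1.x/ for the current file.
wget https://www.open-mpi.org/nightly/v3.1.x/openmpi-v3.1.x-latest.tar.gz
tar xzf openmpi-v3.1.x-latest.tar.gz
cd openmpi-v3.1.x-*

# Same CUDA/verbs options as the original 3.1.0 build, reduced to essentials.
./configure --prefix=$HOME/openmpi-3.1-nightly \
    --with-cuda=$CUDA_ROOT --with-verbs

make -j all
make check   # opal_fifo should now complete rather than hang
```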