Hi,

Open MPI 2.1.3 and 2.1.4 have a bug in shared memory communication. The
Open MPI community is preparing 2.1.5 to fix it.
  https://github.com/open-mpi/ompi/pull/5536

Could you try this patch?

  https://github.com/open-mpi/ompi/commit/6086b52719ed02725dfa5e91c0d12c3c66a8e168

Or, use the 2.1.5rc1 (release candidate)?

  https://www.open-mpi.org/software/ompi/v2.1/

Thanks,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Hi OpenMPI Users,
>
> I am compiling OpenMPI 2.1.4 with the ARM 18.4.0 HPC Compiler on our ARM
> ThunderX2 system. Configuration options are below. For now, I am using the
> simplest configuration test we can use on our system.
>
> If I use the OpenMPI 2.1.4 that I have compiled and do a simple 4-rank run
> of the IMB MPI benchmark on a single node (so using shared memory for
> communication), the test hangs at the 4-rank case (see below). All four
> processes seem to be spinning at 100% of a single core.
>
> Configure line:
> ./configure --prefix=/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0 \
>     --with-slurm --enable-mpi-thread-multiple \
>     CC=`which armclang` CXX=`which armclang++` FC=`which armflang`
>
> #----------------------------------------------------------------
> # Benchmarking Allreduce
> # #processes = 4
> #----------------------------------------------------------------
>     #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>          0         1000         0.02         0.02         0.02
>          4         1000         2.31         2.31         2.31
>          8         1000         2.37         2.37         2.37
>         16         1000         2.46         2.46         2.46
>         32         1000         2.46         2.46         2.46
> <Hang forever>
>
> When I use GDB to halt the code on one of the ranks and perform a
> backtrace, I seem to get the following stacks repeated (in a loop).
>
> #0  0x0000ffffbe3e765c in opal_timer_linux_get_cycles_sys_timer ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
> #1  0x0000ffffbe36d910 in opal_progress ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
> #2  0x0000ffffbe6f2568 in ompi_request_default_wait ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #3  0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #4  0x0000ffffbe703000 in PMPI_Barrier ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #5  0x0000000000402554 in main ()
> (gdb) c
> Continuing.
> ^C
> Program received signal SIGINT, Interrupt.
> 0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
> (gdb) bt
> #0  0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
> #1  0x0000ffffb793f544 in btl_openib_component_progress ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/openmpi/mca_btl_openib.so
> #2  0x0000ffffbe36d980 in opal_progress ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
> #3  0x0000ffffbe6f2568 in ompi_request_default_wait ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #4  0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #5  0x0000ffffbe703000 in PMPI_Barrier ()
>     from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #6  0x0000000000402554 in main ()
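
P.S. If it helps, here is a rough sketch of how one might rebuild against the
2.1.5rc1 tarball with the same configure options as the 2.1.4 build quoted
above. The install prefix and compiler names are assumptions copied from the
quoted configure line, and the tarball is assumed to have already been
downloaded from the v2.1 page linked above; adjust both for your environment.

  # Sketch: build 2.1.5rc1 with the same options as the quoted 2.1.4 build.
  # Prefix and compilers are assumptions taken from the quoted configure line.
  tar xjf openmpi-2.1.5rc1.tar.bz2
  cd openmpi-2.1.5rc1
  ./configure --prefix=/home/projects/arm64-tx2/openmpi/2.1.5rc1/arm/18.4.0 \
      --with-slurm --enable-mpi-thread-multiple \
      CC=`which armclang` CXX=`which armclang++` FC=`which armflang`
  make -j 8 && make install

Installing into a separate prefix (as sketched) lets you keep the 2.1.4 build
around for comparison while you test whether the hang disappears.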