Hi,

Open MPI 2.1.3 and 2.1.4 have a bug in shared-memory communication.
The Open MPI community is preparing 2.1.5 to fix it.

  https://github.com/open-mpi/ompi/pull/5536

Could you try this patch?

  https://github.com/open-mpi/ompi/commit/6086b52719ed02725dfa5e91c0d12c3c66a8e168
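
GitHub can serve that commit as a patch file, so applying it to your
2.1.4 source tree should look roughly like this (an untested sketch;
adjust the paths to your setup):

  cd openmpi-2.1.4
  wget https://github.com/open-mpi/ompi/commit/6086b52719ed02725dfa5e91c0d12c3c66a8e168.patch
  patch -p1 < 6086b52719ed02725dfa5e91c0d12c3c66a8e168.patch

Then rebuild and reinstall with your existing configure options.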

Alternatively, could you try the 2.1.5rc1 release candidate?

  https://www.open-mpi.org/software/ompi/v2.1/
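
If you go the rc1 route, building it with your existing options should
be roughly (a sketch; take the exact tarball name from the page above,
and point --prefix at a new install directory):

  tar xjf openmpi-2.1.5rc1.tar.bz2
  cd openmpi-2.1.5rc1
  ./configure --prefix=<your 2.1.5rc1 prefix> --with-slurm \
      --enable-mpi-thread-multiple \
      CC=`which armclang` CXX=`which armclang++` FC=`which armflang`
  make -j8 all && make install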

Thanks,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Hi OpenMPI Users,
> 
> I am compiling OpenMPI 2.1.4 with the ARM 18.4.0 HPC Compiler on our ARM 
> ThunderX2 system. Configuration options below. For now, I am using the 
> simplest configuration test we can use on our system.
> 
> If I use the OpenMPI 2.1.4 which I have compiled and run a simple 4 rank run 
> of the IMB MPI benchmark on a single node (so using shared memory for 
> communication), the test will hang at the 4-rank test case (see below). All 
> four processes seem to be spinning at 100% of a single core.
> 
> Configure Line: ./configure 
> --prefix=/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0 --with-slurm 
> --enable-mpi-thread-multiple CC=`which armclang` CXX=`which armclang++` 
> FC=`which armflang`
> 
> #----------------------------------------------------------------
> # Benchmarking Allreduce
> # #processes = 4
> #----------------------------------------------------------------
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>             0         1000         0.02         0.02         0.02
>             4         1000         2.31         2.31         2.31
>             8         1000         2.37         2.37         2.37
>            16         1000         2.46         2.46         2.46
>            32         1000         2.46         2.46         2.46 
> <Hang forever>
> 
> When I use GDB to halt the code on one of the ranks and perform a 
> backtrace, I seem to get the following stacks repeated (in a loop).
> 
> #0  0x0000ffffbe3e765c in opal_timer_linux_get_cycles_sys_timer ()
>    from 
> /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
> #1  0x0000ffffbe36d910 in opal_progress ()
>    from 
> /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
> #2  0x0000ffffbe6f2568 in ompi_request_default_wait ()
>    from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #3  0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
>    from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #4  0x0000ffffbe703000 in PMPI_Barrier () from 
> /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #5  0x0000000000402554 in main ()
> (gdb) c
> Continuing.
> ^C
> Program received signal SIGINT, Interrupt.
> 0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
> (gdb) bt
> #0  0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
> #1  0x0000ffffb793f544 in btl_openib_component_progress ()
>    from 
> /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/openmpi/mca_btl_openib.so
> #2  0x0000ffffbe36d980 in opal_progress ()
>    from 
> /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
> #3  0x0000ffffbe6f2568 in ompi_request_default_wait ()
>    from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #4  0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
>    from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #5  0x0000ffffbe703000 in PMPI_Barrier () from 
> /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
> #6  0x0000000000402554 in main ()