Hi Open MPI Users,

I am compiling Open MPI 2.1.4 with the Arm 18.4.0 HPC compiler on our Arm 
ThunderX2 system (configure line below). For now, I am using the simplest 
test configuration we can run on our system.

If I take the Open MPI 2.1.4 I have compiled and run a simple 4-rank run of 
the IMB MPI benchmark on a single node (so shared memory is used for 
communication), the test hangs in the 4-process case (see below). All four 
processes appear to spin at 100%, each on a single core.

Configure Line: ./configure 
--prefix=/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0 --with-slurm 
--enable-mpi-thread-multiple CC=`which armclang` CXX=`which armclang++` 
FC=`which armflang`
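
For reference, the launch is along these lines (a representative command 
rather than the exact one; the binary path on our system differs):

mpirun -np 4 ./IMB-MPI1 Allreduce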

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 4
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            4         1000         2.31         2.31         2.31
            8         1000         2.37         2.37         2.37
           16         1000         2.46         2.46         2.46
           32         1000         2.46         2.46         2.46 
<Hang forever>
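
In case it is useful for reproducing this, below is a minimal sketch of my 
own (not IMB's actual code) that exercises the same pattern: repeated 
Allreduce at growing message sizes, with a Barrier between sizes, which is 
where the stacks further down suggest the ranks are stuck.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Allreduce at doubling message sizes, 1000 repetitions each, like IMB */
    for (int bytes = 4; bytes <= 64; bytes *= 2) {
        int count = bytes / (int)sizeof(float);
        float *in  = malloc(bytes);
        float *out = malloc(bytes);
        for (int i = 0; i < count; i++) in[i] = 1.0f;

        for (int rep = 0; rep < 1000; rep++)
            MPI_Allreduce(in, out, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        /* Synchronize between sizes; the backtraces below are inside a
           recursive-doubling barrier like this one. */
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("%d bytes done\n", bytes);

        free(in);
        free(out);
    }

    MPI_Finalize();
    return 0;
}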

When I use GDB to halt one of the ranks and take repeated backtraces, I seem 
to get the following two stacks alternating in a loop.

#0  0x0000ffffbe3e765c in opal_timer_linux_get_cycles_sys_timer ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#1  0x0000ffffbe36d910 in opal_progress ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#2  0x0000ffffbe6f2568 in ompi_request_default_wait ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#3  0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#4  0x0000ffffbe703000 in PMPI_Barrier ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#5  0x0000000000402554 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
(gdb) bt
#0  0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
#1  0x0000ffffb793f544 in btl_openib_component_progress ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/openmpi/mca_btl_openib.so
#2  0x0000ffffbe36d980 in opal_progress ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#3  0x0000ffffbe6f2568 in ompi_request_default_wait ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#4  0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#5  0x0000ffffbe703000 in PMPI_Barrier ()
   from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#6  0x0000000000402554 in main ()
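
One thing I notice from the second stack is that the openib BTL is being 
polled even though this is a single-node run. My next step is to try 
excluding it, to see whether the hang is in the RDMA progress path. These 
are standard MCA parameters, so something along these lines:

mpirun --mca btl self,vader -np 4 ./IMB-MPI1 Allreduce   # shared memory only
mpirun --mca btl ^openib -np 4 ./IMB-MPI1 Allreduce      # just exclude openib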

 
-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
 
