Hi Open MPI Users,

I am building Open MPI 2.1.4 with the ARM 18.4.0 HPC Compiler on our ARM ThunderX2 system (configure line below). For now, this is the simplest configuration we can test on our system.
If I use the Open MPI 2.1.4 I have compiled and do a simple 4-rank run of the IMB MPI benchmark on a single node (so using shared memory for communication), the run hangs in the 4-process Allreduce test case (output below). All four processes appear to be spinning at 100% of a single core.

Configure line:

./configure --prefix=/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0 --with-slurm --enable-mpi-thread-multiple CC=`which armclang` CXX=`which armclang++` FC=`which armflang`

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 4
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            4         1000         2.31         2.31         2.31
            8         1000         2.37         2.37         2.37
           16         1000         2.46         2.46         2.46
           32         1000         2.46         2.46         2.46
<Hangs forever>

When I use GDB to halt the code on one of the ranks and take a backtrace, I seem to get the following two stacks repeated (in a loop):

#0  0x0000ffffbe3e765c in opal_timer_linux_get_cycles_sys_timer () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#1  0x0000ffffbe36d910 in opal_progress () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#2  0x0000ffffbe6f2568 in ompi_request_default_wait () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#3  0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#4  0x0000ffffbe703000 in PMPI_Barrier () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#5  0x0000000000402554 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
(gdb) bt
#0  0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
#1  0x0000ffffb793f544 in btl_openib_component_progress () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/openmpi/mca_btl_openib.so
#2  0x0000ffffbe36d980 in opal_progress () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#3  0x0000ffffbe6f2568 in ompi_request_default_wait () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#4  0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#5  0x0000ffffbe703000 in PMPI_Barrier () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#6  0x0000000000402554 in main ()

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
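P.S. In case it is useful, here is a minimal standalone sketch of the pattern I believe the benchmark is exercising (repeated MPI_Allreduce followed by an MPI_Barrier). This is my own reduction of the IMB loop, not the IMB source, so the message sizes and repetition counts are only illustrative:

/* allreduce_barrier.c - minimal sketch of the IMB Allreduce pattern:
 * repeated MPI_Allreduce at increasing sizes, then an MPI_Barrier.
 * Not the IMB source; sizes/reps chosen only to mirror the output above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int sizes[] = { 0, 4, 8, 16, 32, 64 };   /* bytes */
    const int nsizes  = (int)(sizeof(sizes) / sizeof(sizes[0]));
    const int reps    = 1000;

    for (int s = 0; s < nsizes; s++) {
        int count = sizes[s] / (int)sizeof(float);

        /* allocate at least one element so the buffers are always valid */
        float *src = malloc((count > 0 ? count : 1) * sizeof(float));
        float *dst = malloc((count > 0 ? count : 1) * sizeof(float));
        for (int i = 0; i < count; i++) src[i] = 1.0f;

        for (int r = 0; r < reps; r++)
            MPI_Allreduce(src, dst, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        /* the IMB run appears to stop making progress in a barrier like this */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0) printf("completed %d byte case\n", sizes[s]);

        free(src);
        free(dst);
    }

    MPI_Finalize();
    return 0;
}

Built with "mpicc allreduce_barrier.c" and run with "mpirun -np 4 ./a.out" on a single node. Since the second backtrace shows btl_openib_component_progress being polled even though everything is on one node, I can also try forcing the shared-memory path only, e.g. something like "mpirun -np 4 --mca btl vader,self ./IMB-MPI1 Allreduce", to see whether taking the openib BTL out of the progress loop changes the behaviour.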