Hello list,

I am trying to run IMB with openmpi-2.0.1/2.1.0 on a 2-node 50G cluster in my lab, but the test does not start. It fails with the following error:
Starting for 0 th iteration.
Using openmpi
LOGPATH: /MPI/Logs/openmpi/imb/runlog-openmpi-np6-n2-0
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host: calypso-rhel73GA
  Local device: bnxt_re0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[25467,1],0]) is on host: calypso-rhel73GA
  Process 2 ([[25467,1],1]) is on host: pandora-rhel73GA
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[calypso-rhel73GA:12532] *** An error occurred in MPI_Bcast
[calypso-rhel73GA:12532] *** reported by process [140683322785793,0]
[calypso-rhel73GA:12532] *** on communicator MPI_COMM_WORLD
[calypso-rhel73GA:12532] *** MPI_ERR_INTERN: internal error
[calypso-rhel73GA:12532] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[calypso-rhel73GA:12532] *** and potentially your MPI job)
*** Error in `/usr/local/imb/openmpi/dcheck/IMB-MPI1': free(): invalid pointer: 0x00007ff37b2f34d8 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7c503)[0x7ff37a9ac503]
/usr/local/mpi/openmpi/lib/libmpi.so.20(+0x58d17)[0x7ff37af65d17]
/usr/local/mpi/openmpi/lib/libmpi.so.20(ompi_mpi_errors_are_fatal_comm_handler+0x105)[0x7ff37af66485]
/usr/local/mpi/openmpi/lib/libmpi.so.20(ompi_errhandler_invoke+0x115)[0x7ff37af659c5]
/usr/local/mpi/openmpi/lib/libmpi.so.20(MPI_Bcast+0x1a3)[0x7ff37af86743]
/usr/local/imb/openmpi/dcheck/IMB-MPI1[0x402dd7]
/usr/local/imb/openmpi/dcheck/IMB-MPI1[0x401e0b]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff37a951b35]
/usr/local/imb/openmpi/dcheck/IMB-MPI1[0x402744]
======= Memory map: ========
00400000-00415000 r-xp 00000000 fd:00 33970734 /usr/local/imb/openmpi/dcheck/IMB-MPI1
00614000-00615000 r--p 00014000 fd:00 33970734 /usr/local/imb/openmpi/dcheck/IMB-MPI1
00615000-00616000 rw-p 00015000 fd:00 33970734 /usr/local/imb/openmpi/dcheck/IMB-MPI1
00616000-0061a000 rw-p 00000000 00:00 0
00f2c000-01071000 rw-p 00000000 00:00 0 [heap]
7ff35ffff000-7ff368000000 rw-s 00000000 fd:00 17524899 /tmp/openmpi-sessions-0@calypso-rhel73GA_0/25467/1/shared_mem_pool.calypso-rhel73GA (deleted)
7ff368000000-7ff368021000 rw-p 00000000 00:00 0
7ff368021000-7ff36c000000 ---p 00000000 00:00 0
7ff36c000000-7ff36c021000 rw-p 00000000 00:00 0
7ff36c021000-7ff370000000 ---p 00000000 00:00 0
7ff370000000-7ff370021000 rw-p 00000000 00:00 0
7ff370021000-7ff374000000 ---p 00000000 00:00 0
7ff374698000-7ff37469e000 r-xp 00000000 fd:00 51165705 /usr/local/lib/libbnxtre-rdmav2.so
7ff37469e000-7ff37489d000 ---p 00006000 fd:00 51165705 /usr/local/lib/libbnxtre-rdmav2.so
7ff37489d000-7ff37489e000 r--p 00005000 fd:00 51165705 /usr/local/lib/libbnxtre-rdmav2.so
7ff37489e000-7ff37489f000 rw-p 00006000 fd:00 51165705 /usr/local/lib/libbnxtre-rdmav2.so
7ff37489f000-7ff3748a4000 r-xp 00000000 fd:00 252310528 /usr/lib64/libibverbs/libcxgb3-rdmav2.so
7ff3748a4000-7ff374aa3000 ---p 00005000 fd:00 252310528 /usr/lib64/libibverbs/libcxgb3-rdmav2.so
7ff374aa3000-7ff374aa4000 r--p 00004000 fd:00 252310528 /usr/lib64/libibverbs/libcxgb3-rdmav2.so
7ff374aa4000-7ff374aa5000 rw-p 00005000 fd:00 252310528 /usr/lib64/libibverbs/libcxgb3-rdmav2.so
7ff374aa5000-7ff374aac000 r-xp 00000000 fd:00 252310529 /usr/lib64/libibverbs/libcxgb4-rdmav2.so
7ff374aac000-7ff374cab000 ---p 00007000 fd:00 252310529 /usr/lib64/libibverbs/libcxgb4-rdmav2.so
7ff374cab000-7ff374cac000 r--p 00006000 fd:00 252310529 /usr/lib64/libibverbs/libcxgb4-rdmav2.so
7ff374cac000-7ff374cad000 rw-p 00007000 fd:00 252310529 /usr/lib64/libibverbs/libcxgb4-rdmav2.so
7ff374cad000-7ff374cb1000 r-xp 00000000 fd:00 252310530 /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so
7ff374cb1000-7ff374eb0000 ---p 00004000 fd:00 252310530 /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so
7ff374eb0000-7ff374eb1000 r--p 00003000 fd:00 252310530 /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so
7ff374eb1000-7ff374eb2000 rw-p 00004000 fd:00 252310530 /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so
7ff374eb2000-7ff374eb7000 r-xp 00000000 fd:00 252310531 /usr/lib64/libibverbs/libhns-rdmav2.so
7ff374eb7000-7ff3750b6000 ---p 00005000 fd:00 252310531 /usr/lib64/libibverbs/libhns-rdmav2.so
7ff3750b6000-7ff3750b7000 r--p 00004000 fd:00 252310531 /usr/lib64/libibverbs/libhns-rdmav2.so
7ff3750b7000-7ff3750b8000 rw-p 00005000 fd:00 252310531 /usr/lib64/libibverbs/libhns-rdmav2.so
7ff3750b8000-7ff3750be000 r-xp 00000000 fd:00 252310532 /usr/lib64/libibverbs/libi40iw-rdmav2.so
7ff3750be000-7ff3752be000 ---p 00006000 fd:00 252310532 /usr/lib64/libibverbs/libi40iw-rdmav2.so
7ff3752be000-7ff3752bf000 r--p 00006000 fd:00 252310532 /usr/lib64/libibverbs/libi40iw-rdmav2.so
7ff3752bf000-7ff3752c0000 rw-p 00007000 fd:00 252310532 /usr/lib64/libibverbs/libi40iw-rdmav2.so
7ff3752c0000-7ff3752c4000 r-xp 00000000 fd:00 252310533 /usr/lib64/libibverbs/libipathverbs-rdmav2.so
7ff3752c4000-7ff3754c3000 ---p 00004000 fd:00 252310533 /usr/lib64/libibverbs/libipathverbs-rdmav2.so
7ff3754c3000-7ff3754c4000 r--p 00003000 fd:00 252310533 /usr/lib64/libibverbs/libipathverbs-rdmav2.so
7ff3754c4000-7ff3754c5000 rw-p 00004000 fd:00 252310533 /usr/lib64/libibverbs/libipathverbs-rdmav2.so
7ff3754c5000-7ff3754cd000 r-xp 00000000 fd:00 252310534 /usr/lib64/libibverbs/libmlx4-rdmav2.so
7ff3754cd000-7ff3756cc000 ---p 00008000 fd:00 252310534 /usr/lib64/libibverbs/libmlx4-rdmav2.so
7ff3756ce000-7ff3756e5000 r-xp 00000000 fd:00 252310535 /usr/lib64/libibverbs/libmlx5-rdmav2.so
7ff3756e5000-7ff3758e4000 ---p 00017000 fd:00 252310535 /usr/lib64/libibverbs/libmlx5-rdmav2.so
7ff3758e4000-7ff3758e5000 r--p 00016000 fd:00 252310535 /usr/lib64/libibverbs/libmlx5-rdmav2.so
7ff3758e5000-7ff3758e6000 rw-p 00017000 fd:00 252310535 /usr/lib64/libibverbs/libmlx5-rdmav2.so
7ff3758e6000-7ff3758ee000 r-xp 00000000 fd:00 252310536 /usr/lib64/libibverbs/libmthca-rdmav2.so
7ff3758ee000-7ff375aed000 ---p 00008000 fd:00 252310536 /usr/lib64/libibverbs/libmthca-rdmav2.so
7ff375aed000-7ff375aee000 r--p 00007000 fd:00 252310536 /usr/lib64/libibverbs/libmthca-rdmav2.so
7ff375aee000-7ff375aef000 rw-p 00008000 fd:00 252310536 /usr/lib64/libibverbs/libmthca-rdmav2.so
7ff375aef000-7ff375af4000 r-xp 00000000 fd:00 252310537 /usr/lib64/libibverbs/libnes-rdmav2.so
7ff375af4000-7ff375cf3000 ---p 00005000 fd:00 252310537 /usr/lib64/libibverbs/libnes-rdmav2.so
7ff375cf3000-7ff375cf4000 r--p 00004000 fd:00 252310537 /usr/lib64/libibverbs/libnes-rdmav2.so
7ff375cf4000-7ff375cf5000 rw-p 00005000 fd:00 252310537 /usr/lib64/libibverbs/libnes-rdmav2.so
7ff375cf5000-7ff375cfb000 r-xp 00000000 fd:00 252310538 /usr/lib64/libibverbs/libocrdma-rdmav2.so
7ff375cfb000-7ff375efa000 ---p 00006000 fd:00 252310538 /usr/lib64/libibverbs/libocrdma-rdmav2.so
7ff375efa000-7ff375efb000 r--p 00005000 fd:00 252310538 /usr/lib64/libibverbs/libocrdma-rdmav2.so
[calypso-rhel73GA:12532] *** Process received signal ***
[calypso-rhel73GA:12532] Signal: Aborted (6)
[calypso-rhel73GA:12532] Signal code: (-6)
[calypso-rhel73GA:12532] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7ff37ad00370]
[calypso-rhel73GA:12532] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ff37a9651d7]
[calypso-rhel73GA:12532] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ff37a9668c8]
[calypso-rhel73GA:12532] [ 3] /lib64/libc.so.6(+0x74f07)[0x7ff37a9a4f07]
[calypso-rhel73GA:12532] [ 4] /lib64/libc.so.6(+0x7c503)[0x7ff37a9ac503]
[calypso-rhel73GA:12532] [ 5] /usr/local/mpi/openmpi/lib/libmpi.so.20(+0x58d17)[0x7ff37af65d17]
[calypso-rhel73GA:12532] [ 6] /usr/local/mpi/openmpi/lib/libmpi.so.20(ompi_mpi_errors_are_fatal_comm_handler+0x105)[0x7ff37af66485]
[calypso-rhel73GA:12532] [ 7] /usr/local/mpi/openmpi/lib/libmpi.so.20(ompi_errhandler_invoke+0x115)[0x7ff37af659c5]
[calypso-rhel73GA:12532] [ 8] /usr/local/mpi/openmpi/lib/libmpi.so.20(MPI_Bcast+0x1a3)[0x7ff37af86743]
[calypso-rhel73GA:12532] [ 9] /usr/local/imb/openmpi/dcheck/IMB-MPI1[0x402dd7]
[calypso-rhel73GA:12532] [10] /usr/local/imb/openmpi/dcheck/IMB-MPI1[0x401e0b]
[calypso-rhel73GA:12532] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff37a951b35]
[calypso-rhel73GA:12532] [12] /usr/local/imb/openmpi/dcheck/IMB-MPI1[0x402744]
[calypso-rhel73GA:12532] *** End of error message ***

These are the run-time parameters I used:

mpirun -np 6 -hostfile ./hostfile --mca btl openib,self,sm --mca btl_openib_receive_queues P,65536,256,192,128 -mca btl_openib_cpc_include rdmacm -mca pml ob1 --allow-run-as-root --bind-to none --map-by node /usr/local/imb/openmpi/IMB-MPI1

After digging a little in the Open MPI source code, I found that Open MPI fails because the speed reported by my device is "64" (50G link speed), which opal_common_verbs_port_bw() does not recognize. It worked only after I applied this patch to the source:

diff --git a/opal/mca/common/verbs/common_verbs_port.c b/opal/mca/common/verbs/common_verbs_port.c
index 831ba3f..e1d5834 100644
--- a/opal/mca/common/verbs/common_verbs_port.c
+++ b/opal/mca/common/verbs/common_verbs_port.c
@@ -68,6 +68,10 @@ int opal_common_verbs_port_bw(struct ibv_port_attr *port_attr,
         /* EDR: 25.78125 Gbps * 64/66, in megabits */
         *bandwidth = 25000;
         break;
+    case 64:
+        /* 50 Gbps link, in megabits */
+        *bandwidth = 50000;
+        break;
     default:
         /* Who knows? */
         return OPAL_ERR_NOT_FOUND;

I think this change needs to be included in Open MPI to support 50G RoCE devices. The free() crash shown above (free(): invalid pointer, raised while the job was aborting) still needs someone's attention.
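To make the failure mode in the patch above easier to see, here is a minimal stand-alone sketch of the same kind of lookup. It is not the Open MPI code: the name sketch_speed_to_mbps and the SKETCH_* return codes are hypothetical, only the two speed codes visible in the diff are included, and the rest of the real function is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative return codes only; Open MPI uses its own OPAL_* constants. */
enum { SKETCH_OK = 0, SKETCH_NOT_FOUND = -1 };

/*
 * Map a verbs active_speed code to a bandwidth in megabits per second.
 * Hypothetical helper: only the two codes that appear in the diff are
 * handled, every other code is reported as unknown.
 */
int sketch_speed_to_mbps(uint8_t active_speed, uint32_t *bandwidth)
{
    assert(bandwidth != NULL);

    switch (active_speed) {
    case 32:
        *bandwidth = 25000;  /* value of the existing EDR case */
        return SKETCH_OK;
    case 64:
        *bandwidth = 50000;  /* value the patch adds for a 50G link */
        return SKETCH_OK;
    default:
        /* Unknown code: the caller cannot determine the bandwidth. */
        return SKETCH_NOT_FOUND;
    }
}
```

In this sketch, code 64 yields 50000, while any unlisted code returns SKETCH_NOT_FOUND. That corresponds to the OPAL_ERR_NOT_FOUND path in the default case, which is what the unpatched code takes for my device.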
Following are the entries in the .ini file (just for reference):

vendor_id = 0x14e4
vendor_part_id = 0x16d7
use_eager_rdma = 1
mtu = 1024
receive_queues = P,65536,256,192,128
max_inline_data = 96

-Regards
Devesh