We have recently encountered a problem using Open MPI 1.5.3, 1.5.4, and 1.6.2 across compute nodes with two different generations of InfiniBand HCAs (DDR and QDR). The error is very similar to one posted to this list in 2011, which was never resolved there: http://www.open-mpi.org/community/lists/users/2011/06/16773.php
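For reference, testmpi.c (compiled to ./a.out in the transcripts below) is a trivial MPI hello world. The exact source is not reproduced in this message, but a minimal equivalent that produces the same output is:

/* Minimal stand-in for testmpi.c -- the original source is not shown
 * in this message.  Prints "Hello from <host>: <rank> of <size>".
 * The trailing "0" argument on the mpirun command lines below is
 * simply ignored. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);   /* the call that aborts on mixed hosts under 1.5.x/1.6.2 */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hello from %s: %d of %d\n", host, rank, size);
    MPI_Finalize();
    return 0;
}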
Here is the error:

#################################################################
iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.5.4/gcc-4.4.5/bin/mpirun
iwtf-k43-28$ cat machinefile
iwtf-k43-28
iwm-k43-30
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile ./a.out 0
--------------------------------------------------------------------------
Open MPI detected two different OpenFabrics transport types in the same
Infiniband network.
Such mixed network trasport configuration is not supported by Open MPI.

  Local host:            iwm-k43-30.pace.gatech.edu
  Local adapter:         mthca0 (vendor 0x2c9, part ID 25204)
  Local transport type:  MCA_BTL_OPENIB_TRANSPORT_UNKNOWN

  Remote host:           iwtf-k43-28
  Remote Adapter:        (vendor 0x2c9, part ID 26428)
  Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
--------------------------------------------------------------------------
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
[iwtf-k43-28.pace.gatech.edu:12695] 1 more process has sent help message help-mpi-btl-openib.txt / conflicting transport types
[iwtf-k43-28.pace.gatech.edu:12695] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
--------------------------------------------------------------------------
Open MPI detected two different OpenFabrics transport types in the same
Infiniband network.
Such mixed network trasport configuration is not supported by Open MPI.

  Local host:            iwm-k43-30.pace.gatech.edu
  Local adapter:         mthca0 (vendor 0x2c9, part ID 25204)
  Local transport type:  MCA_BTL_OPENIB_TRANSPORT_UNKNOWN

  Remote host:           iwtf-k43-28
  Remote Adapter:        (vendor 0x2c9, part ID 26428)
  Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[34066,1],1]) is on host: iwm-k43-30.pace.gatech.edu
  Process 2 ([[34066,1],0]) is on host: iwtf-k43-28
  BTLs attempted: self openib

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem:

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[iwm-k43-30.pace.gatech.edu:9131] *** An error occurred in MPI_Init
[iwm-k43-30.pace.gatech.edu:9131] *** on a NULL communicator
[iwm-k43-30.pace.gatech.edu:9131] *** Unknown error
[iwm-k43-30.pace.gatech.edu:9131] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: iwm-k43-30.pace.gatech.edu
  PID:        9131
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 9131 on
node iwm-k43-30 exiting improperly.  There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did.  This can cause a job to hang indefinitely while it waits
for all processes to call "init".  By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mpi-btl-openib.txt / conflicting transport types
[iwtf-k43-28.pace.gatech.edu:13279] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
#################################################################

Open MPI 1.4.3 works as expected:

iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.4.3/gcc-4.4.5/bin/mpirun
iwtf-k43-28$ mpicc testmpi.c
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
######################################################################
######################################################################

1.5.4 runs fine on iwm-k43-30 by itself:

iwtf-k43-28$ cat machinefile
iwm-k43-30
iwm-k43-30
iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.5.4/gcc-4.4.5/bin/mpirun
iwtf-k43-28$ mpicc testmpi.c
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 0 of 2
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2

It is only when mixing and matching hosts that it fails.
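Since the help text pins the failure on transport-type classification, a useful additional data point is what libibverbs itself reports for each adapter. The following is a minimal sketch of such a check (it assumes libibverbs 1.1.x, where struct ibv_device exposes the transport_type field that ibv_devinfo prints as "transport:"; build with "gcc check_transport.c -libverbs" -- the file name is arbitrary):

/* Sketch: print the transport type libibverbs reports for each
 * local HCA.  Assumes struct ibv_device has a transport_type field
 * (present in libibverbs 1.1.x). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int i, num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (devs == NULL) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (i = 0; i < num; i++) {
        /* IBV_TRANSPORT_UNKNOWN is -1, IBV_TRANSPORT_IB is 0 */
        const char *t = devs[i]->transport_type == IBV_TRANSPORT_IB ? "IB"
                      : devs[i]->transport_type == IBV_TRANSPORT_IWARP ? "IWARP"
                      : "UNKNOWN";
        printf("%s: transport_type=%s\n", ibv_get_device_name(devs[i]), t);
    }
    ibv_free_device_list(devs);
    return 0;
}

Note that in the ibv_devinfo output below, both adapters report "transport: InfiniBand (0)", so Open MPI's MCA_BTL_OPENIB_TRANSPORT_UNKNOWN classification of mthca0 does not appear to come from this field alone.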
######################################################################
Relevant system information:

- Same error on RHEL6.2 and RHEL6.3.

iwtf-k43-28$ uname -a
Linux iwtf-k43-28.pace.gatech.edu 2.6.32-220.23.1.el6.x86_64 #1 SMP Tue Jun 12 11:20:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
iwm-k43-30$ uname -a
Linux iwm-k43-30.pace.gatech.edu 2.6.32-220.23.1.el6.x86_64 #1 SMP Tue Jun 12 11:20:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep -i verb
libibverbs-debuginfo-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64
libipathverbs-1.2-1.x86_64
libipathverbs-debuginfo-1.2-1.x86_64
libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64
libipathverbs-devel-1.2-1.x86_64

# rpm -qa | grep libmthca
libmthca-1.0.5-0.1.gbe5eef3.x86_64
libmthca-debuginfo-1.0.5-0.1.gbe5eef3.x86_64
libmthca-devel-static-1.0.5-0.1.gbe5eef3.x86_64

# rpm -qa | grep libmlx
libmlx4-devel-1.0.1-1.20.g6771d22.x86_64
libmlx4-debuginfo-1.0.1-1.20.g6771d22.x86_64
libmlx4-1.0.1-1.20.g6771d22.x86_64

1.4.3 "configure" flags:
  --with-tm=/opt/torque/2.4.3
  --with-io-romio-flags="--with-file-system=nfs+ufs+panfs"
  --enable-static

1.5.4 "configure" flags:
  --with-tm=/opt/torque/2.4.3
  --with-io-romio-flags="--with-file-system=nfs+ufs+panfs"
  --with-hwloc=/usr/local/packages/hwloc/1.2/
  --enable-static

1.6.2 "configure" flags:
  --with-tm=/opt/torque/2.4.3
  --with-io-romio-flags="--with-file-system=nfs+ufs+panfs"
  --enable-static
  --with-knem

iwm-k43-30# ibv_devinfo -v
hca_id: mthca0
    transport: InfiniBand (0)
    fw_ver: 1.2.0
    node_guid: 0002:c902:0029:8434
    sys_image_guid: 0002:c902:0029:8437
    vendor_id: 0x02c9
    vendor_part_id: 25204
    hw_ver: 0xA0
    board_id: MT_03B0150002
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffff000
    max_qp: 64512
    max_qp_wr: 16384
    device_cap_flags: 0x00001c76
    max_sge: 27
    max_sge_rd: 0
    max_cq: 65408
    max_cqe: 131071
    max_mr: 131056
    max_pd: 32764
    max_qp_rd_atom: 4
    max_ee_rd_atom: 0
    max_res_rd_atom: 258048
    max_qp_init_rd_atom: 128
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 0
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 8192
    max_mcast_qp_attach: 56
    max_total_mcast_qp_attach: 458752
    max_ah: 0
    max_fmr: 0
    max_srq: 960
    max_srq_wr: 16384
    max_srq_sge: 27
    max_pkeys: 64
    local_ca_ack_delay: 15
    port: 1
        state: PORT_ACTIVE (4)
        max_mtu: 2048 (4)
        active_mtu: 2048 (4)
        sm_lid: 465
        port_lid: 54
        port_lmc: 0x00
        link_layer: IB
        max_msg_sz: 0x80000000
        port_cap_flags: 0x02510a68
        max_vl_num: 4 (3)
        bad_pkey_cntr: 0x0
        qkey_viol_cntr: 0x0
        sm_sl: 0
        pkey_tbl_len: 64
        gid_tbl_len: 32
        subnet_timeout: 17
        init_type_reply: 0
        active_width: 4X (2)
        active_speed: 5.0 Gbps (2)
        phys_state: LINK_UP (5)
        GID[ 0]: fe80:0000:0000:0000:0002:c902:0029:8435
##################################################################
##################################################################
iwtf-k43-28# ibv_devinfo -v
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.9.1000
    node_guid: 0002:c903:004b:2170
    sys_image_guid: 0002:c903:004b:2173
    vendor_id: 0x02c9
    vendor_part_id: 26428
    hw_ver: 0xB0
    board_id: MT_0D90110009
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffe00
    max_qp: 261056
    max_qp_wr: 16351
    device_cap_flags: 0x007c9c76
    max_sge: 32
    max_sge_rd: 0
    max_cq: 65408
    max_cqe: 4194303
    max_mr: 524272
    max_pd: 32764
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4176896
    max_qp_init_rd_atom: 128
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 0
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 1
    max_mcast_grp: 8192
    max_mcast_qp_attach: 120
    max_total_mcast_qp_attach: 983040
    max_ah: 0
    max_fmr: 0
    max_srq: 65472
    max_srq_wr: 16383
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 15
    port: 1
        state: PORT_ACTIVE (4)
        max_mtu: 2048 (4)
        active_mtu: 2048 (4)
        sm_lid: 465
        port_lid: 35
        port_lmc: 0x00
        link_layer: IB
        max_msg_sz: 0x40000000
        port_cap_flags: 0x02510868
        max_vl_num: 8 (4)
        bad_pkey_cntr: 0x0
        qkey_viol_cntr: 0x0
        sm_sl: 0
        pkey_tbl_len: 128
        gid_tbl_len: 128
        subnet_timeout: 17
        init_type_reply: 0
        active_width: 4X (2)
        active_speed: 10.0 Gbps (4)
        phys_state: LINK_UP (5)
        GID[ 0]: fe80:0000:0000:0000:0002:c903:004b:2171
##################################################################

--
Wesley Emeneker, Research Scientist
The Partnership for an Advanced Computing Environment
Georgia Institute of Technology
404.385.2303
wesley.emene...@oit.gatech.edu
http://pace.gatech.edu