Dear all,

I have encountered a problem concerning running large jobs on an SMP cluster with Open MPI 1.4. The application needs all-to-all communication: each process sends messages to all other processes via MPI_Isend. It runs fine with 256 processes; the problem occurs once the process count reaches 512. A minimal sketch of the communication pattern is given below.
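Roughly, the pattern looks like this (simplified; the payload, tag, and buffer names here are placeholders, not the real application code, but the send/receive structure is the same):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int msg = rank;                       /* placeholder payload */
        int *recvbuf = malloc(size * sizeof(int));
        MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));
        int nreq = 0;

        /* Each rank posts a nonblocking receive from, and a
         * nonblocking send to, every other rank. */
        for (i = 0; i < size; i++) {
            if (i == rank) continue;
            MPI_Irecv(&recvbuf[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[nreq++]);
            MPI_Isend(&msg, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

        free(reqs);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }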
The error message is:

    mpirun -np 512 -machinefile machinefile.512 ./my_app
    [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
    ...
    [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect
    [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
    ...
    mpirun has exited due to process rank 424 with PID 26841 on node gh31 exiting without calling "finalize".

A related post (http://www.open-mpi.org/community/lists/users/2009/07/9786.php) suggests the job may be running out of HCA queue pair (QP) resources. So I checked my system configuration with 'ibv_devinfo -v' and got 'max_qp: 261056' (a small libibverbs sketch for reading this value programmatically is appended after the related threads below). If every pair of processes needs a QP, 256 processes would stay under the limit (256^2 = 65536 < 261056), but 512 processes would exceed it (512^2 = 262144 > 261056).

My question is: how can I increase the max_qp limit of the InfiniBand HCA, or how can I work around this problem in MPI? Thanks in advance for any help you may give!

Huiwei Lv
PhD Student at Institute of Computing Technology

-------------------------
p.s. The system information is provided below:

$ ompi_info -v ompi full --parsable
ompi:version:full:1.4
ompi:version:svn:r22285
ompi:version:release_date:Dec 08, 2009

$ uname -a
Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

$ ulimit -l
unlimited

$ ibv_devinfo -v
hca_id: mlx4_0
        transport:                 InfiniBand (0)
        fw_ver:                    2.7.000
        node_guid:                 00d2:c910:0001:b6c0
        sys_image_guid:            00d2:c910:0001:b6c3
        vendor_id:                 0x02c9
        vendor_part_id:            26428
        hw_ver:                    0xB0
        board_id:                  MT_0D20110009
        phys_port_cnt:             1
        max_mr_size:               0xffffffffffffffff
        page_size_cap:             0xfffffe00
        max_qp:                    261056
        max_qp_wr:                 16351
        device_cap_flags:          0x00fc9c76
        max_sge:                   32
        max_sge_rd:                0
        max_cq:                    65408
        max_cqe:                   4194303
        max_mr:                    524272
        max_pd:                    32764
        max_qp_rd_atom:            16
        max_ee_rd_atom:            0
        max_res_rd_atom:           4176896
        max_qp_init_rd_atom:       128
        max_ee_init_rd_atom:       0
        atomic_cap:                ATOMIC_HCA (1)
        max_ee:                    0
        max_rdd:                   0
        max_mw:                    0
        max_raw_ipv6_qp:           0
        max_raw_ethy_qp:           1
        max_mcast_grp:             8192
        max_mcast_qp_attach:       56
        max_total_mcast_qp_attach: 458752
        max_ah:                    0
        max_fmr:                   0
        max_srq:                   65472
        max_srq_wr:                16383
        max_srq_sge:               31
        max_pkeys:                 128
        local_ca_ack_delay:        15
        port:   1
                state:             PORT_ACTIVE (4)
                max_mtu:           2048 (4)
                active_mtu:        2048 (4)
                sm_lid:            86
                port_lid:          73
                port_lmc:          0x00
                link_layer:        IB
                max_msg_sz:        0x40000000
                port_cap_flags:    0x02510868
                max_vl_num:        8 (4)
                bad_pkey_cntr:     0x0
                qkey_viol_cntr:    0x0
                sm_sl:             0
                pkey_tbl_len:      128
                gid_tbl_len:       128
                subnet_timeout:    18
                init_type_reply:   0
                active_width:      4X (2)
                active_speed:      10.0 Gbps (4)
                phys_state:        LINK_UP (5)
                GID[  0]:          fe80:0000:0000:0000:00d2:c910:0001:b6c1

Related threads in the list:
http://www.open-mpi.org/community/lists/users/2009/07/9786.php
http://www.open-mpi.org/community/lists/users/2009/08/10456.php
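p.p.s. For completeness, the max_qp value shown by ibv_devinfo can also be read programmatically through libibverbs' ibv_query_device(); a minimal sketch (opens the first HCA found, most error handling omitted):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no IB devices found\n");
            return 1;
        }

        /* Open the first HCA (mlx4_0 in our case) and query its limits. */
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_device_attr attr;
        if (ctx && ibv_query_device(ctx, &attr) == 0)
            printf("%s: max_qp = %d\n",
                   ibv_get_device_name(devs[0]), attr.max_qp);

        if (ctx) ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }

    /* build with: gcc qp_query.c -o qp_query -libverbs */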