You may find some initial XRC tuning documentation here: https://svn.open-mpi.org/trac/ompi/ticket/1260
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Aug 1, 2011, at 11:41 AM, Yevgeny Kliteynik wrote:

> Hi,
>
> Please try running OMPI with XRC:
>
> mpirun --mca btl openib... --mca btl_openib_receive_queues
> X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...
>
> XRC (eXtended Reliable Connection) decreases the memory consumption
> of Open MPI by decreasing the number of QPs per machine.
>
> I'm not entirely sure that XRC is supported in OMPI 1.4, but I'm
> sure it is in later versions of the 1.4 series (1.4.3).
>
> BTW, I do know that the command line is extremely user friendly
> and completely intuitive... :-)
> I'll have an XRC entry on the OMPI FAQ web page in a day or two,
> and you can find more details about this issue there.
>
> OMPI FAQ: http://www.open-mpi.org/faq/?category=openfabrics
>
> -- YK
>
> On 28-Jul-11 7:53 AM, 吕慧伟 wrote:
>> Dear all,
>>
>> I have encountered a problem with running large jobs on an SMP cluster
>> with Open MPI 1.4.
>> The application needs all-to-all communication: each process sends messages
>> to all other processes via MPI_Isend. It runs fine with 256 processes;
>> the problem occurs when the number of processes is >= 512.
>>
>> The error message is:
>> mpirun -np 512 -machinefile machinefile.512 ./my_app
>>
>> [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one]
>> error creating qp errno says Cannot allocate memory
>> ...
>> [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb]
>> error in endpoint reply start connect
>> [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one]
>> error creating qp errno says Cannot allocate memory
>> ...
>> mpirun has exited due to process rank 424 with PID 26841 on
>> node gh31 exiting without calling "finalize".
>>
>> A related post
>> (http://www.open-mpi.org/community/lists/users/2009/07/9786.php) suggests it
>> may be running out of HCA QP resources. So I checked my system configuration
>> with 'ibv_devinfo -v' and got 'max_qp: 261056'. In my case, running with 256
>> processes would be under the limit: 256^2 = 65536 < 261056, but 512^2 =
>> 262144 > 261056.
>> My question is: how can I increase the max_qp number of InfiniBand, or how
>> can I get around this problem in MPI?
>>
>> Thanks in advance for any help you may give!
>>
>> Huiwei Lv
>> PhD Student at Institute of Computing Technology
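[A minimal C sketch of the all-to-all pattern described above; this is not the poster's actual code, and the message sizes and tags are illustrative. The point is that every rank posts a non-blocking send to every other rank, so the openib BTL ends up opening a connection, and allocating QPs, for every peer pair, which is roughly what the 256^2 / 512^2 estimate counts. The XRC receive-queue setting quoted above reduces that total by sharing QPs per machine rather than per peer.]

    /* all-to-all exchange via non-blocking point-to-point (sketch) */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));
        MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));
        int nreq = 0;

        /* one Irecv/Isend pair per peer: forces a connection to every rank */
        for (int peer = 0; peer < size; peer++) {
            if (peer == rank) continue;
            sendbuf[peer] = rank;
            MPI_Irecv(&recvbuf[peer], 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[nreq++]);
            MPI_Isend(&sendbuf[peer], 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

        free(sendbuf); free(recvbuf); free(reqs);
        MPI_Finalize();
        return 0;
    }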
>> -------------------------
>> p.s. The system information is provided below:
>> $ ompi_info -v ompi full --parsable
>> ompi:version:full:1.4
>> ompi:version:svn:r22285
>> ompi:version:release_date:Dec 08, 2009
>> $ uname -a
>> Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
>> $ ulimit -l
>> unlimited
>> $ ibv_devinfo -v
>> hca_id: mlx4_0
>>     transport:                  InfiniBand (0)
>>     fw_ver:                     2.7.000
>>     node_guid:                  00d2:c910:0001:b6c0
>>     sys_image_guid:             00d2:c910:0001:b6c3
>>     vendor_id:                  0x02c9
>>     vendor_part_id:             26428
>>     hw_ver:                     0xB0
>>     board_id:                   MT_0D20110009
>>     phys_port_cnt:              1
>>     max_mr_size:                0xffffffffffffffff
>>     page_size_cap:              0xfffffe00
>>     max_qp:                     261056
>>     max_qp_wr:                  16351
>>     device_cap_flags:           0x00fc9c76
>>     max_sge:                    32
>>     max_sge_rd:                 0
>>     max_cq:                     65408
>>     max_cqe:                    4194303
>>     max_mr:                     524272
>>     max_pd:                     32764
>>     max_qp_rd_atom:             16
>>     max_ee_rd_atom:             0
>>     max_res_rd_atom:            4176896
>>     max_qp_init_rd_atom:        128
>>     max_ee_init_rd_atom:        0
>>     atomic_cap:                 ATOMIC_HCA (1)
>>     max_ee:                     0
>>     max_rdd:                    0
>>     max_mw:                     0
>>     max_raw_ipv6_qp:            0
>>     max_raw_ethy_qp:            1
>>     max_mcast_grp:              8192
>>     max_mcast_qp_attach:        56
>>     max_total_mcast_qp_attach:  458752
>>     max_ah:                     0
>>     max_fmr:                    0
>>     max_srq:                    65472
>>     max_srq_wr:                 16383
>>     max_srq_sge:                31
>>     max_pkeys:                  128
>>     local_ca_ack_delay:         15
>>     port: 1
>>         state:                  PORT_ACTIVE (4)
>>         max_mtu:                2048 (4)
>>         active_mtu:             2048 (4)
>>         sm_lid:                 86
>>         port_lid:               73
>>         port_lmc:               0x00
>>         link_layer:             IB
>>         max_msg_sz:             0x40000000
>>         port_cap_flags:         0x02510868
>>         max_vl_num:             8 (4)
>>         bad_pkey_cntr:          0x0
>>         qkey_viol_cntr:         0x0
>>         sm_sl:                  0
>>         pkey_tbl_len:           128
>>         gid_tbl_len:            128
>>         subnet_timeout:         18
>>         init_type_reply:        0
>>         active_width:           4X (2)
>>         active_speed:           10.0 Gbps (4)
>>         phys_state:             LINK_UP (5)
>>         GID[  0]:               fe80:0000:0000:0000:00d2:c910:0001:b6c1
>>
>> Related threads in the list:
>> http://www.open-mpi.org/community/lists/users/2009/07/9786.php
>> http://www.open-mpi.org/community/lists/users/2009/08/10456.php
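[For reference, the max_qp and max_srq limits shown in the ibv_devinfo output above can also be read programmatically through libibverbs. A minimal sketch, assuming the first device returned by ibv_get_device_list() is the HCA of interest; compile with "gcc qp_limits.c -libverbs".]

    /* Query HCA device limits (max_qp, max_srq, max_cq) via libibverbs. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no InfiniBand devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_device_attr attr;
        if (ctx && ibv_query_device(ctx, &attr) == 0)
            printf("%s: max_qp=%d max_srq=%d max_cq=%d\n",
                   ibv_get_device_name(devs[0]),
                   attr.max_qp, attr.max_srq, attr.max_cq);

        if (ctx) ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }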