Hi,

Please try running OMPI with XRC:

  mpirun --mca btl openib... --mca btl_openib_receive_queues \
    X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...

XRC (eXtended Reliable Connection) decreases Open MPI's memory
consumption by decreasing the number of QPs per machine.
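
As a rough back-of-the-envelope sketch (the layout numbers here are purely
hypothetical: 512 ranks spread over 64 nodes, 8 ranks per node, and 4 QPs
per connection as in the receive_queues spec above), you can see why this
matters:

  # RC: each local rank opens a connection (4 QPs) to every remote rank
  $ echo $(( 8 * (512 - 8) * 4 ))
  16128
  # XRC: QPs are shared per remote *node*, not per remote rank
  $ echo $(( 8 * (64 - 1) * 4 ))
  2016

So the per-HCA QP count drops by roughly a factor of the ranks-per-node.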

I'm not entirely sure that XRC is supported in OMPI 1.4, but I'm
sure it is in later versions of the 1.4 series (1.4.3).
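
If you want to double-check what your installed version actually supports,
one quick (and admittedly rough) check is to ask ompi_info whether the
openib BTL exposes the receive_queues parameter:

  # list the openib BTL parameters and look for receive_queues
  $ ompi_info --param btl openib | grep receive_queues

Note that this only tells you the parameter exists; whether the 'X' queue
type is accepted also depends on the OFED your Open MPI was built against.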

BTW, I do know that the command line is extremely user friendly
and completely intuitive... :-)
I'll add an XRC entry to the OMPI FAQ web page in a day or two,
where you can find more details about this issue.

OMPI FAQ: http://www.open-mpi.org/faq/?category=openfabrics

-- YK

On 28-Jul-11 7:53 AM, 吕慧伟 (Huiwei Lv) wrote:
> Dear all,
> 
> I have encountered a problem concerning running large jobs on an SMP cluster 
> with Open MPI 1.4.
> The application needs all-to-all communication: each process sends messages to 
> all other processes via MPI_Isend. It runs fine with 256 processes; the 
> problem occurs when the number of processes is >= 512.
> 
> The error message is:
>          mpirun -np 512 -machinefile machinefile.512 ./my_app
>          
> [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one] 
> error creating qp errno says Cannot allocate memory
>          ...
>          
> [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error 
> in endpoint reply start connect
>          
> [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one] 
> error creating qp errno says Cannot allocate memory
>          ...
>          mpirun has exited due to process rank 424 with PID 26841 on
>          node gh31 exiting without calling "finalize".
> 
> A related post (http://www.open-mpi.org/community/lists/users/2009/07/9786.php) 
> suggests it may have run out of HCA QP resources. So I checked my system 
> configuration with 'ibv_devinfo -v' and got 'max_qp: 261056'. In my case, 
> running with 256 processes would be under the limit (256^2 = 65536 < 261056), 
> but 512^2 = 262144 > 261056.
> My question is: how can I increase the max_qp number of InfiniBand, or how can 
> I get around this problem in MPI?
> 
> Thanks in advance for any help you may give!
> 
> Huiwei Lv
> PhD Student at Institute of Computing Technology
> 
> -------------------------
> p.s. The system information is provided below:
> $ ompi_info -v ompi full --parsable
> ompi:version:full:1.4
> ompi:version:svn:r22285
> ompi:version:release_date:Dec 08, 2009
> $ uname -a
> Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 
> x86_64 GNU/Linux
> $ ulimit -l
> unlimited
> $ ibv_devinfo -v
> hca_id: mlx4_0
>          transport:                      InfiniBand (0)
>          fw_ver:                         2.7.000
>          node_guid:                      00d2:c910:0001:b6c0
>          sys_image_guid:                 00d2:c910:0001:b6c3
>          vendor_id:                      0x02c9
>          vendor_part_id:                 26428
>          hw_ver:                         0xB0
>          board_id:                       MT_0D20110009
>          phys_port_cnt:                  1
>          max_mr_size:                    0xffffffffffffffff
>          page_size_cap:                  0xfffffe00
>          max_qp:                         261056
>          max_qp_wr:                      16351
>          device_cap_flags:               0x00fc9c76
>          max_sge:                        32
>          max_sge_rd:                     0
>          max_cq:                         65408
>          max_cqe:                        4194303
>          max_mr:                         524272
>          max_pd:                         32764
>          max_qp_rd_atom:                 16
>          max_ee_rd_atom:                 0
>          max_res_rd_atom:                4176896
>          max_qp_init_rd_atom:            128
>          max_ee_init_rd_atom:            0
>          atomic_cap:                     ATOMIC_HCA (1)
>          max_ee:                         0
>          max_rdd:                        0
>          max_mw:                         0
>          max_raw_ipv6_qp:                0
>          max_raw_ethy_qp:                1
>          max_mcast_grp:                  8192
>          max_mcast_qp_attach:            56
>          max_total_mcast_qp_attach:      458752
>          max_ah:                         0
>          max_fmr:                        0
>          max_srq:                        65472
>          max_srq_wr:                     16383
>          max_srq_sge:                    31
>          max_pkeys:                      128
>          local_ca_ack_delay:             15
>                  port:   1
>                          state:                  PORT_ACTIVE (4)
>                          max_mtu:                2048 (4)
>                          active_mtu:             2048 (4)
>                          sm_lid:                 86
>                          port_lid:               73
>                          port_lmc:               0x00
>                          link_layer:             IB
>                          max_msg_sz:             0x40000000
>                          port_cap_flags:         0x02510868
>                          max_vl_num:             8 (4)
>                          bad_pkey_cntr:          0x0
>                          qkey_viol_cntr:         0x0
>                          sm_sl:                  0
>                          pkey_tbl_len:           128
>                          gid_tbl_len:            128
>                          subnet_timeout:         18
>                          init_type_reply:        0
>                          active_width:           4X (2)
>                          active_speed:           10.0 Gbps (4)
>                          phys_state:             LINK_UP (5)
>          GID[  0]:               fe80:0000:0000:0000:00d2:c910:0001:b6c1
> 
> Related threads in the list:
> http://www.open-mpi.org/community/lists/users/2009/07/9786.php
> http://www.open-mpi.org/community/lists/users/2009/08/10456.php
> 
> 
> 
