Dear all,

I have encountered a problem running large jobs on an SMP cluster with
Open MPI 1.4.
The application needs all-to-all communication: each process sends messages to
all other processes via MPI_Isend. It runs fine with 256 processes, but
fails when the number of processes is >= 512.

The command and the resulting error messages are:
        mpirun -np 512 -machinefile machinefile.512 ./my_app

        [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
        ...

        [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect

        [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
        ...
        mpirun has exited due to process rank 424 with PID 26841 on
        node gh31 exiting without calling "finalize".

A related post (http://www.open-mpi.org/community/lists/users/2009/07/9786.php)
suggests that the job may be running out of HCA QP resources. So I checked my
system configuration with 'ibv_devinfo -v' and got 'max_qp: 261056'. In my case,
running with 256 processes would be under the limit: 256^2 = 65536 < 261056,
but 512^2 = 262144 > 261056.
My question is: how can I increase the max_qp limit of the InfiniBand HCA, or
how can I work around this problem in MPI?
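
For reference, the estimate above can be reproduced with a short Python sketch. It assumes, as I did, that an all-to-all job needs roughly N^2 queue pairs in total and compares that figure with the max_qp value reported by 'ibv_devinfo -v'. The function name is only illustrative, and the real count may differ depending on how Open MPI creates connections per peer and per node.

```python
# Rough check of the N^2 queue-pair estimate against the HCA limit.
# Assumption: an all-to-all job needs about one QP per ordered process pair.
MAX_QP = 261056  # max_qp reported by `ibv_devinfo -v` on this system


def estimated_qps(num_procs):
    """Return the N^2 estimate of queue pairs for num_procs processes."""
    return num_procs * num_procs


for n in (256, 512):
    needed = estimated_qps(n)
    status = "under" if needed < MAX_QP else "over"
    print(f"{n} processes: ~{needed} QPs, {status} the limit of {MAX_QP}")
```

Under this estimate, 256 processes stay below the limit while 512 processes exceed it by about a thousand QPs, which matches where the failures start.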

Thanks in advance for any help you may give!

Huiwei Lv
PhD Student at Institute of Computing Technology

-------------------------
P.S. The system information is provided below:
$ ompi_info -v ompi full --parsable
ompi:version:full:1.4
ompi:version:svn:r22285
ompi:version:release_date:Dec 08, 2009
$ uname -a
Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
$ ulimit -l
unlimited
$ ibv_devinfo -v
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.000
        node_guid:                      00d2:c910:0001:b6c0
        sys_image_guid:                 00d2:c910:0001:b6c3
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       MT_0D20110009
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffe00
        max_qp:                         261056
        max_qp_wr:                      16351
        device_cap_flags:               0x00fc9c76
        max_sge:                        32
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        4194303
        max_mr:                         524272
        max_pd:                         32764
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                4176896
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                1
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        65472
        max_srq_wr:                     16383
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 86
                        port_lid:               73
                        port_lmc:               0x00
                        link_layer:             IB
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x02510868
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         18
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           10.0 Gbps (4)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:00d2:c910:0001:b6c1

Related threads in the list:
http://www.open-mpi.org/community/lists/users/2009/07/9786.php
http://www.open-mpi.org/community/lists/users/2009/08/10456.php
