I'm forwarding this to the OpenFabrics general list. As came up just the other day, we know that Open MPI's uDAPL support works on Solaris, but we have done little or no testing of it on OFED (I personally know almost nothing about uDAPL).

Can the OFED uDAPL wizards shed any light on the error messages listed below? In particular, these seem worrisome:

 setup_listener Permission denied
 setup_listener Address already in use
and
 create_qp Address already in use
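
For readers matching these strings to error codes: they are the texts the C library reports for EACCES and EADDRINUSE. That suggests the DAPL provider prints strerror(errno) after a failed call, but this is an assumption and is not confirmed anywhere in this thread. A minimal Python sketch of the mapping (the helper name describe is illustrative only):

```python
import errno
import os

# Assumption: the DAPL debug lines are built from strerror(errno).
# This only shows which errno constants produce the two message texts.
def describe(code):
    """Return the symbolic name, number and C library text for an errno."""
    return f"{errno.errorcode[code]} ({code}): {os.strerror(code)}"

for code in (errno.EACCES, errno.EADDRINUSE):
    print(describe(code))
```

If the assumption holds, "Permission denied" points at a call refused for lack of privilege, and "Address already in use" at a bind or listen that collided with an existing one.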

Thanks...


On May 8, 2007, at 5:37 AM, Boris Bierbaum wrote:

Hi,

we (my colleague Andreas and I) are still trying to solve this problem. I have compiled some additional information; maybe somebody has an idea
about what's going on.

OS: Debian GNU/Linux 4.0, Kernel 2.6.18, x86, 32-Bit
IB software: OFED 1.1
SM: OpenSM from OFED 1.1
uDAPL: DAPL reference implementation version gamma 3.02 (using DAPL from OFED 1.1 doesn't change anything; I suppose it's roughly the same code)
Test program: Intel MPI Benchmarks Version 2.3
OpenMPI version: 1.2.1

Running Open MPI directly over IB verbs (mpirun --mca btl self,sm,openib
...) works. Here's the output of ibv_devinfo and ifconfig for the two
nodes on which I tried to run the benchmark (ulimit -l is unlimited on
both machines):

------------ 1st node -------------------------------

boris@pd-04:/work/boris/IMB_2.3/src$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
        fw_ver:                         1.2.0
        node_guid:                      0002:c902:0020:b528
        sys_image_guid:                 0002:c902:0020:b52b
        vendor_id:                      0x02c9
        vendor_part_id:                 25204
        hw_ver:                         0xA0
        board_id:                       MT_0230000001
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               9
                        port_lmc:               0x00

boris@pd-04:/work/boris/IMB_2.3/src$ /sbin/ifconfig

...

ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.0.14  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:b529/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:67 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:3752 (3.6 KiB)  TX bytes:968 (968.0 b)

...

------------ 2nd node -------------------------------

boris@pd-05:~$  /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
        fw_ver:                         1.2.0
        node_guid:                      0002:c902:0020:b4f4
        sys_image_guid:                 0002:c902:0020:b4f7
        vendor_id:                      0x02c9
        vendor_part_id:                 25204
        hw_ver:                         0xA0
        board_id:                       MT_0230000001
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               10
                        port_lmc:               0x00

boris@pd-05:~$ /sbin/ifconfig

...

ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.0.15  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:b4f5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:67 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:3752 (3.6 KiB)  TX bytes:1088 (1.0 KiB)


...

-------------------------------------------------------------------------


Here's the output from the failed run, with every DAT and DAPL debug
output enabled:



boris@pd-04:/work/boris/IMB_2.3/src$ mpirun -np 2 -x DAT_DBG_TYPE -x
DAPL_DBG_TYPE -x DAT_OVERRIDE --mca btl self,sm,udapl --host pd-04,pd-05
/work/boris/IMB_2.3/src/IMB-MPI1 pingpong
DAT Registry: Started (dat_init)
DAT Registry: static registry file
</home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>

DAT Registry: token
 type  string
 value <OpenIB-cma>


DAT Registry: token
 type  string
 value <u1.2>


DAT Registry: token
 type  string
 value <nonthreadsafe>


DAT Registry: token
 type  string
 value <default>


DAT Registry: token
 type  string
 value </home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>


DAT Registry: token
 type  string
 value <mv_dapl.1.2>


DAT Registry: token
 type  string
 value <ib0 0>


DAT Registry: token
 type  string
 value <>


DAT Registry: token
 type  eor
 value <>


DAT Registry: entry
 ia_name OpenIB-cma
 api_version
     type 0x0
     major.minor 1.2
 is_thread_safe 0
 is_default 1
 lib_path /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
 provider_version
     id mv_dapl
     major.minor 1.2
 ia_params ib0 0

DAT Registry: loading provider for OpenIB-cma

DAT Registry: token
 type  eof
 value <>

DAT Registry: dat_registry_list_providers () called
DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
DAT Registry: IA OpenIB-cma, trying to load library /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
DAPL: NOT Setting Loopback
 dapl_ib_init:
DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
 open_hca: ib0 - 0x807cf28
 ib_thread_init(17919)
 ib_thread_init: waiting for ib_thread
 ib_thread(17919,0xa7b08bb0): ENTER: pipe 8 ucma 12
DAT Registry: Started (dat_init)
DAT Registry: static registry file
</home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>

DAT Registry: token
 type  string
 value <OpenIB-cma>


DAT Registry: token
 type  string
 value <u1.2>


DAT Registry: token
 type  string
 value <nonthreadsafe>


DAT Registry: token
 type  string
 value <default>


DAT Registry: token
 type  string
 value </home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>


DAT Registry: token
 type  string
 value <mv_dapl.1.2>


DAT Registry: token
 type  string
 value <ib0 0>


DAT Registry: token
 type  string
 value <>


DAT Registry: token
 type  eor
 value <>


DAT Registry: entry
 ia_name OpenIB-cma
 api_version
     type 0x0
     major.minor 1.2
 is_thread_safe 0
 is_default 1
 lib_path /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
 provider_version
     id mv_dapl
     major.minor 1.2
 ia_params ib0 0

DAT Registry: loading provider for OpenIB-cma

DAT Registry: token
 type  eof
 value <>

DAT Registry: dat_registry_list_providers () called
DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
DAT Registry: IA OpenIB-cma, trying to load library /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
 ib_thread_init(17919) exit
DAPL: NOT Setting Loopback
 dapl_ib_init:
DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
 open_hca: ib0 - 0x807cf18
 ib_thread_init(12326)
 ib_thread_init: waiting for ib_thread
 ib_thread(12326,0xa7b75bb0): ENTER: pipe 8 ucma 12
 ib_thread_init(12326) exit
 getipaddr: family 2 port 0 addr 192.168.0.14
 open_hca: ctx=0x809ecd0 port=1 GID subnet fe80000000000000 id
0002c9020020b529
 open_hca: ib0, AF_INET 192.168.0.14 INLINE_MAX=128
 ib_thread(17919) poll_event:  async=0x1 pipe=0x1 cm=0x0 cq=0x0
ib_thread(17919) poll_fd: hca[134729592]=0xb, async=8 pipe=12 cm=13 cq=d
 query_hca: ib0 AF_INET  192.168.0.14
 query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
 setup_async_cb: ia 0x80a1648 type 0 hdl (nil) cb 0xa7b1ec6c ctx 0x80a16d0
 setup_async_cb: ia 0x80a1648 type 1 hdl (nil) cb 0xa7b1e9c0 ctx 0x80a16d0
 setup_async_cb: ia 0x80a1648 type 3 hdl (nil) cb 0xa7b1eb50 ctx 0x80a1648
dat_set_handle 0x80a1648 to 1
dat_get_ia_handle from 1 to 0x80a1648
 pd_alloc: pd_handle=0x80a1928
dat_get_ia_handle from 1 to 0x80a1648
 query_hca: ib0 AF_INET  192.168.0.14
 query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
dat_get_ia_handle from 1 to 0x80a1648
 cq_object_create: (0x80a1958,0x80a1a44)
dapls_ib_cq_alloc: evd 0x80a1958 cqlen=32
dapls_ib_cq_alloc: new_cq 0x80a1a68 cqlen=63
 setup_async_cb: ia 0x80a1648 type 2 hdl 0x80a1958 cb 0xa7b1f174 ctx
0x80a1958
dat_get_ia_handle from 1 to 0x80a1648
dat_get_ia_handle from 1 to 0x80a1648
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Address already in use
listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
 listen(conn=0x80a7a70 cm_id=134904736)
dat_get_ia_handle from 1 to 0x80a1648
mr_register: ia=0x80a1648, lmr=0x80a3718 va=0x80ae000 ln=266240 pv=0x0
 mr_register: mr=0x80a37c8 h 4 pd 0x80a1928 ctx 0x809ecd0
lkey=0x72002700 rkey=0x72002700 priv=41000
dat_get_ia_handle from 1 to 0x80a1648
mr_register: ia=0x80a1648, lmr=0x80a7f18 va=0x80ef000 ln=528384 pv=0x0
 mr_register: mr=0x80a7fc8 h 5 pd 0x80a1928 ctx 0x809ecd0
lkey=0xf2002800 rkey=0xf2002800 priv=81000
 getipaddr: family 2 port 0 addr 192.168.0.15
 open_hca: ctx=0x809ecc0 port=1 GID subnet fe80000000000000 id
0002c9020020b4f5
 open_hca: ib0, AF_INET 192.168.0.15 INLINE_MAX=128
 ib_thread(12326) poll_event:  async=0x1 pipe=0x1 cm=0x0 cq=0x0
ib_thread(12326) poll_fd: hca[134729576]=0xb, async=8 pipe=12 cm=13 cq=d
 query_hca: ib0 AF_INET  192.168.0.15
 query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
 setup_async_cb: ia 0x80a1638 type 0 hdl (nil) cb 0xa7b8bc6c ctx 0x80a16c0
 setup_async_cb: ia 0x80a1638 type 1 hdl (nil) cb 0xa7b8b9c0 ctx 0x80a16c0
 setup_async_cb: ia 0x80a1638 type 3 hdl (nil) cb 0xa7b8bb50 ctx 0x80a1638
dat_set_handle 0x80a1638 to 1
dat_get_ia_handle from 1 to 0x80a1638
 pd_alloc: pd_handle=0x80a1918
dat_get_ia_handle from 1 to 0x80a1638
 query_hca: ib0 AF_INET  192.168.0.15
 query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
dat_get_ia_handle from 1 to 0x80a1638
 cq_object_create: (0x80a1948,0x80a1a34)
dapls_ib_cq_alloc: evd 0x80a1948 cqlen=32
dapls_ib_cq_alloc: new_cq 0x80a1a58 cqlen=63
 setup_async_cb: ia 0x80a1638 type 2 hdl 0x80a1948 cb 0xa7b8c174 ctx
0x80a1948
dat_get_ia_handle from 1 to 0x80a1638
dat_get_ia_handle from 1 to 0x80a1638
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
listen(ia_ptr 0x80a1638 SID 1024 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
 listen(conn=0x80a7a70 cm_id=134904736)
dat_get_ia_handle from 1 to 0x80a1638
mr_register: ia=0x80a1638, lmr=0x80a3708 va=0x80ae000 ln=266240 pv=0x0
 mr_register: mr=0x80a37b8 h 1 pd 0x80a1918 ctx 0x809ecc0
lkey=0x60002400 rkey=0x60002400 priv=41000
dat_get_ia_handle from 1 to 0x80a1638
mr_register: ia=0x80a1638, lmr=0x80a7ee8 va=0x80ef000 ln=528384 pv=0x0
 mr_register: mr=0x80a7f98 h 2 pd 0x80a1918 ctx 0x809ecc0
lkey=0x60002500 rkey=0x60002500 priv=81000
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date       : Tue May  8 11:16:58 2007
# Machine    : i686
# System     : Linux
# Release    : 2.6.18
# Version    : #1 SMP Tue Nov 14 18:02:03 CET 2006

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   16777216
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
dat_get_ia_handle from 1 to 0x80a1638
 query_hca: MAX msg 2147483648 dto 16384 iov 30 rdma i4,o4
 qp_alloc: ia_ptr 0x80a1638 ep_ptr 0x81741f8 ep_ctx_ptr 0x81741f8
 create_qp Address already in use

-------------------------------------------------------------------------

The job hangs at this point. From the output of another simple test
program I assume that it hangs inside a receive operation. Of course, I
have noticed the "Permission denied" messages, but I don't think that
the problem is there. These messages seem to come from RDMA CM when
things are set up, but execution continues from there on, and I have
seen these messages on successful DAPL runs, too. I'm not very familiar
with RDMA CM, though, so I don't know the cause of these messages.
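
To separate the recurring setup_listener messages from the single create_qp failure, the debug output can be tallied per function and error text. The following is only a sketch: it assumes the mpirun output was captured as a list of lines, and the helper name tally_errors and the short sample excerpt are illustrative, not part of DAPL or Open MPI.

```python
import re
from collections import Counter

# Matches debug lines of the form "<function> <error text>", limited to the
# two error texts that appear in the output above.
ERROR_LINE = re.compile(
    r"^\s*(?P<func>\w+) (?P<msg>Permission denied|Address already in use)\s*$"
)

def tally_errors(lines):
    """Count (function, message) pairs among the given log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            counts[(match["func"], match["msg"])] += 1
    return counts

# Short excerpt in the same shape as the debug output; a real run would
# read the captured mpirun output from a file instead.
sample = [
    " setup_listener Permission denied",
    " setup_listener Permission denied",
    " setup_listener Address already in use",
    "listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id 134904736)",
    " create_qp Address already in use",
]

for (func, msg), count in sorted(tally_errors(sample).items()):
    print(f"{func}: {msg} x{count}")
```

Comparing such a tally between a hanging run and a successful DAPL run would show whether the setup_listener messages are common to both and only the create_qp line differs.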

That's a lot of information, I know, but it would be great if someone
could have a look at it.

Thanks in advance
Boris



Donald Kerr wrote:
I have not tried Open MPI uDAPL on Linux nor do I have access to a Linux
box so I am having a difficult time trying to find a way to help you
debug this issue.

-DON

Andreas Kuntze wrote:

On Linux you needn't initialise the dat registry. Your program prints: "provider 1: OpenIB-cma". I successfully tested Intel MPI and mvapich2
with uDAPL.

Andreas

Donald Kerr wrote:


Andreas,

I am going to guess that, at a minimum, the interfaces are up and you
can ping them. On Solaris there is an additional step required, and that
is initializing the dat registry. If "/usr/sbin/datadm -v" does not
show some driver output then you would need to run
"/usr/sbin/datadm -a /usr/share/dat/SUNWudaplt.conf". I don't know if
there is an equivalent on Linux.
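
On the Linux side, the debug output later in this thread shows the DAT registry reading a static dat.conf file and splitting one entry into eight tokens. A rough sketch of that split follows. The entry line is reconstructed from the logged tokens rather than copied from the real file, the space in the logged library path is assumed to be a line-wrap artifact, and the field name platform_params for the empty eighth token is my own label.

```python
import shlex
from typing import NamedTuple

class DatEntry(NamedTuple):
    ia_name: str
    api_version: str
    is_thread_safe: str
    is_default: str
    lib_path: str
    provider_version: str
    ia_params: str
    platform_params: str  # label assumed; the log shows an empty token here

def parse_dat_entry(line):
    """Split one dat.conf-style line into eight fields, honouring quotes."""
    fields = shlex.split(line, comments=True)
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    return DatEntry(*fields)

# Reconstructed from the tokens in the debug output, not the real file.
line = (
    'OpenIB-cma u1.2 nonthreadsafe default '
    '/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so '
    'mv_dapl.1.2 "ib0 0" ""'
)
entry = parse_dat_entry(line)
print(entry.ia_name, entry.lib_path, entry.ia_params)
```

This is not the registry's own parser, only a way to check by hand that an entry has the expected number of fields and the right library path.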

Attached is a simple udapl program which will check if the interfaces
are available in the dat registry.

-DON



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





--
|  _  RWTH | Boris Bierbaum
|_|_`_     | Lehrstuhl fuer Betriebssysteme
   | |_) _  | RWTH Aachen D-52056 Aachen
     |_)(_` | Tel: +49-241-80-27805
        ._) | Fax: +49-241-80-22339
<config.log.gz>
<ompi_info.out.gz>


--
Jeff Squyres
Cisco Systems

