Hi, we (my colleague Andreas and I) are still trying to solve this problem. I have compiled some additional information; maybe somebody has an idea about what's going on.
OS: Debian GNU/Linux 4.0, Kernel 2.6.18, x86, 32-Bit
IB software: OFED 1.1
SM: OpenSM from OFED 1.1
uDAPL: DAPL reference implementation version gamma 3.02 (using DAPL from OFED 1.1 doesn't change anything; I suppose it's the same code, at least roughly)
Test program: Intel MPI Benchmarks Version 2.3
OpenMPI version: 1.2.1

Running OpenMPI directly over IB verbs (mpirun --mca btl self,sm,openib ...) works.

Here's the output of ibv_devinfo and ifconfig for the two nodes on which I tried to run the benchmark (ulimit -l is unlimited on both machines):

------------ 1st node -------------------------------

boris@pd-04:/work/boris/IMB_2.3/src$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
        fw_ver:                 1.2.0
        node_guid:              0002:c902:0020:b528
        sys_image_guid:         0002:c902:0020:b52b
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               MT_0230000001
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       9
                        port_lmc:       0x00

boris@pd-04:/work/boris/IMB_2.3/src$ /sbin/ifconfig
...
ib0       Protokoll:UNSPEC  Hardware Adresse 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet Adresse:192.168.0.14  Bcast:192.168.0.255  Maske:255.255.255.0
          inet6 Adresse: fe80::202:c902:20:b529/64 Gültigkeitsbereich:Verbindung
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:67 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:2 overruns:0 carrier:0
          Kollisionen:0 Sendewarteschlangenlänge:128
          RX bytes:3752 (3.6 KiB)  TX bytes:968 (968.0 b)
...

------------ 2nd node -------------------------------

boris@pd-05:~$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
        fw_ver:                 1.2.0
        node_guid:              0002:c902:0020:b4f4
        sys_image_guid:         0002:c902:0020:b4f7
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               MT_0230000001
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       10
                        port_lmc:       0x00

boris@pd-05:~$ /sbin/ifconfig
...
ib0       Protokoll:UNSPEC  Hardware Adresse 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet Adresse:192.168.0.15  Bcast:192.168.0.255  Maske:255.255.255.0
          inet6 Adresse: fe80::202:c902:20:b4f5/64 Gültigkeitsbereich:Verbindung
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:67 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18 errors:0 dropped:2 overruns:0 carrier:0
          Kollisionen:0 Sendewarteschlangenlänge:128
          RX bytes:3752 (3.6 KiB)  TX bytes:1088 (1.0 KiB)
...

-------------------------------------------------------------------------

Here's the output from the failed run, with every DAT and DAPL debug output enabled:

boris@pd-04:/work/boris/IMB_2.3/src$ mpirun -np 2 -x DAT_DBG_TYPE -x DAPL_DBG_TYPE -x DAT_OVERRIDE --mca btl self,sm,udapl --host pd-04,pd-05 /work/boris/IMB_2.3/src/IMB-MPI1 pingpong
DAT Registry: Started (dat_init)
DAT Registry: static registry file </home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>
DAT Registry: token type string value <OpenIB-cma>
DAT Registry: token type string value <u1.2>
DAT Registry: token type string value <nonthreadsafe>
DAT Registry: token type string value <default>
DAT Registry: token type string value </home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>
DAT Registry: token type string value <mv_dapl.1.2>
DAT Registry: token type string value <ib0 0>
DAT Registry: token type string value <>
DAT Registry: token type eor value <>
DAT Registry: entry
  ia_name OpenIB-cma
  api_version
    type 0x0
    major.minor 1.2
  is_thread_safe 0
  is_default 1
  lib_path /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
  provider_version
    id mv_dapl
    major.minor 1.2
  ia_params ib0 0
DAT Registry: loading provider for OpenIB-cma
DAT Registry: token type eof value <>
DAT Registry: dat_registry_list_providers () called
DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
DAT Registry: IA OpenIB-cma, trying to load library /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
DAPL: NOT Setting Loopback
dapl_ib_init:
DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
open_hca: ib0 - 0x807cf28
ib_thread_init(17919)
ib_thread_init: waiting for ib_thread
ib_thread(17919,0xa7b08bb0): ENTER: pipe 8 ucma 12
DAT Registry: Started (dat_init)
DAT Registry: static registry file </home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>
DAT Registry: token type string value <OpenIB-cma>
DAT Registry: token type string value <u1.2>
DAT Registry: token type string value <nonthreadsafe>
DAT Registry: token type string value <default>
DAT Registry: token type string value </home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>
DAT Registry: token type string value <mv_dapl.1.2>
DAT Registry: token type string value <ib0 0>
DAT Registry: token type string value <>
DAT Registry: token type eor value <>
DAT Registry: entry
  ia_name OpenIB-cma
  api_version
    type 0x0
    major.minor 1.2
  is_thread_safe 0
  is_default 1
  lib_path /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
  provider_version
    id mv_dapl
    major.minor 1.2
  ia_params ib0 0
DAT Registry: loading provider for OpenIB-cma
DAT Registry: token type eof value <>
DAT Registry: dat_registry_list_providers () called
DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
DAT Registry: IA OpenIB-cma, trying to load library /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
ib_thread_init(17919) exit
DAPL: NOT Setting Loopback
dapl_ib_init:
DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
open_hca: ib0 - 0x807cf18
ib_thread_init(12326)
ib_thread_init: waiting for ib_thread
ib_thread(12326,0xa7b75bb0): ENTER: pipe 8 ucma 12
ib_thread_init(12326) exit
getipaddr: family 2 port 0 addr 192.168.0.14
open_hca: ctx=0x809ecd0 port=1 GID subnet fe80000000000000 id 0002c9020020b529
open_hca: ib0, AF_INET 192.168.0.14 INLINE_MAX=128
ib_thread(17919) poll_event: async=0x1 pipe=0x1 cm=0x0 cq=0x0
ib_thread(17919) poll_fd: hca[134729592]=0xb, async=8 pipe=12 cm=13 cq=d
query_hca: ib0 AF_INET 192.168.0.14
query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
setup_async_cb: ia 0x80a1648 type 0 hdl (nil) cb 0xa7b1ec6c ctx 0x80a16d0
setup_async_cb: ia 0x80a1648 type 1 hdl (nil) cb 0xa7b1e9c0 ctx 0x80a16d0
setup_async_cb: ia 0x80a1648 type 3 hdl (nil) cb 0xa7b1eb50 ctx 0x80a1648
dat_set_handle 0x80a1648 to 1
dat_get_ia_handle from 1 to 0x80a1648
pd_alloc: pd_handle=0x80a1928
dat_get_ia_handle from 1 to 0x80a1648
query_hca: ib0 AF_INET 192.168.0.14
query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
dat_get_ia_handle from 1 to 0x80a1648
cq_object_create: (0x80a1958,0x80a1a44)
dapls_ib_cq_alloc: evd 0x80a1958 cqlen=32
dapls_ib_cq_alloc: new_cq 0x80a1a68 cqlen=63
setup_async_cb: ia 0x80a1648 type 2 hdl 0x80a1958 cb 0xa7b1f174 ctx 0x80a1958
dat_get_ia_handle from 1 to 0x80a1648
dat_get_ia_handle from 1 to 0x80a1648
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Address already in use
listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
listen(conn=0x80a7a70 cm_id=134904736)
dat_get_ia_handle from 1 to 0x80a1648
mr_register: ia=0x80a1648, lmr=0x80a3718 va=0x80ae000 ln=266240 pv=0x0
mr_register: mr=0x80a37c8 h 4 pd 0x80a1928 ctx 0x809ecd0 lkey=0x72002700 rkey=0x72002700 priv=41000
dat_get_ia_handle from 1 to 0x80a1648
mr_register: ia=0x80a1648, lmr=0x80a7f18 va=0x80ef000 ln=528384 pv=0x0
mr_register: mr=0x80a7fc8 h 5 pd 0x80a1928 ctx 0x809ecd0 lkey=0xf2002800 rkey=0xf2002800 priv=81000
getipaddr: family 2 port 0 addr 192.168.0.15
open_hca: ctx=0x809ecc0 port=1 GID subnet fe80000000000000 id 0002c9020020b4f5
open_hca: ib0, AF_INET 192.168.0.15 INLINE_MAX=128
ib_thread(12326) poll_event: async=0x1 pipe=0x1 cm=0x0 cq=0x0
ib_thread(12326) poll_fd: hca[134729576]=0xb, async=8 pipe=12 cm=13 cq=d
query_hca: ib0 AF_INET 192.168.0.15
query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
setup_async_cb: ia 0x80a1638 type 0 hdl (nil) cb 0xa7b8bc6c ctx 0x80a16c0
setup_async_cb: ia 0x80a1638 type 1 hdl (nil) cb 0xa7b8b9c0 ctx 0x80a16c0
setup_async_cb: ia 0x80a1638 type 3 hdl (nil) cb 0xa7b8bb50 ctx 0x80a1638
dat_set_handle 0x80a1638 to 1
dat_get_ia_handle from 1 to 0x80a1638
pd_alloc: pd_handle=0x80a1918
dat_get_ia_handle from 1 to 0x80a1638
query_hca: ib0 AF_INET 192.168.0.15
query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
dat_get_ia_handle from 1 to 0x80a1638
cq_object_create: (0x80a1948,0x80a1a34)
dapls_ib_cq_alloc: evd 0x80a1948 cqlen=32
dapls_ib_cq_alloc: new_cq 0x80a1a58 cqlen=63
setup_async_cb: ia 0x80a1638 type 2 hdl 0x80a1948 cb 0xa7b8c174 ctx 0x80a1948
dat_get_ia_handle from 1 to 0x80a1638
dat_get_ia_handle from 1 to 0x80a1638
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
listen(ia_ptr 0x80a1638 SID 1024 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
listen(conn=0x80a7a70 cm_id=134904736)
dat_get_ia_handle from 1 to 0x80a1638
mr_register: ia=0x80a1638, lmr=0x80a3708 va=0x80ae000 ln=266240 pv=0x0
mr_register: mr=0x80a37b8 h 1 pd 0x80a1918 ctx 0x809ecc0 lkey=0x60002400 rkey=0x60002400 priv=41000
dat_get_ia_handle from 1 to 0x80a1638
mr_register: ia=0x80a1638, lmr=0x80a7ee8 va=0x80ef000 ln=528384 pv=0x0
mr_register: mr=0x80a7f98 h 2 pd 0x80a1918 ctx 0x809ecc0 lkey=0x60002500 rkey=0x60002500 priv=81000
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date       : Tue May 8 11:16:58 2007
# Machine    : i686# System     : Linux
# Release    : 2.6.18
# Version    : #1 SMP Tue Nov 14 18:02:03 CET 2006
#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   16777216
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
dat_get_ia_handle from 1 to 0x80a1638
query_hca: MAX msg 2147483648 dto 16384 iov 30 rdma i4,o4
qp_alloc: ia_ptr 0x80a1638 ep_ptr 0x81741f8 ep_ctx_ptr 0x81741f8
create_qp Address already in use

-------------------------------------------------------------------------

The job hangs at this point.
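
Since the debug output above is long and repetitive, a small script can make it easier to scan. The following is only a minimal Python sketch, not part of Open MPI or DAPL: it collapses runs of identical consecutive lines (such as the repeated setup_listener messages) into a single line with a count. The function names condense and render are made up for illustration.

```python
import itertools


def condense(lines):
    """Collapse runs of identical consecutive lines into (text, count) pairs."""
    stripped = (line.rstrip("\n") for line in lines)
    return [(text, sum(1 for _ in run)) for text, run in itertools.groupby(stripped)]


def render(pairs):
    """Format (text, count) pairs; repeated lines get a trailing [xN] marker."""
    return [text if count == 1 else f"{text}  [x{count}]" for text, count in pairs]


def condense_file(path):
    """Read a captured log (path is whatever file the output was saved to)
    and return the condensed lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return render(condense(handle))
```

Only consecutive duplicates are merged, so the order of messages from the two processes is preserved; interleaved output from different ranks is left untouched.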
From the output of another simple test program, I assume that it hangs inside a receive operation.

Of course, I have noticed the "Permission denied" messages, but I don't think the problem lies there. These messages seem to come from RDMA CM while things are being set up, but execution continues from there, and I have seen these messages on successful DAPL runs, too. I'm not very familiar with RDMA CM, though, so I don't know the cause of these messages.

That's a lot of information, I know, but it would be great if someone could have a look at it.

Thanks in advance,
Boris

Donald Kerr wrote:
> I have not tried Open MPI uDAPL on Linux nor do I have access to a Linux
> box so I am having a difficult time trying to find a way to help you
> debug this issue.
>
> -DON
>
> Andreas Kuntze wrote:
>
>> On Linux you needn't initialise the dat registry. Your program prints:
>> "provider 1: OpenIB-cma". I successfully tested INTEL MPI and mvapich2
>> with uDAPL.
>>
>> Andreas
>>
>> Donald Kerr wrote:
>>
>>
>>> Andreas,
>>>
>>> I am going to guess at a minimum the interfaces are up and you can
>>> ping them. On Solaris there is an additional step required and that
>>> is initializing the dat registry. If "/usr/sbin/datadm -v" does not
>>> show some driver output then you would need to run "/usr/sbin/datadm
>>> -a /usr/share/dat/SUNWudaplt.conf". I don't know if there is an
>>> equivalent on Linux.
>>>
>>> Attached is a simple udapl program which will check if the interfaces
>>> are available in the dat registry.
>>>
>>> -DON
>>>
>>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>

--
 | _ RWTH    | Boris Bierbaum
 |_|_`_      | Lehrstuhl fuer Betriebssysteme
   | |_) _   | RWTH Aachen D-52056 Aachen
     |_)(_`  | Tel: +49-241-80-27805
       ._)   | Fax: +49-241-80-22339
config.log.gz
Description: GNU Zip compressed data
ompi_info.out.gz
Description: GNU Zip compressed data