Also, which MOFED/OFED version do you have installed? MXM is compiled per OFED/MOFED version; does the mxm.rpm you selected match the active OFED?
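While you check that, it may also help to see what the installed verbs stack reports for each port. Below is a minimal Python sketch (my own illustration, not part of MXM or OFED; the helper name `check_ports` is hypothetical) that scans `ibv_devinfo` text for each port's `state` and `link_layer`. It assumes that your `ibv_devinfo` build prints a `link_layer:` line per port; the output you pasted has none, so the sketch simply reports it as missing rather than guessing.

```python
import re


def check_ports(devinfo_text):
    """Parse ibv_devinfo-style text into one dict per port.

    Each dict has hca_id, port, state and link_layer; link_layer stays
    None when the text has no link_layer line for that port.
    """
    ports = []
    hca = None
    current = None
    for raw in devinfo_text.splitlines():
        m = re.match(r"(\w+):\s*(.*)$", raw.strip())
        if not m:
            continue
        key, value = m.group(1), m.group(2).strip()
        if key == "hca_id":
            hca = value
        elif key == "port":
            # A "port: N" line opens a new per-port section.
            current = {"hca_id": hca, "port": int(value),
                       "state": None, "link_layer": None}
            ports.append(current)
        elif current is not None and key in ("state", "link_layer"):
            current[key] = value
    return ports


# Sample text abridged from the ibv_devinfo output quoted below.
sample = """\
hca_id: mlx4_0
    transport: InfiniBand (0)
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            port_lid: 6
"""

for p in check_ports(sample):
    layer = p["link_layer"] or "not reported"
    print(f"{p['hca_id']} port {p['port']}: state={p['state']}, link_layer={layer}")
```

To try it on a node, save the output of `/usr/bin/ibv_devinfo` to a file and pass its contents to `check_ports`. A port that is active but has no InfiniBand link layer reported would be worth comparing against the OFED version the MXM rpm was built for.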
On Thu, Jan 17, 2013 at 4:09 PM, Francesco Simula <francesco.sim...@roma1.infn.it> wrote:

> I tried building from OMPI 1.6.3 tarball with the following ./configure:
> ./configure --prefix=/apotto/home1/homedirs/fsimula/Lavoro/openmpi-1.6.3/install/ \
>   --disable-mpi-io \
>   --disable-io-romio \
>   --enable-dependency-tracking \
>   --without-slurm \
>   --with-platform=optimized \
>   --disable-mpi-f77 \
>   --disable-mpi-f90 \
>   --with-openib \
>   --disable-static \
>   --enable-shared \
>   --disable-vt \
>   --enable-pty-support \
>   --enable-mca-no-build=btl-ofud,pml-bfo \
>   --with-mxm=/opt/mellanox/mxm \
>   --with-mxm-libdir=/opt/mellanox/mxm/lib
>
> As you can see from the last two lines, I want to enable the MXM transport
> layer on a cluster made of SuperMicro X8DTG-D boards with dual Xeons and
> Mellanox MT26428 HCAs; the OS is CentOS 5.8.
>
> I tried with two different .rpm's for MXM, either
> 'mxm-1.1.ad085ef-1.x86_64-centos5u7.rpm' taken from here:
> http://www.mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar
>
> and 'mxm-1.5.f583875-1.x86_64-centos5u7.rpm' taken from here:
> http://www.mellanox.com/downloads/hpc/mxm/v1.5/mxm-latest.tar
>
> With both, even if the compilation concludes successfully, a simple test
> (osu_bw from the OSU Micro-Benchmarks 3.8) fails with the sort of message
> reported below; the lines:
>
> rdma_dev.c:122 MXM DEBUG Port 1 on mlx4_0 has a link layer different from IB. Skipping it
> rdma_dev.c:155 MXM ERROR An active IB port on a Mellanox device, with lid [any] gid [any] not found
>
> make it seem like it cannot access the HW for the HCA: is that so? The
> very same test works when using '-mca pml ob1' (thus using the openib BTL).
>
> I'm quite ready to start pulling my hair; any suggestions?
>
> The output of /usr/bin/ibv_devinfo for the two cluster nodes follows:
> [cut]
> hca_id: mlx4_0
>     transport: InfiniBand (0)
>     fw_ver: 2.7.000
>     node_guid: 0025:90ff:ff07:0ac4
>     sys_image_guid: 0025:90ff:ff07:0ac7
>     vendor_id: 0x02c9
>     vendor_part_id: 26428
>     hw_ver: 0xB0
>     board_id: SM_1061000001000
>     phys_port_cnt: 1
>         port: 1
>             state: PORT_ACTIVE (4)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 4
>             port_lid: 6
>             port_lmc: 0x00
> [/cut]
>
> [cut]
> hca_id: mlx4_0
>     transport: InfiniBand (0)
>     fw_ver: 2.7.000
>     node_guid: 0025:90ff:ff07:0acc
>     sys_image_guid: 0025:90ff:ff07:0acf
>     vendor_id: 0x02c9
>     vendor_part_id: 26428
>     hw_ver: 0xB0
>     board_id: SM_1061000001000
>     phys_port_cnt: 1
>         port: 1
>             state: PORT_ACTIVE (4)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 4
>             port_lid: 8
>             port_lmc: 0x00
> [/cut]
>
> The complete output of the failing test follows:
>
> [fsimula@agape5 osu-micro-benchmarks-3.8]$ mpirun -x MXM_LOG_LEVEL=poll -mca pml cm -mca mtl_mxm_np 1 -np 2 -host agape4,agape5 install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
> [1358430343.266782] [agape5:8596 :0] config_parser.c:168 MXM DEBUG
> [1358430343.266815] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_HANDLE_ERRORS=bt
> [1358430343.266826] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_GDB_PATH=/usr/bin/gdb
> [1358430343.266838] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_DUMP_SIGNO=1
> [1358430343.266851] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_DUMP_LEVEL=conn
> [1358430343.266924] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_ASYNC_MODE=THREAD
> [1358430343.266936] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_TIME_ACCURACY=0.1
> [1358430343.266956] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_PTLS=self,shm,rdma
> [1358430343.267249] [agape5:8596 :0] mpool.c:265 MXM DEBUG mpool 'ptl_self_recv_ev': allocated chunk 0xc075f40 of 96016 bytes with 1000 elements
> [1358430343.267308] [agape5:8596 :0] mpool.c:156 MXM DEBUG mpool 'ptl_self_recv_ev': align 16, maxelems 1000, elemsize 88, padding 8
> [1358430343.267316] [agape5:8596 :0] self.c:410 MXM DEBUG Created ptl_self
> [1358430343.267333] [agape5:8596 :0] shm_ptl.c:56 MXM DEBUG Created ptl_shm
> [1358430343.268457] [agape5:8596 :0] rdma_ptl.c:65 MXM TRACE Got 1 IB devices
> [1358430343.268640] [agape5:8596 :0] rdma_ptl.c:112 MXM DEBUG added device mlx4_0
> [1358430343.268665] [agape5:8596 :0] memreg.c:187 MXM TRACE Created memory registration cache on 1 devices
> [1358430343.268676] [agape5:8596 :0] rdma_ptl.c:133 MXM DEBUG Created ptl_rdma
> [1358430343.268689] [agape5:8596 :0] event.c:353 MXM FUNC mxm_event_init(event=0x2b73e0ee3038 mode=2 time_accuracy=160000000)
> [1358430343.268698] [agape5:8596 :0] timerq.c:55 MXM FUNC mxm_timerq_init(timerq=0x2b73e0ee3060 accuracy=160000000 max_interval=1600000000)
> [1358430343.268706] [agape5:8596 :0] event.c:292 MXM FUNC mxm_event_add_thread_context(thread=0x2b73e0ee30d0)
> [1358430343.268732] [agape5:8596 :0] event.c:198 MXM FUNC mxm_set_fd_nonblock(fd=10)
> [1358430343.268741] [agape5:8596 :0] event.c:198 MXM FUNC mxm_set_fd_nonblock(fd=11)
> [1358430343.268841] [agape5:8596 :0] mxm.c:162 MXM INFO context 0x2b73e0ee3010 created
> [1358430343.269090] [agape5:8596 :1] event.c:41 MXM FUNC __call_handler(handler->cb=0x2b73e0ab28a0 handler->arg=0x2b73e0ee3038)
> [1358430343.269104] [agape5:8596 :1] timerq.c:88 MXM FUNC mxm_timerq_sweep(timerq=0x2b73e0ee3060 current_time=568595527963578)
> [1358430343.274685] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_ENABLE_HUGETLB=1
> [1358430343.274700] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_ENABLE_TIMEOUTS=y
> [1358430343.274709] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_ACK_TIMEOUT=0.3
> [1358430343.274721] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_POLL_INTERVAL=0.1
> [1358430343.274742] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_WINDOW_SIZE=512
> [1358430343.274755] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_TX_BATCH=1
> [1358430343.274764] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_CQ_MODERATION=64
> [1358430343.274773] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_DRAIN_CQ=n
> [1358430343.274782] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_RNDV_THRESH=65536
> [1358430343.274791] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_ZCOPY_THRESH=2040
> [1358430343.274815] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_RESIZE_CQ=y
> [1358430343.274826] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_MTU=65536
> [1358430343.274836] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_RX_QUEUE_LEN=16000
> [1358430343.274849] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_TX_QUEUE_LEN=64
> [1358430343.274859] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_RX_MAX_BUFFERS=128000
> [1358430343.274877] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_TX_MAX_BUFFERS=8192
> [1358430343.274887] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_RX_DROP_RATE=0
> [1358430343.274896] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_ENABLE_NAK=y
> [1358430343.274904] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_RX_FILL_THRESH=0.6
> [1358430343.274915] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_UD_TX_MAX_INLINE=128
> [1358430343.274925] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_SHM_RX_MAX_BUFFERS=2000
> [1358430343.274941] [agape5:8596 :0] config_parser.c:168 MXM DEBUG default: MXM_RDMA_ALLOC=1
> [1358430343.274968] [agape5:8596 :0] ep.c:36 MXM FUNC mxm_ep_create(context=0x2b73e0ee3010)
> [1358430343.274984] [agape5:8596 :0] self.c:380 MXM DEBUG Created ptl_self EP(rank=3767085072)
> [1358430343.275028] [agape5:8596 :0] shm_queue.c:230 MXM DEBUG shm_ep=0, shmid=6815750
> [1358430343.275072] [agape5:8596 :0] mpool.c:265 MXM DEBUG mpool 'shm_ep_recv': allocated chunk 0x2aaaadd0c010 of 65824016 bytes with 2000 elements
> [1358430343.278550] [agape5:8596 :0] mpool.c:156 MXM DEBUG mpool 'shm_ep_recv': align 16, maxelems 2000, elemsize 32904, padding 8
> [1358430343.278584] [agape5:8596 :0] timerq.c:139 MXM FUNC mxm_timer_schedule(timerq=0x2b73e0ee3060 timer=0xc029538 expiration=568595550657300)
> [1358430343.278594] [agape5:8596 :0] timerq.c:43 MXM FUNC mxm_timerq_insert_timer(put timer 0xc029538 expiration 568595550657300 in slot 10)
> [1358430343.278608] [agape5:8596 :0] timerq.c:145 MXM TRACE added timer 0xc029538 expiration 568595550657300 interval 160000000
> [1358430343.278617] [agape5:8596 :0] shm_ep.c:176 MXM DEBUG Created ptl_shm EP (rank=0, ctx_id=1)
> [1358430343.278641] [agape5:8596 :0] rdma_ep.c:317 MXM FUNC mxm_rdma_ep_create()
> [1358430343.278722] [agape5:8596 :0] rdma_dev.c:194 MXM FUNC mxm_rdma_dev_init(dev=0xc0b3f00)
> [1358430343.278924] [agape5:8596 :0] rdma_dev.c:122 MXM DEBUG Port 1 on mlx4_0 has a link layer different from IB. Skipping it
> [1358430343.278939] [agape5:8596 :0] rdma_dev.c:155 MXM ERROR An active IB port on a Mellanox device, with lid [any] gid [any] not found
> [1358430343.278954] [agape5:8596 :0] timerq.c:150 MXM FUNC mxm_timer_cancel(timerq=0x2b73e0ee3060 timer=0xc029538)
> [1358430343.279454] [agape5:8596 :0] mpool.c:184 MXM DEBUG mpool 'shm_ep_recv': destroyed
> [1358430343.279466] [agape5:8596 :0] self.c:287 MXM FUNC mxm_self_ep_destroy(ep=0xc094600)
> --------------------------------------------------------------------------
> MXM was unable to create an endpoint. Please make sure that the network
> link is active on the node and the hardware is functioning.
>
> Error: No such device
>
> --------------------------------------------------------------------------
> [1358430343.287336] [agape5:8596 :0] event.c:400 MXM FUNC mxm_event_cleanup(event=0x2b73e0ee3038)
> [1358430343.287348] [agape5:8596 :0] event.c:338 MXM FUNC mxm_event_remove_thread_context(thread=0x2b73e0ee30d0)
> [1358430343.287355] [agape5:8596 :0] event.c:145 MXM FUNC mxm_event_thread_wakeup()
> [1358430343.371011] [agape5:8596 :0] timerq.c:76 MXM FUNC mxm_timerq_cleanup(timerq=0x2b73e0ee3060)
> [1358430343.371030] [agape5:8596 :0] memreg.c:194 MXM TRACE Destroying memory registration cache
> [1358430343.371129] [agape5:8596 :0] shm_ptl.c:34 MXM FUNC ptl_shm_destroy(ptl=0xc0729b0)
> [1358430343.371139] [agape5:8596 :0] self.c:340 MXM FUNC mxm_self_destroy(ptl=0xc0699a0)
> [1358430343.371148] [agape5:8596 :0] mpool.c:184 MXM DEBUG mpool 'ptl_self_recv_ev': destroyed
> [1358430343.371156] [agape5:8596 :0] mxm.c:197 MXM INFO context 0x2b73e0ee3010 destroyed
> --------------------------------------------------------------------------
> No available pml components were found!
>
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
>
> This is a fatal error; your MPI process is likely to abort. Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system. You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> --------------------------------------------------------------------------
> [agape5:08596] PML cm cannot be selected
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 8596 on
> node agape5 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> Regards,
> Francesco
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users