I did as you said, but got an error:

node1$ export MXM_IB_PORTS=mlx4_0:1
node1$ ./mxm_perftest
Waiting for connection...
Accepted connection from 10.65.0.253
[1432576262.370195] [node153:35388:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
Failed to create endpoint: No such device

node2$ export MXM_IB_PORTS=mlx4_0:1
node2$ ./mxm_perftest node1 -t send_lat
[1432576262.367523] [node158:99366:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
Failed to create endpoint: No such device
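(A quick sanity check here, sketched with standard OFED tools: MXM_IB_PORTS takes the <device>:<port> form used above, so the named device must be visible and the named port active on every node that runs the test.)

node1$ ls /sys/class/infiniband                            # mlx4_0 should be listed (scif0 may appear too)
node1$ ibv_devinfo -d mlx4_0 | grep -e hca_id -e state     # port 1 should show PORT_ACTIVE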
Monday, May 25, 2015, 20:31 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>scif is an OFA device from Intel.
>Can you please select the device explicitly with "export MXM_IB_PORTS=mlx4_0:1" and retry?
>
>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>Hi, Mike,
>>this is what I have:
>>
>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>+ Intel compiler paths
>>
>>$ echo $OPAL_PREFIX
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>
>>I don't use LD_PRELOAD.
>>
>>In the attached file (ompi_info.out) you will find the output of the "ompi_info -l 9" command.
>>
>>P.S.
>>node1$ ./mxm_perftest
>>node2$ ./mxm_perftest node1 -t send_lat
>>[1432568685.067067] [node151:87372:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. (I don't have knem)
>>[1432568685.069699] [node151:87372:0] ib_dev.c:531  MXM  WARN  skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device (???)
>>Failed to create endpoint: No such device
>>
>>$ ibv_devinfo
>>hca_id: mlx4_0
>>        transport:          InfiniBand (0)
>>        fw_ver:             2.10.600
>>        node_guid:          0002:c903:00a1:13b0
>>        sys_image_guid:     0002:c903:00a1:13b3
>>        vendor_id:          0x02c9
>>        vendor_part_id:     4099
>>        hw_ver:             0x0
>>        board_id:           MT_1090120019
>>        phys_port_cnt:      2
>>        port:   1
>>                state:      PORT_ACTIVE (4)
>>                max_mtu:    4096 (5)
>>                active_mtu: 4096 (5)
>>                sm_lid:     1
>>                port_lid:   83
>>                port_lmc:   0x00
>>
>>        port:   2
>>                state:      PORT_DOWN (1)
>>                max_mtu:    4096 (5)
>>                active_mtu: 4096 (5)
>>                sm_lid:     0
>>                port_lid:   0
>>                port_lmc:   0x00
>>
>>Best regards,
>>Timur.
>>
>>Monday, May 25, 2015, 19:39 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>Hi Timur,
>>>It seems that the yalla component was not found in your OMPI tree.
>>>Could it be that your mpirun is not from HPC-X? Can you please check LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX to make sure they point to the right mpirun?
>>>
>>>Also, could you please check that yalla is present in the "ompi_info -l 9" output?
>>>
>>>Thanks
>>>
>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>I can password-less ssh to all nodes:
>>>>base$ ssh node1
>>>>node1$ ssh node2
>>>>Last login: Mon May 25 18:41:23
>>>>node2$ ssh node3
>>>>Last login: Mon May 25 16:25:01
>>>>node3$ ssh node4
>>>>Last login: Mon May 25 16:27:04
>>>>node4$
>>>>
>>>>Is this correct?
>>>>
>>>>With ompi-1.9 I do not have the no-tree-spawn problem.
>>>>
>>>>Monday, May 25, 2015, 9:04 -07:00, from Ralph Castain <r...@open-mpi.org>:
>>>>>I can't speak to the mxm problem, but the no-tree-spawn issue indicates that you don't have password-less ssh authorized between the compute nodes.
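(Ralph's point is worth making concrete: tree spawn needs password-less ssh between the compute nodes themselves, in every direction, not just along one chain from the login node as tested above. A non-interactive check over all pairs might look like this sketch; the hostnames follow the chain test, and BatchMode makes ssh fail instead of prompting.)

$ for src in node1 node2 node3 node4; do
>   for dst in node1 node2 node3 node4; do
>     ssh -o BatchMode=yes $src ssh -o BatchMode=yes $dst true \
>       && echo "$src -> $dst: ok" || echo "$src -> $dst: FAILED"
>   done
> done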
>>>>>
>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>Hello!
>>>>>>
>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>>>OFED-1.5.4.1;
>>>>>>CentOS release 6.2;
>>>>>>InfiniBand 4x FDR.
>>>>>>
>>>>>>I have two problems:
>>>>>>
>>>>>>1. I cannot use mxm:
>>>>>>1.a) $ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>--------------------------------------------------------------------------
>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>means that this component is either not installed or is unable to be
>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>
>>>>>>Host:      node14
>>>>>>Framework: pml
>>>>>>Component: yalla
>>>>>>--------------------------------------------------------------------------
>>>>>>*** An error occurred in MPI_Init
>>>>>>--------------------------------------------------------------------------
>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>developer):
>>>>>>
>>>>>>  mca_pml_base_open() failed
>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>--------------------------------------------------------------------------
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>*** An error occurred in MPI_Init
>>>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>*** An error occurred in MPI_Init
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>*** An error occurred in MPI_Init
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>-------------------------------------------------------
>>>>>>Primary job terminated normally, but 1 process returned
>>>>>>a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>>-------------------------------------------------------
>>>>>>--------------------------------------------------------------------------
>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>the job to be terminated. The first process to do so was:
>>>>>>
>>>>>>  Process name: [[9372,1],2]
>>>>>>  Exit code:    1
>>>>>>--------------------------------------------------------------------------
>>>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
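(Mike's suggestion above boils down to verifying that this mpirun really is the HPC-X build and that the yalla plugin is present and resolvable. A sketch, assuming the usual mca_<framework>_<component>.so layout for Open MPI components:)

$ which mpirun                                  # should live under $OPAL_PREFIX/bin
$ ompi_info -l 9 | grep -i yalla                # yalla should appear as a pml component
$ ls $OPAL_PREFIX/lib/openmpi/mca_pml_yalla.so
$ ldd $OPAL_PREFIX/lib/openmpi/mca_pml_yalla.so | grep "not found"   # unresolved libraries (e.g. libmxm) would explain "Not found" (-13)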
>>>>>>
>>>>>>1.b) $ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>--------------------------------------------------------------------------
>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>means that this component is either not installed or is unable to be
>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>
>>>>>>Host:      node5
>>>>>>Framework: pml
>>>>>>Component: yalla
>>>>>>--------------------------------------------------------------------------
>>>>>>*** An error occurred in MPI_Init
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>--------------------------------------------------------------------------
>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>developer):
>>>>>>
>>>>>>  mca_pml_base_open() failed
>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>--------------------------------------------------------------------------
>>>>>>-------------------------------------------------------
>>>>>>Primary job terminated normally, but 1 process returned
>>>>>>a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>>-------------------------------------------------------
>>>>>>*** An error occurred in MPI_Init
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>--------------------------------------------------------------------------
>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>the job to be terminated. The first process to do so was:
>>>>>>
>>>>>>  Process name: [[9619,1],0]
>>>>>>  Exit code:    1
>>>>>>--------------------------------------------------------------------------
>>>>>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
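(Once yalla itself loads, the earlier mxm_perftest log suggests a second hurdle: MXM skipping scif0 and failing with "No such device". mpirun's -x option, a standard Open MPI flag, forwards an environment variable to every rank, so the explicit port selection would travel with the job. A sketch:)

$ mpirun -x MXM_IB_PORTS=mlx4_0:1 --mca pml yalla \
>        -host node5,node14,node28,node29 -np 4 ./hello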
>>>>>>
>>>>>>2. I cannot remove "-mca plm_rsh_no_tree_spawn 1" from the mpirun command line:
>>>>>>$ mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>>>>>
>>>>>>--------------------------------------------------------------------------
>>>>>>ORTE was unable to reliably start one or more daemons.
>>>>>>This usually is caused by:
>>>>>>
>>>>>>* not finding the required libraries and/or binaries on
>>>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>
>>>>>>* lack of authority to execute on one or more specified nodes.
>>>>>>  Please verify your allocation and authorities.
>>>>>>
>>>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>>>  Please check with your sys admin to determine the correct location to use.
>>>>>>
>>>>>>* compilation of the orted with dynamic libraries when static are required
>>>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>>>  one of the contrib/platform definitions for your system type.
>>>>>>
>>>>>>* an inability to create a connection back to mpirun due to a
>>>>>>  lack of common network interfaces and/or no route found between
>>>>>>  them. Please check network connectivity (including firewalls
>>>>>>  and network routing requirements).
>>>>>>--------------------------------------------------------------------------
>>>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
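(The sh parse error shows the remote shell choking on the generated "( ... ) --tree-spawn" launch line rather than orted itself failing. Two hedged avenues, both consistent with the help text above: check which shell non-interactive logins get on the compute nodes, and let mpirun prepend its install prefix on the remote side; configuring with --enable-orterun-prefix-by-default makes the latter permanent. A sketch:)

$ ssh node5 'echo $SHELL'            # a non-POSIX login shell can mangle the launch line
$ mpirun --prefix $OPAL_PREFIX -host node5,node14,node28,node29 -np 4 ./hello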
>>>>>>
>>>>>>Thank you for your comments.
>>>>>>
>>>>>>Best regards,
>>>>>>Timur.
>>>
>>>--
>>>Kind Regards,
>>>M.
>
>--
>Kind Regards,
>M.