1. mxm_perf_test - OK.
2. no_tree_spawn - OK.
3. ompi yalla and "--mca pml cm --mca mtl mxm" still do not work (I use the prebuilt ompi-1.8.5 from hpcx-v1.3.330).

3.a) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5,node153 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node153
Framework: mtl
Component: mxm
--------------------------------------------------------------------------
[node5:113560] PML cm cannot be selected
--------------------------------------------------------------------------
No available pml components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort. Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system. You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--------------------------------------------------------------------------
[node153:44440] PML cm cannot be selected
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[43917,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:none-found

3.b) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix $HPCX_MPI_DIR ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node153
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node153:43979] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[44992,1],1]
  Exit code:    1
--------------------------------------------------------------------------
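
Both runs fail on node153, while ompi_info on the login node (below) does list yalla and mxm, so it may be worth checking what node153 itself sees. A minimal sketch, not from the thread: it assumes the usual Open MPI plugin layout under $HPCX_MPI_DIR/lib/openmpi, and that the same GPFS path is mounted on node153.

# Which mpirun and libraries does a non-interactive shell on node153 pick up?
host:$ ssh node153 'which mpirun; echo $LD_LIBRARY_PATH'

# Are the yalla/mxm plugins present on node153, and does libmxm resolve there?
# ($HPCX_MPI_DIR expands on the login side, so this checks the login-side path on the remote node.)
host:$ ssh node153 "ls $HPCX_MPI_DIR/lib/openmpi/mca_pml_yalla.so $HPCX_MPI_DIR/lib/openmpi/mca_mtl_mxm.so"
host:$ ssh node153 "ldd $HPCX_MPI_DIR/lib/openmpi/mca_pml_yalla.so" | grep -E 'mxm|not found'

If the plugins are missing there, or libmxm does not resolve in the remote environment, that would explain the "component was not found" message even though the login-node ompi_info shows both components.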
host:$ echo $HPCX_MPI_DIR
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8

host:$ ompi_info | grep pml
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5)
                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
                 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5)
                 MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)

host: tests$ ompi_info | grep mtl
                 MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)

P.S. Possible error in the FAQ? (http://www.open-mpi.org/faq/?category=openfabrics#mxm)
"47. Does Open MPI support MXM? ... NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 and above ..."
But here we have (or do we not?) yalla in ompi 1.8.5.

Tuesday, 26 May 2015, 9:53 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>Hi Timur,
>
>Here it goes:
>
>wget ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz
>
>Please let me know if it works for you, and I will add the 1.5.4.1 MOFED build to the default distribution list.
>
>M
>
>On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>Thanks a lot.
>>
>>Monday, 25 May 2015, 21:28 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>
>>>Will send you the link tomorrow.
>>>
>>>On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Where can I find MXM for OFED 1.5.4.1?
>>>>
>>>>Monday, 25 May 2015, 21:11 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>
>>>>>BTW, the OFED on your system is 1.5.4.1, while the HPCX in use is built for OFED 1.5.3.
>>>>>Seems like an ABI issue between OFED versions.
>>>>>
>>>>>On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>>>I did as you said, but got an error:
>>>>>>
>>>>>>node1$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>node1$ ./mxm_perftest
>>>>>>Waiting for connection...
>>>>>>Accepted connection from 10.65.0.253
>>>>>>[1432576262.370195] [node153:35388:0]   shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>Failed to create endpoint: No such device
>>>>>>
>>>>>>node2$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>node2$ ./mxm_perftest node1 -t send_lat
>>>>>>[1432576262.367523] [node158:99366:0]   shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>Failed to create endpoint: No such device
>>>>>>
>>>>>>Monday, 25 May 2015, 20:31 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>>>>scif is an OFA device from Intel.
>>>>>>>Can you please set export MXM_IB_PORTS=mlx4_0:1 explicitly and retry?
>>>>>>>
>>>>>>>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>>>>>Hi, Mike,
>>>>>>>>that is what I have:
>>>>>>>>
>>>>>>>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>>>>>>>+ Intel compiler paths
>>>>>>>>
>>>>>>>>$ echo $OPAL_PREFIX
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>>>>>>>
>>>>>>>>I don't use LD_PRELOAD.
>>>>>>>>
>>>>>>>>In the attached file (ompi_info.out) you will find the output of the ompi_info -l 9 command.
>>>>>>>>
>>>>>>>>P.S.
>>>>>>>>node1$ ./mxm_perftest
>>>>>>>>node2$ ./mxm_perftest node1 -t send_lat
>>>>>>>>[1432568685.067067] [node151:87372:0]   shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.  (I don't have knem)
>>>>>>>>[1432568685.069699] [node151:87372:0]   ib_dev.c:531   MXM  WARN  skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device  (???)
>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>
>>>>>>>>$ ibv_devinfo
>>>>>>>>hca_id: mlx4_0
>>>>>>>>        transport:              InfiniBand (0)
>>>>>>>>        fw_ver:                 2.10.600
>>>>>>>>        node_guid:              0002:c903:00a1:13b0
>>>>>>>>        sys_image_guid:         0002:c903:00a1:13b3
>>>>>>>>        vendor_id:              0x02c9
>>>>>>>>        vendor_part_id:         4099
>>>>>>>>        hw_ver:                 0x0
>>>>>>>>        board_id:               MT_1090120019
>>>>>>>>        phys_port_cnt:          2
>>>>>>>>                port:   1
>>>>>>>>                        state:          PORT_ACTIVE (4)
>>>>>>>>                        max_mtu:        4096 (5)
>>>>>>>>                        active_mtu:     4096 (5)
>>>>>>>>                        sm_lid:         1
>>>>>>>>                        port_lid:       83
>>>>>>>>                        port_lmc:       0x00
>>>>>>>>
>>>>>>>>                port:   2
>>>>>>>>                        state:          PORT_DOWN (1)
>>>>>>>>                        max_mtu:        4096 (5)
>>>>>>>>                        active_mtu:     4096 (5)
>>>>>>>>                        sm_lid:         0
>>>>>>>>                        port_lid:       0
>>>>>>>>                        port_lmc:       0x00
>>>>>>>>
>>>>>>>>Best regards,
>>>>>>>>Timur.
>>>>>>>>
>>>>>>>>Monday, 25 May 2015, 19:39 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>>>>>>Hi Timur,
>>>>>>>>>It seems that the yalla component was not found in your OMPI tree.
>>>>>>>>>Can it be that your mpirun is not from HPC-X? Can you please check LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX and make sure they point to the right mpirun?
>>>>>>>>>
>>>>>>>>>Also, could you please check that yalla is present in the ompi_info -l 9 output?
>>>>>>>>>
>>>>>>>>>Thanks
>>>>>>>>>
>>>>>>>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>>>>>>>I can ssh password-less to all nodes:
>>>>>>>>>>base$ ssh node1
>>>>>>>>>>node1$ ssh node2
>>>>>>>>>>Last login: Mon May 25 18:41:23
>>>>>>>>>>node2$ ssh node3
>>>>>>>>>>Last login: Mon May 25 16:25:01
>>>>>>>>>>node3$ ssh node4
>>>>>>>>>>Last login: Mon May 25 16:27:04
>>>>>>>>>>node4$
>>>>>>>>>>
>>>>>>>>>>Is this correct?
>>>>>>>>>>
>>>>>>>>>>In ompi-1.9 I do not have the no-tree-spawn problem.
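
The interactive ssh chain above logs in fine, but tree spawn has a compute node run ssh non-interactively, so a batch-mode check is closer to what orted actually does. A minimal sketch, reusing node5 and node153 from the runs above; BatchMode makes ssh fail instead of prompting:

# Should print the hostname of node153 with no password or host-key prompt.
base$ ssh node5 "ssh -o BatchMode=yes node153 hostname"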
>>>>>>>>>>
>>>>>>>>>>Monday, 25 May 2015, 9:04 -07:00 from Ralph Castain < r...@open-mpi.org >:
>>>>>>>>>>
>>>>>>>>>>>I can’t speak to the mxm problem, but the no-tree-spawn issue indicates that you don’t have password-less ssh authorized between the compute nodes.
>>>>>>>>>>>
>>>>>>>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>>>>>>>>>Hello!
>>>>>>>>>>>>
>>>>>>>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>>>>>>>>>OFED-1.5.4.1;
>>>>>>>>>>>>CentOS release 6.2;
>>>>>>>>>>>>InfiniBand 4x FDR.
>>>>>>>>>>>>
>>>>>>>>>>>>I have two problems:
>>>>>>>>>>>>
>>>>>>>>>>>>1. I cannot use mxm:
>>>>>>>>>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>A requested component was not found, or was unable to be opened. This
>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>that the component requires are unable to be found/loaded). Note that
>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>
>>>>>>>>>>>>Host:      node14
>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>likely to abort. There are many reasons that a parallel process can
>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>problems. This failure appears to be an internal failure; here's some
>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>developer):
>>>>>>>>>>>>
>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>Primary job terminated normally, but 1 process returned
>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status,
>>>>>>>>>>>>thus causing the job to be terminated. The first process to do so was:
>>>>>>>>>>>>
>>>>>>>>>>>>  Process name: [[9372,1],2]
>>>>>>>>>>>>  Exit code:    1
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>>>>>>>>>>
>>>>>>>>>>>>1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>A requested component was not found, or was unable to be opened. This
>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>that the component requires are unable to be found/loaded). Note that
>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>
>>>>>>>>>>>>Host:      node5
>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>likely to abort. There are many reasons that a parallel process can
>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>problems. This failure appears to be an internal failure; here's some
>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>developer):
>>>>>>>>>>>>
>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>Primary job terminated normally, but 1 process returned
>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status,
>>>>>>>>>>>>thus causing the job to be terminated. The first process to do so was:
>>>>>>>>>>>>
>>>>>>>>>>>>  Process name: [[9619,1],0]
>>>>>>>>>>>>  Exit code:    1
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>
>>>>>>>>>>>>2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line:
>>>>>>>>>>>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>>>>>>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>>>>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>ORTE was unable to reliably start one or more daemons.
>>>>>>>>>>>>This usually is caused by:
>>>>>>>>>>>>
>>>>>>>>>>>>* not finding the required libraries and/or binaries on
>>>>>>>>>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>>>>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>>>>>>>
>>>>>>>>>>>>* lack of authority to execute on one or more specified nodes.
>>>>>>>>>>>>  Please verify your allocation and authorities.
>>>>>>>>>>>>
>>>>>>>>>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>>>>>>>>>  Please check with your sys admin to determine the correct location to use.
>>>>>>>>>>>>
>>>>>>>>>>>>* compilation of the orted with dynamic libraries when static are required
>>>>>>>>>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>>>>>>>>>  one of the contrib/platform definitions for your system type.
>>>>>>>>>>>>
>>>>>>>>>>>>* an inability to create a connection back to mpirun due to a
>>>>>>>>>>>>  lack of common network interfaces and/or no route found between
>>>>>>>>>>>>  them. Please check network connectivity (including firewalls
>>>>>>>>>>>>  and network routing requirements).
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>>>>>>>>>>>
>>>>>>>>>>>>Thank you for your comments.
>>>>>>>>>>>>
>>>>>>>>>>>>Best regards,
>>>>>>>>>>>>Timur.
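
The ORTE help text quoted above points first at PATH and LD_LIBRARY_PATH on the remote nodes. Since the daemons are launched through a non-interactive shell, one way to see what that shell actually gets is a check along these lines (a hedged sketch, not from the thread; node5 reused from the host list above):

# What does a non-interactive remote shell see?
base$ ssh node5 'which orted; echo $PATH; echo $LD_LIBRARY_PATH'

If orted is not on that PATH, running mpirun with --prefix (as in 3.a/3.b at the top of this message) or using an Open MPI built with --enable-orterun-prefix-by-default, as the help text suggests, avoids depending on the remote shell's environment.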