I can password-less ssh to all nodes:

    base$ ssh node1
    node1$ ssh node2
    Last login: Mon May 25 18:41:23
    node2$ ssh node3
    Last login: Mon May 25 16:25:01
    node3$ ssh node4
    Last login: Mon May 25 16:27:04
    node4$
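Since tree spawn launches orted over non-interactive ssh directly from one compute node to the next, I also tried checking every node-to-node pair in batch mode, so a hidden password/passphrase or host-key prompt shows up as a failure instead of a hang. A minimal sketch, assuming the node list from the mpirun line quoted below:

    #!/bin/bash
    # Pairwise, non-interactive ssh check between the compute nodes used by mpirun.
    # BatchMode=yes makes ssh fail immediately instead of prompting for a password.
    nodes="node5 node14 node28 node29"
    for src in $nodes; do
      for dst in $nodes; do
        [ "$src" = "$dst" ] && continue
        if ssh -o BatchMode=yes "$src" "ssh -o BatchMode=yes $dst true" 2>/dev/null; then
          echo "OK   $src -> $dst"
        else
          echo "FAIL $src -> $dst"
        fi
      done
    done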
Is this correct? With ompi-1.9 I do not have the no-tree-spawn problem.

Monday, May 25, 2015, 9:04 -07:00, from Ralph Castain <r...@open-mpi.org>:

> I can't speak to the mxm problem, but the no-tree-spawn issue indicates that
> you don't have password-less ssh authorized between the compute nodes
>
>> On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>
>> Hello!
>>
>> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>> OFED-1.5.4.1;
>> CentOS release 6.2;
>> infiniband 4x FDR
>>
>> I have two problems:
>>
>> 1. I cannot use mxm:
>>
>> 1.a) $ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>> --------------------------------------------------------------------------
>> A requested component was not found, or was unable to be opened. This
>> means that this component is either not installed or is unable to be
>> used on your system (e.g., sometimes this means that shared libraries
>> that the component requires are unable to be found/loaded). Note that
>> Open MPI stopped checking at the first component that it did not find.
>>
>> Host: node14
>> Framework: pml
>> Component: yalla
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   mca_pml_base_open() failed
>>   --> Returned "Not found" (-13) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> *** An error occurred in MPI_Init
>> [node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status, thus causing
>> the job to be terminated. The first process to do so was:
>>
>>   Process name: [[9372,1],2]
>>   Exit code: 1
>> --------------------------------------------------------------------------
>> [login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
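(A quick way to narrow down the "Component: yalla ... not found" report above is to check whether the yalla/mxm components were built into this install and whether they can resolve their MXM library at runtime. A rough sketch; the $HPCX_OMPI variable is mine, the prefix is the OPAL_PREFIX from the output below, and the mca_pml_yalla.so / mca_mtl_mxm.so names under lib/openmpi are assumed from the usual Open MPI plugin layout:)

    # Assumed install prefix, taken from OPAL_PREFIX in the orted command shown below.
    HPCX_OMPI=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8

    # Were the mxm/yalla components built into this install at all?
    $HPCX_OMPI/bin/ompi_info | grep -Ei 'yalla|mxm'

    # Do the plugins resolve their MXM dependency on the login node?
    ldd $HPCX_OMPI/lib/openmpi/mca_pml_yalla.so | grep -Ei 'mxm|not found'
    ldd $HPCX_OMPI/lib/openmpi/mca_mtl_mxm.so   | grep -Ei 'mxm|not found'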
>> 1.b) $ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>> --------------------------------------------------------------------------
>> A requested component was not found, or was unable to be opened. This
>> means that this component is either not installed or is unable to be
>> used on your system (e.g., sometimes this means that shared libraries
>> that the component requires are unable to be found/loaded). Note that
>> Open MPI stopped checking at the first component that it did not find.
>>
>> Host: node5
>> Framework: pml
>> Component: yalla
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   mca_pml_base_open() failed
>>   --> Returned "Not found" (-13) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status, thus causing
>> the job to be terminated. The first process to do so was:
>>
>>   Process name: [[9619,1],0]
>>   Exit code: 1
>> --------------------------------------------------------------------------
>> [login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>> [login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
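(Since the "not found" error is reported by the remote ranks while mpirun itself starts fine, it may also be worth checking what a non-interactive ssh session, the kind orted is launched under, actually sees on each compute node. A sketch under the same assumptions as above; the $HPCX_OMPI variable and the plugin path are mine:)

    # Assumed install prefix, same OPAL_PREFIX as shown in the orted command below.
    HPCX_OMPI=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8

    for n in node5 node14 node28 node29; do
      echo "== $n =="
      # Environment visible to a non-interactive shell on that node:
      ssh -o BatchMode=yes "$n" "echo LD_LIBRARY_PATH=\$LD_LIBRARY_PATH"
      # Does the yalla plugin resolve its MXM dependency there?
      ssh -o BatchMode=yes "$n" "ldd $HPCX_OMPI/lib/openmpi/mca_pml_yalla.so | grep -Ei 'mxm|not found'"
    done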
>> 2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line:
>>
>> $ mpirun -host node5,node14,node28,node29 -np 4 ./hello
>> sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>> sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location to use.
>>
>> * compilation of the orted with dynamic libraries when static are required
>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>   one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a
>>   lack of common network interfaces and/or no route found between
>>   them. Please check network connectivity (including firewalls
>>   and network routing requirements).
>> --------------------------------------------------------------------------
>> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>
>> Thank you for your comments.
>>
>> Best regards,
>> Timur.
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php