I can password-less ssh to all nodes:
base$ ssh node1
node1$ ssh node2
Last login: Mon May 25 18:41:23
node2$ ssh node3
Last login: Mon May 25 16:25:01
node3$ ssh node4
Last login: Mon May 25 16:27:04
node4$

Is this correct?
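In case it is useful, here is a non-interactive way to check every pair of nodes at once (just a sketch, using the same node names as the chain above; BatchMode makes ssh fail instead of prompting for a password):

  for src in node1 node2 node3 node4; do
    for dst in node1 node2 node3 node4; do
      [ "$src" = "$dst" ] && continue
      # second hop runs "true" on $dst just to test that authentication works
      ssh -o BatchMode=yes "$src" ssh -o BatchMode=yes "$dst" true \
        && echo "$src -> $dst ok" || echo "$src -> $dst FAILED"
    done
  done

The chain above only exercises consecutive hops, so a loop like this would also catch a non-adjacent pair (e.g. node1 -> node3) that is not authorized.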

With ompi-1.9 I do not have the no-tree-spawn problem.


Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>I can’t speak to the mxm problem, but the no-tree-spawn issue indicates that 
>you don’t have password-less ssh authorized between the compute nodes
>
>
>>On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>Hello!
>>
>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>OFED-1.5.4.1;
>>CentOS release 6.2;
>>InfiniBand 4x FDR
>>
>>
>>
>>I have two problems:
>>1. I cannot use MXM:
>>1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>--------------------------------------------------------------------------
>>A requested component was not found, or was unable to be opened.  This
>>means that this component is either not installed or is unable to be
>>used on your system (e.g., sometimes this means that shared libraries
>>that the component requires are unable to be found/loaded).  Note that
>>Open MPI stopped checking at the first component that it did not find.
>>
>>Host:      node14
>>Framework: pml
>>Component: yalla
>>--------------------------------------------------------------------------
>>*** An error occurred in MPI_Init
>>--------------------------------------------------------------------------
>>It looks like MPI_INIT failed for some reason; your parallel process is
>>likely to abort.  There are many reasons that a parallel process can
>>fail during MPI_INIT; some of which are due to configuration or environment
>>problems.  This failure appears to be an internal failure; here's some
>>additional information (which may only be relevant to an Open MPI
>>developer):
>>
>>  mca_pml_base_open() failed
>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>--------------------------------------------------------------------------
>>*** on a NULL communicator
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>***    and potentially your MPI job)
>>*** An error occurred in MPI_Init
>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>*** on a NULL communicator
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>***    and potentially your MPI job)
>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>*** An error occurred in MPI_Init
>>*** on a NULL communicator
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>***    and potentially your MPI job)
>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>*** An error occurred in MPI_Init
>>*** on a NULL communicator
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>***    and potentially your MPI job)
>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>-------------------------------------------------------
>>Primary job  terminated normally, but 1 process returned
>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>-------------------------------------------------------
>>--------------------------------------------------------------------------
>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>the job to be terminated. The first process to do so was:
>>
>>  Process name: [[9372,1],2]
>>  Exit code:    1
>>--------------------------------------------------------------------------
>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
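>>
>>(A sketch of what I can check next, in case it helps: run ompi_info on the node that reports the missing component and see whether the yalla pml and the mxm mtl are listed there at all, e.g.
>>
>>  ssh node14 /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/ompi_info | grep -E "pml|mtl"
>>
>>If yalla/mxm show up when ompi_info runs on the login node but not over ssh, that would point at the compute nodes' environment (PATH/LD_LIBRARY_PATH) rather than the installation itself.)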
>>
>>1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>--------------------------------------------------------------------------
>>A requested component was not found, or was unable to be opened.  This
>>means that this component is either not installed or is unable to be
>>used on your system (e.g., sometimes this means that shared libraries
>>that the component requires are unable to be found/loaded).  Note that
>>Open MPI stopped checking at the first component that it did not find.
>>
>>Host:      node5
>>Framework: pml
>>Component: yalla
>>--------------------------------------------------------------------------
>>*** An error occurred in MPI_Init
>>*** on a NULL communicator
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>***    and potentially your MPI job)
>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>--------------------------------------------------------------------------
>>It looks like MPI_INIT failed for some reason; your parallel process is
>>likely to abort.  There are many reasons that a parallel process can
>>fail during MPI_INIT; some of which are due to configuration or environment
>>problems.  This failure appears to be an internal failure; here's some
>>additional information (which may only be relevant to an Open MPI
>>developer):
>>
>>  mca_pml_base_open() failed
>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>--------------------------------------------------------------------------
>>-------------------------------------------------------
>>Primary job  terminated normally, but 1 process returned
>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>-------------------------------------------------------
>>*** An error occurred in MPI_Init
>>*** on a NULL communicator
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>***    and potentially your MPI job)
>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>--------------------------------------------------------------------------
>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>the job to be terminated. The first process to do so was:
>>
>>  Process name: [[9619,1],0]
>>  Exit code:    1
>>--------------------------------------------------------------------------
>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
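>>
>>(Another sketch, assuming the yalla component file is named mca_pml_yalla.so under the installation's lib/openmpi directory: check from a compute node whether its shared-library dependencies, in particular the MXM ones, resolve:
>>
>>  ssh node5 'ldd /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib/openmpi/mca_pml_yalla.so | grep "not found"'
>>
>>Any "not found" lines would match the "shared libraries ... unable to be found/loaded" hint in the message above.)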
>>
>>2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line:
>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862 " -mca orte_hnp_uri "625606656.0; tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>--------------------------------------------------------------------------
>>ORTE was unable to reliably start one or more daemons.
>>This usually is caused by:
>>
>>* not finding the required libraries and/or binaries on
>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>>* lack of authority to execute on one or more specified nodes.
>>  Please verify your allocation and authorities.
>>
>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>  Please check with your sys admin to determine the correct location to use.
>>
>>*  compilation of the orted with dynamic libraries when static are required
>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>  one of the contrib/platform definitions for your system type.
>>
>>* an inability to create a connection back to mpirun due to a
>>  lack of common network interfaces and/or no route found between
>>  them. Please check network connectivity (including firewalls
>>  and network routing requirements).
>>--------------------------------------------------------------------------
>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
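>>
>>(To see the exact command that mpirun asks the intermediate daemons to run, a sketch I can try is raising the launcher's verbosity:
>>
>>  mpirun --mca plm_base_verbose 10 -host node5,node14,node28,node29 -np 4 ./hello
>>
>>That should print the full ssh/orted command line before the remote shell rejects it near `--tree-spawn'.)
>>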
>>Thank you for your comments.
>> 
>>Best regards,
>>Timur.
>> 
>>
>>
>>_______________________________________________
>>users mailing list
>>us...@open-mpi.org
>>Subscription:  http://www.open-mpi.org/mailman/listinfo.cgi/users
>>Link to this post:  
>>http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>



