Hello!

I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
OFED-1.5.4.1;
CentOS release 6.2;
InfiniBand 4x FDR.
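
For reference, the versions above can be confirmed on any node with the standard OFED and Open MPI tools (nothing below is specific to my installation):

# OS, OFED and Open MPI versions
cat /etc/redhat-release
ofed_info -s
ompi_info | grep "Open MPI:"
# InfiniBand link: 4x FDR should report 4X width and 14.0 Gbps speed
ibv_devinfo | grep -E "active_width|active_speed"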



I have two problems:

1. I cannot use MXM (the two runs below fail the same way; a diagnostic sketch follows their output):
1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node14
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[9372,1],2]
  Exit code:    1
--------------------------------------------------------------------------
[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure

1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node5
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[9619,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
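
Both runs fail the same way: the pml/yalla plugin (and presumably mtl/mxm as well) cannot be opened on the remote nodes, most likely because the plugin or the MXM libraries it links against are not found there. This is a sketch of the checks I can run, with OMPI_HOME standing for the ompi-mellanox-v1.8 directory that appears in the output of problem 2 below, and node14 used only as an example node:

OMPI_HOME=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8

# Are the components visible to Open MPI at all?
$OMPI_HOME/bin/ompi_info | grep -E "pml: yalla|mtl: mxm"

# Do the plugins resolve all their shared-library dependencies on a compute node?
ssh node14 "ldd $OMPI_HOME/lib/openmpi/mca_pml_yalla.so | grep 'not found'"
ssh node14 "ldd $OMPI_HOME/lib/openmpi/mca_mtl_mxm.so | grep 'not found'"

# Is libmxm visible in a non-interactive shell on the compute node?
ssh node14 'echo $LD_LIBRARY_PATH; ldconfig -p | grep libmxm'

If LD_LIBRARY_PATH turns out to be empty in the non-interactive shell, would adding "-x LD_LIBRARY_PATH" (or "--prefix $OMPI_HOME") to the mpirun line be the right way to fix it?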

2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line (some checks I can run are sketched after the output):
$mpirun -host node5,node14,node28,node29 -np 4 ./hello
sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
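
The "syntax error near unexpected token `--tree-spawn'" makes me suspect that the shell started by ssh on the intermediate node does not parse the wrapped orted command the way mpirun expects (for example, a non-bash login shell, or a startup file that mangles the command line). A sketch of what I plan to check, again using node14 only as an example:

# Which login shell do the compute nodes actually give me?
getent passwd $USER | cut -d: -f7
ssh node14 'echo $SHELL'

# Does a non-interactive remote shell run cleanly, without startup-file noise?
ssh node14 true; echo "exit=$?"

# Watch how the daemons are launched when tree spawn is left enabled
mpirun -mca plm_base_verbose 10 -host node5,node14,node28,node29 -np 4 ./hello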
                       
                                                                                
                         
Thank you in advance for your comments.
 
Best regards,
Timur.
 

