1. mxm_perf_test - OK.
2. no_tree_spawn - OK.
3. ompi yalla and "--mca pml cm --mca mtl mxm" still do not work (I use the
prebuilt ompi-1.8.5 from hpcx-v1.3.330).
3.a) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
MXM_SHM_KCOPY_MODE=off -host node5,node153 --mca pml cm --mca mtl mxm --prefix
$HPCX_MPI_DIR ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node153
Framework: mtl
Component: mxm
--------------------------------------------------------------------------
[node5:113560] PML cm cannot be selected
--------------------------------------------------------------------------
No available pml components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort.  Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system.  You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--------------------------------------------------------------------------
[node153:44440] PML cm cannot be selected
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[43917,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:none-found
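To check why node153 cannot open the mtl/mxm component, something like the
following should show whether the component file is present there and whether
its MXM dependency resolves (a sketch only: the mca_mtl_mxm.so location
assumes the usual Open MPI component layout under $HPCX_MPI_DIR/lib/openmpi):

host:$ ssh node153 "ls -l $HPCX_MPI_DIR/lib/openmpi/mca_mtl_mxm.so && ldd $HPCX_MPI_DIR/lib/openmpi/mca_mtl_mxm.so | grep 'not found'"

Any 'not found' line from ldd (typically libmxm.so) would mean the component
cannot be dlopen'ed on that node, which matches the help text above.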
                            
3.b) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix
$HPCX_MPI_DIR ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node153
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node153:43979] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[44992,1],1]
  Exit code:    1
--------------------------------------------------------------------------
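Since both runs fail the same way on the remote node, one thing worth trying
(just a guess on my side, not verified here) is forwarding the library path to
the remote ranks explicitly, in case the rsh launcher does not propagate it:

host:$ $HPCX_MPI_DIR/bin/mpirun -x LD_LIBRARY_PATH -x MXM_IB_PORTS=mlx4_0:1 -x
MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix
$HPCX_MPI_DIR ./hello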
                         



host:$ echo $HPCX_MPI_DIR
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8
host:$ ompi_info | grep pml
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5)
                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
                 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5)
                 MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)
host:$ ompi_info | grep mtl
                 MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)
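The listing above is from the login node, while the "component not found"
errors come from node153. Running the same query on the compute node itself
would show whether it resolves the same components (a sketch; it assumes
ompi_info honors OPAL_PREFIX the same way mpirun does):

host:$ ssh node153 "export OPAL_PREFIX=$HPCX_MPI_DIR; export LD_LIBRARY_PATH=$LD_LIBRARY_PATH; $HPCX_MPI_DIR/bin/ompi_info | grep -E 'pml|mtl'"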

P.S.
Is there possibly an error in the FAQ?
(http://www.open-mpi.org/faq/?category=openfabrics#mxm)
47. Does Open MPI support MXM?
............
NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 and
above
...........
But here we have (or do we?) yalla in ompi 1.8.5.
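Following the hint in the help text above about the component search path,
the directory list this build actually searches can be dumped like this (the
parameter name is taken from the help message; the exact spelling may differ
between releases, so I grep loosely):

host:$ ompi_info --all | grep -i component_path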



Tuesday, May 26, 2015, 9:53 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>Hi Timur,
>
>Here it goes:
>
>wget ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz
>
>Please let me know if it works for you, and I will add the 1.5.4.1 MOFED to
>the default distribution list.
>
>M
>
>
>On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Thanks a lot.
>>
>>Monday, May 25, 2015, 21:28 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>
>>>Will send you the link tomorrow.
>>>
>>>On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>Where can I find MXM for OFED 1.5.4.1?
>>>>
>>>>
>>>>Monday, May 25, 2015, 21:11 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>
>>>>>By the way, the OFED on your system is 1.5.4.1 while the HPC-X in use is built for OFED 1.5.3;
>>>>>
>>>>>it seems like an ABI issue between OFED versions.
>>>>>
>>>>>On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>wrote:
>>>>>>I did as you said, but got an error:
>>>>>>
>>>>>>node1$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>node1$ ./mxm_perftest
>>>>>>Waiting for connection...
>>>>>>Accepted connection from 10.65.0.253
>>>>>>[1432576262.370195] [node153:35388:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>Failed to create endpoint: No such device
>>>>>>
>>>>>>node2$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>node2$ ./mxm_perftest node1 -t send_lat
>>>>>>[1432576262.367523] [node158:99366:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>Failed to create endpoint: No such device
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Monday, May 25, 2015, 20:31 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>>>>scif is an OFA device from Intel.
>>>>>>>Can you please set MXM_IB_PORTS=mlx4_0:1 explicitly (export MXM_IB_PORTS=mlx4_0:1) and retry?
>>>>>>>
>>>>>>>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>>>wrote:
>>>>>>>>Hi, Mike,
>>>>>>>>that is what I have:
>>>>>>>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>>>>>>>+ Intel compiler paths
>>>>>>>>
>>>>>>>>$ echo $OPAL_PREFIX
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>>>>>>>
>>>>>>>>I don't use LD_PRELOAD.
>>>>>>>>
>>>>>>>>In the attached file (ompi_info.out) you will find the output of the
>>>>>>>>ompi_info -l 9 command.
>>>>>>>>
>>>>>>>>P.S.
>>>>>>>>node1 $ ./mxm_perftest
>>>>>>>>node2 $ ./mxm_perftest node1 -t send_lat
>>>>>>>>[1432568685.067067] [node151:87372:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.   (I don't have knem)
>>>>>>>>[1432568685.069699] [node151:87372:0]      ib_dev.c:531  MXM  WARN  skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device   (???)
>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>
>>>>>>>>$  ibv_devinfo                                         
>>>>>>>>hca_id: mlx4_0                                                  
>>>>>>>>        transport:                      InfiniBand (0)          
>>>>>>>>        fw_ver:                         2.10.600                
>>>>>>>>        node_guid:                      0002:c903:00a1:13b0     
>>>>>>>>        sys_image_guid:                 0002:c903:00a1:13b3     
>>>>>>>>        vendor_id:                      0x02c9                  
>>>>>>>>        vendor_part_id:                 4099                    
>>>>>>>>        hw_ver:                         0x0                     
>>>>>>>>        board_id:                       MT_1090120019           
>>>>>>>>        phys_port_cnt:                  2                       
>>>>>>>>                port:   1                                       
>>>>>>>>                        state:                  PORT_ACTIVE (4) 
>>>>>>>>                        max_mtu:                4096 (5)        
>>>>>>>>                        active_mtu:             4096 (5)        
>>>>>>>>                        sm_lid:                 1               
>>>>>>>>                        port_lid:               83              
>>>>>>>>                        port_lmc:               0x00            
>>>>>>>>                                                                
>>>>>>>>                port:   2                                       
>>>>>>>>                        state:                  PORT_DOWN (1)   
>>>>>>>>                        max_mtu:                4096 (5)        
>>>>>>>>                        active_mtu:             4096 (5)        
>>>>>>>>                        sm_lid:                 0               
>>>>>>>>                        port_lid:               0               
>>>>>>>>                        port_lmc:               0x00            
>>>>>>>>
>>>>>>>>Best regards,
>>>>>>>>Timur.
>>>>>>>>
>>>>>>>>
>>>>>>>>Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>>>>>>Hi Timur,
>>>>>>>>>It seems that the yalla component was not found in your OMPI tree.
>>>>>>>>>Could it be that your mpirun is not from HPC-X? Can you please check
>>>>>>>>>LD_LIBRARY_PATH, PATH, LD_PRELOAD, and OPAL_PREFIX to verify they
>>>>>>>>>point to the right mpirun?
>>>>>>>>>
>>>>>>>>>Also, could you please check that yalla is present in the ompi_info -l 
>>>>>>>>>9 output?
>>>>>>>>>
>>>>>>>>>Thanks
>>>>>>>>>
>>>>>>>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov  < tismagi...@mail.ru 
>>>>>>>>>> wrote:
>>>>>>>>>>I can password-less ssh to all nodes:
>>>>>>>>>>base$ ssh node1
>>>>>>>>>>node1$ssh node2
>>>>>>>>>>Last login: Mon May 25 18:41:23 
>>>>>>>>>>node2$ssh node3
>>>>>>>>>>Last login: Mon May 25 16:25:01
>>>>>>>>>>node3$ssh node4
>>>>>>>>>>Last login: Mon May 25 16:27:04
>>>>>>>>>>node4$
>>>>>>>>>>
>>>>>>>>>>Is this correct?
>>>>>>>>>>
>>>>>>>>>>In ompi-1.9 I do not have the no-tree-spawn problem.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain < r...@open-mpi.org >:
>>>>>>>>>>
>>>>>>>>>>>I can’t speak to the mxm problem, but the no-tree-spawn issue 
>>>>>>>>>>>indicates that you don’t have password-less ssh authorized between 
>>>>>>>>>>>the compute nodes
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < tismagi...@mail.ru > 
>>>>>>>>>>>>wrote:
>>>>>>>>>>>>Hello!
>>>>>>>>>>>>
>>>>>>>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>>>>>>>>>OFED-1.5.4.1;
>>>>>>>>>>>>CentOS release 6.2;
>>>>>>>>>>>>infiniband 4x FDR
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>I have two problems:
>>>>>>>>>>>>1. I cannot use mxm:
>>>>>>>>>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host 
>>>>>>>>>>>>node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 
>>>>>>>>>>>>./hello
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>
>>>>>>>>>>>>Host:      node14
>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>developer):
>>>>>>>>>>>>
>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>Primary job  terminated normally, but 1 process returned
>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>>>>>>the job to be terminated. The first process to do so was:
>>>>>>>>>>>>
>>>>>>>>>>>>  Process name: [[9372,1],2]
>>>>>>>>>>>>  Exit code:    1
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>>>>>>>>>>
>>>>>>>>>>>>1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca
>>>>>>>>>>>>plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>
>>>>>>>>>>>>Host:      node5
>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>developer):
>>>>>>>>>>>>
>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>Primary job  terminated normally, but 1 process returned
>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>>>>>>the job to be terminated. The first process to do so was:
>>>>>>>>>>>>
>>>>>>>>>>>>  Process name: [[9619,1],0]
>>>>>>>>>>>>  Exit code:    1
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>
>>>>>>>>>>>>2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun cmd
>>>>>>>>>>>>line:
>>>>>>>>>>>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>>>>>>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>>>>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>ORTE was unable to reliably start one or more daemons.
>>>>>>>>>>>>This usually is caused by:
>>>>>>>>>>>>
>>>>>>>>>>>>* not finding the required libraries and/or binaries on
>>>>>>>>>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>>>>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>>>>>>>
>>>>>>>>>>>>* lack of authority to execute on one or more specified nodes.
>>>>>>>>>>>>  Please verify your allocation and authorities.
>>>>>>>>>>>>
>>>>>>>>>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>>>>>>>>>  Please check with your sys admin to determine the correct location to use.
>>>>>>>>>>>>
>>>>>>>>>>>>* compilation of the orted with dynamic libraries when static are required
>>>>>>>>>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>>>>>>>>>  one of the contrib/platform definitions for your system type.
>>>>>>>>>>>>
>>>>>>>>>>>>* an inability to create a connection back to mpirun due to a
>>>>>>>>>>>>  lack of common network interfaces and/or no route found between
>>>>>>>>>>>>  them. Please check network connectivity (including firewalls
>>>>>>>>>>>>  and network routing requirements).
>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>>>>>>>>>>>
>>>>>>>>>>>>Thank you for your comments.
>>>>>>>>>>>> 
>>>>>>>>>>>>Best regards,
>>>>>>>>>>>>Timur.
>>>>>>>>>>>> 