It does not work on a single node either:

1) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> yalla.out

2) host: $ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> cm_mxm.out

I've attached yalla.out and cm_mxm.out to this email.
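Before digging into the logs, a quick sanity check is to confirm that mpirun and the yalla/mxm plugins actually come from the same install tree. The snippet below is only a sketch: it assumes a conventional Open MPI layout ($prefix/bin/mpirun, plugins under $prefix/lib/openmpi), which may not match every HPCX build, so adjust the paths if yours differs:

```shell
# Sanity-check an Open MPI tree: is mpirun present, and are the yalla/mxm
# plugins under lib/openmpi?  (The layout is an assumption; some HPCX
# builds may place plugins elsewhere.)
check_ompi_tree() {
  prefix="$1"
  if [ -x "$prefix/bin/mpirun" ]; then
    echo "mpirun: found"
  else
    echo "mpirun: NOT FOUND under $prefix/bin"
  fi
  for comp in mca_pml_yalla mca_mtl_mxm; do
    if [ -e "$prefix/lib/openmpi/$comp.so" ]; then
      echo "$comp: found"
    else
      echo "$comp: NOT FOUND"
    fi
  done
}

# Only run the check when the HPCX prefix is actually set.
[ -n "$HPCX_MPI_DIR" ] && check_ompi_tree "$HPCX_MPI_DIR" || true
```

If a plugin file is present but still fails to load at run time, running ldd on the .so (on the compute node, with the same environment) usually shows which shared-library dependency is missing.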



Tuesday, May 26, 2015, 11:54 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>does it work from single node?
>could you please run with opts below and attach output?
>
> -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca rml_base_verbose 10 
>--debug-daemons
>
>On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov  < tismagi...@mail.ru > 
>wrote:
>>1. mxm_perf_test - OK.
>>2. no_tree_spawn - OK.
>>3. ompi yalla and "--mca pml cm --mca mtl mxm" still do not work (I use the prebuilt ompi-1.8.5 from hpcx-v1.3.330)
>>3.a) host:$  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>MXM_SHM_KCOPY_MODE=off -host node5,node153  --mca pml cm --mca mtl mxm 
>>--prefix $HPCX_MPI_DIR ./hello
>>--------------------------------------------------------------------------
>>A requested component was not found, or was unable to be opened.  This
>>means that this component is either not installed or is unable to be
>>used on your system (e.g., sometimes this means that shared libraries
>>that the component requires are unable to be found/loaded).  Note that
>>Open MPI stopped checking at the first component that it did not find.
>>
>>Host:      node153
>>Framework: mtl
>>Component: mxm
>>--------------------------------------------------------------------------
>>[node5:113560] PML cm cannot be selected
>>--------------------------------------------------------------------------
>>No available pml components were found!
>>
>>This means that there are no components of this type installed on your
>>system or all the components reported that they could not be used.
>>
>>This is a fatal error; your MPI process is likely to abort.  Check the
>>output of the "ompi_info" command and ensure that components of this
>>type are available on your system.  You may also wish to check the
>>value of the "component_path" MCA parameter and ensure that it has at
>>least one directory that contains valid MCA components.
>>--------------------------------------------------------------------------
>>[node153:44440] PML cm cannot be selected
>>-------------------------------------------------------
>>Primary job  terminated normally, but 1 process returned
>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>-------------------------------------------------------
>>--------------------------------------------------------------------------
>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>the job to be terminated. The first process to do so was:
>>
>>  Process name: [[43917,1],0]
>>  Exit code:    1
>>--------------------------------------------------------------------------
>>[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>[login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:none-found
>>
>>3.b) host:$  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix 
>>$HPCX_MPI_DIR ./hello
>>--------------------------------------------------------------------------
>>A requested component was not found, or was unable to be opened.  This
>>means that this component is either not installed or is unable to be
>>used on your system (e.g., sometimes this means that shared libraries
>>that the component requires are unable to be found/loaded).  Note that
>>Open MPI stopped checking at the first component that it did not find.
>>
>>Host:      node153
>>Framework: pml
>>Component: yalla
>>--------------------------------------------------------------------------
>>*** An error occurred in MPI_Init
>>--------------------------------------------------------------------------
>>It looks like MPI_INIT failed for some reason; your parallel process is
>>likely to abort.  There are many reasons that a parallel process can
>>fail during MPI_INIT; some of which are due to configuration or environment
>>problems.  This failure appears to be an internal failure; here's some
>>additional information (which may only be relevant to an Open MPI
>>developer):
>>
>>  mca_pml_base_open() failed
>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>--------------------------------------------------------------------------
>>*** on a NULL communicator
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>***    and potentially your MPI job)
>>[node153:43979] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>-------------------------------------------------------
>>Primary job  terminated normally, but 1 process returned
>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>-------------------------------------------------------
>>--------------------------------------------------------------------------
>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>the job to be terminated. The first process to do so was:
>>
>>  Process name: [[44992,1],1]
>>  Exit code:    1
>>--------------------------------------------------------------------------
>>
>>host:$ echo $HPCX_MPI_DIR
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8
>>host:$ ompi_info | grep pml
>>                 MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5)
>>                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
>>                 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5)
>>                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5)
>>                 MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)
>>host: tests$ ompi_info | grep mtl
>>                 MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)
>>
>>P.S.
>>Possible error in the FAQ? ( http://www.open-mpi.org/faq/?category=openfabrics#mxm )
>>47. Does Open MPI support MXM?
>>............
>>NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 and above
>>...........
>>But here yalla is present (or at least listed) in ompi 1.8.5
>>
>>
>>
>>Tuesday, May 26, 2015, 9:53 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>Hi Timur,
>>>
>>>Here it goes:
>>>
>>>wget  
>>>ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz
>>>
>>>Please let me know if it works for you, and I will add the 1.5.4.1 MOFED to the default distribution list.
>>>
>>>M
>>>
>>>
>>>On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>Thanks a lot.
>>>>
>>>>Monday, May 25, 2015, 21:28 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>
>>>>>Will send you the link tomorrow.
>>>>>
>>>>>On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>wrote:
>>>>>>Where can I find MXM for OFED 1.5.4.1?
>>>>>>
>>>>>>
>>>>>>Monday, May 25, 2015, 21:11 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>>>
>>>>>>>BTW, the OFED on your system is 1.5.4.1, while the HPCX in use is built for OFED 1.5.3.
>>>>>>>
>>>>>>>This seems like an ABI issue between the OFED versions.
>>>>>>>
>>>>>>>On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>>>wrote:
>>>>>>>>I did as you said, but got an error:
>>>>>>>>
>>>>>>>>node1$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>>>node1$ ./mxm_perftest
>>>>>>>>Waiting for connection...
>>>>>>>>Accepted connection from 10.65.0.253
>>>>>>>>[1432576262.370195] [node153:35388:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>
>>>>>>>>node2$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>>>node2$ ./mxm_perftest node1 -t send_lat
>>>>>>>>[1432576262.367523] [node158:99366:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>Monday, May 25, 2015, 20:31 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>>>>>>scif is an OFA device from Intel.
>>>>>>>>>Can you please select mlx4_0:1 explicitly (export MXM_IB_PORTS=mlx4_0:1) and retry?
>>>>>>>>>
>>>>>>>>>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov  < tismagi...@mail.ru 
>>>>>>>>>> wrote:
>>>>>>>>>>Hi Mike,
>>>>>>>>>>that is what I have:
>>>>>>>>>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>>>>>>>>>+ Intel compiler paths
>>>>>>>>>>
>>>>>>>>>>$ echo $OPAL_PREFIX
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>>>>>>>>>
>>>>>>>>>>I don't use LD_PRELOAD.
>>>>>>>>>>
>>>>>>>>>>In the attached file (ompi_info.out) you will find the output of the "ompi_info -l 9" command.
>>>>>>>>>>
>>>>>>>>>>P.S.
>>>>>>>>>>node1 $ ./mxm_perftest
>>>>>>>>>>node2 $ ./mxm_perftest node1 -t send_lat
>>>>>>>>>>[1432568685.067067] [node151:87372:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.          (I don't have knem)
>>>>>>>>>>[1432568685.069699] [node151:87372:0]      ib_dev.c:531  MXM  WARN  skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device                                (???)
>>>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>>>
>>>>>>>>>>$  ibv_devinfo                                         
>>>>>>>>>>hca_id: mlx4_0                                                  
>>>>>>>>>>        transport:                      InfiniBand (0)          
>>>>>>>>>>        fw_ver:                         2.10.600                
>>>>>>>>>>        node_guid:                      0002:c903:00a1:13b0     
>>>>>>>>>>        sys_image_guid:                 0002:c903:00a1:13b3     
>>>>>>>>>>        vendor_id:                      0x02c9                  
>>>>>>>>>>        vendor_part_id:                 4099                    
>>>>>>>>>>        hw_ver:                         0x0                     
>>>>>>>>>>        board_id:                       MT_1090120019           
>>>>>>>>>>        phys_port_cnt:                  2                       
>>>>>>>>>>                port:   1                                       
>>>>>>>>>>                        state:                  PORT_ACTIVE (4) 
>>>>>>>>>>                        max_mtu:                4096 (5)        
>>>>>>>>>>                        active_mtu:             4096 (5)        
>>>>>>>>>>                        sm_lid:                 1               
>>>>>>>>>>                        port_lid:               83              
>>>>>>>>>>                        port_lmc:               0x00            
>>>>>>>>>>                                                                
>>>>>>>>>>                port:   2                                       
>>>>>>>>>>                        state:                  PORT_DOWN (1)   
>>>>>>>>>>                        max_mtu:                4096 (5)        
>>>>>>>>>>                        active_mtu:             4096 (5)        
>>>>>>>>>>                        sm_lid:                 0               
>>>>>>>>>>                        port_lid:               0               
>>>>>>>>>>                        port_lmc:               0x00            
>>>>>>>>>>
>>>>>>>>>>Best regards,
>>>>>>>>>>Timur.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>>>>>>>>>Hi Timur,
>>>>>>>>>>>It seems that the yalla component was not found in your OMPI tree.
>>>>>>>>>>>Can it be that your mpirun is not from HPCX? Can you please check LD_LIBRARY_PATH, PATH, LD_PRELOAD, and OPAL_PREFIX to make sure they point to the right mpirun?
>>>>>>>>>>>
>>>>>>>>>>>Also, could you please check that yalla is present in the ompi_info 
>>>>>>>>>>>-l 9 output?
>>>>>>>>>>>
>>>>>>>>>>>Thanks
>>>>>>>>>>>
>>>>>>>>>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov  < 
>>>>>>>>>>>tismagi...@mail.ru > wrote:
>>>>>>>>>>>>I can password-less ssh to all nodes:
>>>>>>>>>>>>base$ ssh node1
>>>>>>>>>>>>node1$ssh node2
>>>>>>>>>>>>Last login: Mon May 25 18:41:23 
>>>>>>>>>>>>node2$ssh node3
>>>>>>>>>>>>Last login: Mon May 25 16:25:01
>>>>>>>>>>>>node3$ssh node4
>>>>>>>>>>>>Last login: Mon May 25 16:27:04
>>>>>>>>>>>>node4$
>>>>>>>>>>>>
>>>>>>>>>>>>Is this correct?
>>>>>>>>>>>>
>>>>>>>>>>>>With ompi-1.9 I do not have the no-tree-spawn problem.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain < r...@open-mpi.org >:
>>>>>>>>>>>>
>>>>>>>>>>>>>I can’t speak to the mxm problem, but the no-tree-spawn issue 
>>>>>>>>>>>>>indicates that you don’t have password-less ssh authorized between 
>>>>>>>>>>>>>the compute nodes
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < tismagi...@mail.ru 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>Hello!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>>>>>>>>>>>OFED-1.5.4.1;
>>>>>>>>>>>>>>CentOS release 6.2;
>>>>>>>>>>>>>>infiniband 4x FDR
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>I have two problems:
>>>>>>>>>>>>>>1. I cannot use mxm:
>>>>>>>>>>>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host 
>>>>>>>>>>>>>>node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 
>>>>>>>>>>>>>>./hello
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Host:      node14
>>>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>>>developer):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>Primary job  terminated normally, but 1 process returned
>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>>>>>>>>the job to be terminated. The first process to do so was:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Process name: [[9372,1],2]
>>>>>>>>>>>>>>  Exit code:    1
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>1.b $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca 
>>>>>>>>>>>>>>plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Host:      node5
>>>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>>>developer):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>Primary job  terminated normally, but 1 process returned
>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>***    and potentially your MPI job)
>>>>>>>>>>>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>>>>>>>>the job to be terminated. The first process to do so was:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Process name: [[9619,1],0]
>>>>>>>>>>>>>>  Exit code:    1
>>>>>>>>>>>>>>                                      
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>                              
>>>>>>>>>>>>>>[login:08552] 1 more process has sent help message 
>>>>>>>>>>>>>>help-mca-base.txt / find-available:not-valid         
>>>>>>>>>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 
>>>>>>>>>>>>>>to see all help / error messages        
>>>>>>>>>>>>>>
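The "Not found" (-13) from mca_pml_base_open() above usually means the requested pml/mtl component is not present in the Open MPI installation that mpirun actually resolved to. A rough way to check (a sketch, not verified on this cluster: it assumes the $HPCX_MPI_DIR convention from the commands in this thread and the standard mca_<framework>_<component>.so plugin layout):

```shell
# List the pml and mtl components this build actually installed
$HPCX_MPI_DIR/bin/ompi_info --parsable | grep -E ':(pml|mtl):'

# The corresponding plugins should also exist on every compute node, e.g.:
ls $HPCX_MPI_DIR/lib/openmpi/mca_pml_yalla.so \
   $HPCX_MPI_DIR/lib/openmpi/mca_mtl_mxm.so
```

If yalla/mxm are missing from the list on the remote nodes but present on the login node, the remote processes are picking up a different (non-HPC-X) Open MPI via PATH/LD_LIBRARY_PATH.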
>>>>>>>>>>>>>>2. I cannot remove "-mca plm_rsh_no_tree_spawn 1" from the mpirun command line:
>>>>>>>>>>>>>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>>>>>>>>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>>>>>>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
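The syntax error above suggests the remote shell mangled the quoted orted launch line while relaying it during tree spawn. Until that is fixed, tree spawn can be disabled explicitly, so every daemon is launched directly from the node running mpirun (a sketch of the workaround already used earlier in this thread; node list and ./hello are the ones from the failing command):

```shell
# Disable tree-based daemon spawn: mpirun contacts each node itself
# instead of having intermediate orted daemons relay the launch
$HPCX_MPI_DIR/bin/mpirun -mca plm_rsh_no_tree_spawn 1 \
    -host node5,node14,node28,node29 -np 4 ./hello
```

This is slower at scale but removes the second-hop rsh/ssh invocation where the quoting is being broken.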
>>>>>>>>>>>>>>                                       
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>ORTE was unable to reliably start one or more daemons.
>>>>>>>>>>>>>>This usually is caused by:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* not finding the required libraries and/or binaries on
>>>>>>>>>>>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>>>>>>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* lack of authority to execute on one or more specified nodes.
>>>>>>>>>>>>>>  Please verify your allocation and authorities.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>>>>>>>>>>>  Please check with your sys admin to determine the correct location to use.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* compilation of the orted with dynamic libraries when static are required
>>>>>>>>>>>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>>>>>>>>>>>  one of the contrib/platform definitions for your system type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* an inability to create a connection back to mpirun due to a
>>>>>>>>>>>>>>  lack of common network interfaces and/or no route found between
>>>>>>>>>>>>>>  them. Please check network connectivity (including firewalls
>>>>>>>>>>>>>>  and network routing requirements).
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
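Of the causes listed in that help text, a stale PATH/LD_LIBRARY_PATH on the remote nodes is the easiest to rule out first. A rough check, assuming the same passwordless ssh access that the rsh launcher itself uses (node names as in this thread):

```shell
# Verify that a NON-interactive remote shell can find orted on each node;
# mpirun launches its daemons through exactly this kind of shell, which
# may not source the same startup files as an interactive login.
for n in node5 node14 node28 node29; do
    ssh "$n" 'echo "$(hostname): $(command -v orted || echo orted NOT found)"'
done
```

If orted resolves to a path outside the HPC-X tree (or is not found) on any node, that node will fail exactly as shown above; `--prefix $HPCX_MPI_DIR` or `--enable-orterun-prefix-by-default` addresses that.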
>>>>>>>>>>>>>>                                                                  
>>>>>>>>>>>>>>                                       
>>>>>>>>>>>>>>Thank you for your comments.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>Best regards,
>>>>>>>>>>>>>>Timur.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>>users mailing list
>>>>>>>>>>>>>>us...@open-mpi.org
>>>>>>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>>Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>users mailing list
>>>>>>>>>>>>us...@open-mpi.org
>>>>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26922.php
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>-- 
>>>>>>>>>>>
>>>>>>>>>>>Kind Regards,
>>>>>>>>>>>
>>>>>>>>>>>M.



Attachment: cm_mxm.out
Description: Binary data

Attachment: yalla.out
Description: Binary data
