I did as you said, but got an error:

node1$ export MXM_IB_PORTS=mlx4_0:1
node1$ ./mxm_perftest
Waiting for connection...
Accepted connection from 10.65.0.253
[1432576262.370195] [node153:35388:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
Failed to create endpoint: No such device

node2$ export MXM_IB_PORTS=mlx4_0:1
node2$ ./mxm_perftest node1 -t send_lat
[1432576262.367523] [node158:99366:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
Failed to create endpoint: No such device
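(A quick sanity check here, sketched with standard OFED tools: MXM_IB_PORTS takes the <device>:<port> form used above, so the named device must be visible and the named port active on every node that runs the test.)

node1$ ls /sys/class/infiniband                            # mlx4_0 should be listed (scif0 may appear too)
node1$ ibv_devinfo -d mlx4_0 | grep -e hca_id -e state     # port 1 should show PORT_ACTIVE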
Monday, May 25, 2015, 20:31 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>scif is an OFA device from Intel.
>Can you please select the device explicitly with "export MXM_IB_PORTS=mlx4_0:1" and retry?
>
>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>Hi, Mike,
>>this is what I have:
>>
>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>+ Intel compiler paths
>>
>>$ echo $OPAL_PREFIX
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>
>>I don't use LD_PRELOAD.
>>
>>In the attached file (ompi_info.out) you will find the output of the "ompi_info -l 9" command.
>>
>>P.S.
>>node1$ ./mxm_perftest
>>node2$ ./mxm_perftest node1 -t send_lat
>>[1432568685.067067] [node151:87372:0] shm.c:65  MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. (I don't have knem)
>>[1432568685.069699] [node151:87372:0] ib_dev.c:531  MXM  WARN  skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device (???)
>>Failed to create endpoint: No such device
>>
>>$ ibv_devinfo
>>hca_id: mlx4_0
>>        transport:          InfiniBand (0)
>>        fw_ver:             2.10.600
>>        node_guid:          0002:c903:00a1:13b0
>>        sys_image_guid:     0002:c903:00a1:13b3
>>        vendor_id:          0x02c9
>>        vendor_part_id:     4099
>>        hw_ver:             0x0
>>        board_id:           MT_1090120019
>>        phys_port_cnt:      2
>>        port:   1
>>                state:      PORT_ACTIVE (4)
>>                max_mtu:    4096 (5)
>>                active_mtu: 4096 (5)
>>                sm_lid:     1
>>                port_lid:   83
>>                port_lmc:   0x00
>>
>>        port:   2
>>                state:      PORT_DOWN (1)
>>                max_mtu:    4096 (5)
>>                active_mtu: 4096 (5)
>>                sm_lid:     0
>>                port_lid:   0
>>                port_lmc:   0x00
>>
>>Best regards,
>>Timur.
>>
>>Monday, May 25, 2015, 19:39 +03:00, from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>Hi Timur,
>>>It seems that the yalla component was not found in your OMPI tree.
>>>Could it be that your mpirun is not from HPC-X? Can you please check LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX to make sure they point to the right mpirun?
>>>
>>>Also, could you please check that yalla is present in the "ompi_info -l 9" output?
>>>
>>>Thanks
>>>
>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>I can password-less ssh to all nodes:
>>>>base$ ssh node1
>>>>node1$ ssh node2
>>>>Last login: Mon May 25 18:41:23
>>>>node2$ ssh node3
>>>>Last login: Mon May 25 16:25:01
>>>>node3$ ssh node4
>>>>Last login: Mon May 25 16:27:04
>>>>node4$
>>>>
>>>>Is this correct?
>>>>
>>>>With ompi-1.9 I do not have the no-tree-spawn problem.
>>>>
>>>>Monday, May 25, 2015, 9:04 -07:00, from Ralph Castain <r...@open-mpi.org>:
>>>>>I can't speak to the mxm problem, but the no-tree-spawn issue indicates that you don't have password-less ssh authorized between the compute nodes.
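(Ralph's point is worth making concrete: tree spawn needs password-less ssh between the compute nodes themselves, in every direction, not just along one chain from the login node as tested above. A non-interactive check over all pairs might look like this sketch; the hostnames follow the chain test, and BatchMode makes ssh fail instead of prompting.)

$ for src in node1 node2 node3 node4; do
>   for dst in node1 node2 node3 node4; do
>     ssh -o BatchMode=yes $src ssh -o BatchMode=yes $dst true \
>       && echo "$src -> $dst: ok" || echo "$src -> $dst: FAILED"
>   done
> done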
>>>>>
>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>Hello!
>>>>>>
>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>>>OFED-1.5.4.1;
>>>>>>CentOS release 6.2;
>>>>>>InfiniBand 4x FDR.
>>>>>>
>>>>>>I have two problems:
>>>>>>
>>>>>>1. I cannot use mxm:
>>>>>>1.a) $ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>--------------------------------------------------------------------------
>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>means that this component is either not installed or is unable to be
>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>
>>>>>>Host:      node14
>>>>>>Framework: pml
>>>>>>Component: yalla
>>>>>>--------------------------------------------------------------------------
>>>>>>*** An error occurred in MPI_Init
>>>>>>--------------------------------------------------------------------------
>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>developer):
>>>>>>
>>>>>>  mca_pml_base_open() failed
>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>--------------------------------------------------------------------------
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>*** An error occurred in MPI_Init
>>>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>*** An error occurred in MPI_Init
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>*** An error occurred in MPI_Init
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>-------------------------------------------------------
>>>>>>Primary job terminated normally, but 1 process returned
>>>>>>a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>>-------------------------------------------------------
>>>>>>--------------------------------------------------------------------------
>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>the job to be terminated. The first process to do so was:
>>>>>>
>>>>>>  Process name: [[9372,1],2]
>>>>>>  Exit code:    1
>>>>>>--------------------------------------------------------------------------
>>>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
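(Mike's suggestion above boils down to verifying that this mpirun really is the HPC-X build and that the yalla plugin is present and resolvable. A sketch, assuming the usual mca_<framework>_<component>.so layout for Open MPI components:)

$ which mpirun                                  # should live under $OPAL_PREFIX/bin
$ ompi_info -l 9 | grep -i yalla                # yalla should appear as a pml component
$ ls $OPAL_PREFIX/lib/openmpi/mca_pml_yalla.so
$ ldd $OPAL_PREFIX/lib/openmpi/mca_pml_yalla.so | grep "not found"   # unresolved libraries (e.g. libmxm) would explain "Not found" (-13)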
>>>>>>
>>>>>>1.b) $ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>--------------------------------------------------------------------------
>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>means that this component is either not installed or is unable to be
>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>that the component requires are unable to be found/loaded).  Note that
>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>
>>>>>>Host:      node5
>>>>>>Framework: pml
>>>>>>Component: yalla
>>>>>>--------------------------------------------------------------------------
>>>>>>*** An error occurred in MPI_Init
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>--------------------------------------------------------------------------
>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>likely to abort.  There are many reasons that a parallel process can
>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>problems.  This failure appears to be an internal failure; here's some
>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>developer):
>>>>>>
>>>>>>  mca_pml_base_open() failed
>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>--------------------------------------------------------------------------
>>>>>>-------------------------------------------------------
>>>>>>Primary job terminated normally, but 1 process returned
>>>>>>a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>>-------------------------------------------------------
>>>>>>*** An error occurred in MPI_Init
>>>>>>*** on a NULL communicator
>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>***    and potentially your MPI job)
>>>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>>--------------------------------------------------------------------------
>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>the job to be terminated. The first process to do so was:
>>>>>>
>>>>>>  Process name: [[9619,1],0]
>>>>>>  Exit code:    1
>>>>>>--------------------------------------------------------------------------
>>>>>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
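(Once yalla itself loads, the earlier mxm_perftest log suggests a second hurdle: MXM skipping scif0 and failing with "No such device". mpirun's -x option, a standard Open MPI flag, forwards an environment variable to every rank, so the explicit port selection would travel with the job. A sketch:)

$ mpirun -x MXM_IB_PORTS=mlx4_0:1 --mca pml yalla \
>        -host node5,node14,node28,node29 -np 4 ./hello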
>>>>>>
>>>>>>2. I cannot remove "-mca plm_rsh_no_tree_spawn 1" from the mpirun command line:
>>>>>>$ mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>>>>>
>>>>>>--------------------------------------------------------------------------
>>>>>>ORTE was unable to reliably start one or more daemons.
>>>>>>This usually is caused by:
>>>>>>
>>>>>>* not finding the required libraries and/or binaries on
>>>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>
>>>>>>* lack of authority to execute on one or more specified nodes.
>>>>>>  Please verify your allocation and authorities.
>>>>>>
>>>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>>>  Please check with your sys admin to determine the correct location to use.
>>>>>>
>>>>>>* compilation of the orted with dynamic libraries when static are required
>>>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>>>  one of the contrib/platform definitions for your system type.
>>>>>>
>>>>>>* an inability to create a connection back to mpirun due to a
>>>>>>  lack of common network interfaces and/or no route found between
>>>>>>  them. Please check network connectivity (including firewalls
>>>>>>  and network routing requirements).
>>>>>>--------------------------------------------------------------------------
>>>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
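(The sh parse error shows the remote shell choking on the generated "( ... ) --tree-spawn" launch line rather than orted itself failing. Two hedged avenues, both consistent with the help text above: check which shell non-interactive logins get on the compute nodes, and let mpirun prepend its install prefix on the remote side; configuring with --enable-orterun-prefix-by-default makes the latter permanent. A sketch:)

$ ssh node5 'echo $SHELL'            # a non-POSIX login shell can mangle the launch line
$ mpirun --prefix $OPAL_PREFIX -host node5,node14,node28,node29 -np 4 ./hello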
>>>>>>
>>>>>>Thank you for your comments.
>>>>>>
>>>>>>Best regards,
>>>>>>Timur.
>>>
>>>--
>>>Kind Regards,
>>>M.
>
>--
>Kind Regards,
>M.