It does not work on a single node either:

1) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> yalla.out

2) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 ./hello &> cm_mxm.out

I've attached yalla.out and cm_mxm.out to this email.
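For reference, the ./hello used throughout this thread is a plain MPI hello-world. Below is a minimal sketch of an equivalent program and of how it might be built against the HPCX Open MPI; the actual source of hello is not shown anywhere in this thread, so this stand-in is an assumption:

# hypothetical stand-in for the ./hello binary used in this thread
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);   /* the pml/mtl selection failures below happen here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
$HPCX_MPI_DIR/bin/mpicc -o hello hello.c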
Tuesday, May 26, 2015, 11:54 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>does it work from a single node?
>could you please run with the opts below and attach the output?
>
> -mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons
>
>On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>1. mxm_perf_test - OK.
>>2. no_tree_spawn - OK.
>>3. ompi yalla and "--mca pml cm --mca mtl mxm" still do not work (I use the prebuilt ompi-1.8.5 from hpcx-v1.3.330).
>>
>>3.a) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5,node153 --mca pml cm --mca mtl mxm --prefix $HPCX_MPI_DIR ./hello
>>--------------------------------------------------------------------------
>>A requested component was not found, or was unable to be opened. This
>>means that this component is either not installed or is unable to be
>>used on your system (e.g., sometimes this means that shared libraries
>>that the component requires are unable to be found/loaded). Note that
>>Open MPI stopped checking at the first component that it did not find.
>>
>>Host: node153
>>Framework: mtl
>>Component: mxm
>>--------------------------------------------------------------------------
>>[node5:113560] PML cm cannot be selected
>>--------------------------------------------------------------------------
>>No available pml components were found!
>>
>>This means that there are no components of this type installed on your
>>system or all the components reported that they could not be used.
>>
>>This is a fatal error; your MPI process is likely to abort. Check the
>>output of the "ompi_info" command and ensure that components of this
>>type are available on your system. You may also wish to check the
>>value of the "component_path" MCA parameter and ensure that it has at
>>least one directory that contains valid MCA components.
>>--------------------------------------------------------------------------
>>[node153:44440] PML cm cannot be selected
>>-------------------------------------------------------
>>Primary job terminated normally, but 1 process returned
>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>-------------------------------------------------------
>>--------------------------------------------------------------------------
>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>the job to be terminated. The first process to do so was:
>>
>>  Process name: [[43917,1],0]
>>  Exit code: 1
>>--------------------------------------------------------------------------
>>[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>[login:110455] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>[login:110455] 1 more process has sent help message help-mca-base.txt / find-available:none-found
>>
>>3.b) host:$ $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off -host node5,node153 -mca pml yalla --prefix $HPCX_MPI_DIR ./hello
>>--------------------------------------------------------------------------
>>A requested component was not found, or was unable to be opened. This
>>means that this component is either not installed or is unable to be
>>used on your system (e.g., sometimes this means that shared libraries
>>that the component requires are unable to be found/loaded). Note that
>>Open MPI stopped checking at the first component that it did not find.
>>
>>Host: node153
>>Framework: pml
>>Component: yalla
>>--------------------------------------------------------------------------
>>*** An error occurred in MPI_Init
>>--------------------------------------------------------------------------
>>It looks like MPI_INIT failed for some reason; your parallel process is
>>likely to abort. There are many reasons that a parallel process can
>>fail during MPI_INIT; some of which are due to configuration or environment
>>problems. This failure appears to be an internal failure; here's some
>>additional information (which may only be relevant to an Open MPI
>>developer):
>>
>>  mca_pml_base_open() failed
>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>--------------------------------------------------------------------------
>>*** on a NULL communicator
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>*** and potentially your MPI job)
>>[node153:43979] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
>>and not able to guarantee that all other processes were killed!
>>-------------------------------------------------------
>>Primary job terminated normally, but 1 process returned
>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>-------------------------------------------------------
>>--------------------------------------------------------------------------
>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>the job to be terminated. The first process to do so was:
>>
>>  Process name: [[44992,1],1]
>>  Exit code: 1
>>--------------------------------------------------------------------------
>>
>>host:$ echo $HPCX_MPI_DIR
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8
>>
>>host:$ ompi_info | grep pml
>>  MCA pml: v (MCA v2.0, API v2.0, Component v1.8.5)
>>  MCA pml: cm (MCA v2.0, API v2.0, Component v1.8.5)
>>  MCA pml: bfo (MCA v2.0, API v2.0, Component v1.8.5)
>>  MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.8.5)
>>  MCA pml: yalla (MCA v2.0, API v2.0, Component v1.8.5)
>>
>>host: tests$ ompi_info | grep mtl
>>  MCA mtl: mxm (MCA v2.0, API v2.0, Component v1.8.5)
>>
>>P.S.
>>Is there possibly an error in the FAQ (http://www.open-mpi.org/faq/?category=openfabrics#mxm)?
>>"47. Does Open MPI support MXM?
>>...
>>NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 and above
>>..."
>>But here we do have yalla in ompi 1.8.5 (or do we?).
>>
>>Tuesday, May 26, 2015, 9:53 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>Hi Timur,
>>>
>>>Here it goes:
>>>
>>>wget ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz
>>>
>>>Please let me know if it works for you, and I will add the 1.5.4.1 MOFED build to the default distribution list.
>>>
>>>M
>>>
>>>On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>Thanks a lot.
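Since the help text above blames a component that "was not found, or was unable to be opened", one quick check is whether the yalla/mxm plugin files exist in the HPCX tree and whether their shared-library dependencies resolve on the failing node. A sketch, assuming the standard Open MPI plugin layout under $HPCX_MPI_DIR/lib/openmpi:

# run on the failing node (node153), with the same environment the job sees
ls -l $HPCX_MPI_DIR/lib/openmpi/mca_pml_yalla.so $HPCX_MPI_DIR/lib/openmpi/mca_mtl_mxm.so
# a libmxm.so built against a mismatched OFED shows up here as "not found"
ldd $HPCX_MPI_DIR/lib/openmpi/mca_mtl_mxm.so | grep "not found"
ldd $HPCX_MPI_DIR/lib/openmpi/mca_pml_yalla.so | grep "not found"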
>>>>
>>>>Monday, May 25, 2015, 21:28 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>will send you the link tomorrow.
>>>>>
>>>>>On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>Where can I find MXM for OFED 1.5.4.1?
>>>>>>
>>>>>>Monday, May 25, 2015, 21:11 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>btw, the OFED on your system is 1.5.4.1 while the HPCX in use is built for OFED 1.5.3.
>>>>>>>seems like an ABI issue between OFED versions
>>>>>>>
>>>>>>>On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>I did as you said, but got an error:
>>>>>>>>
>>>>>>>>node1$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>>>node1$ ./mxm_perftest
>>>>>>>>Waiting for connection...
>>>>>>>>Accepted connection from 10.65.0.253
>>>>>>>>[1432576262.370195] [node153:35388:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>
>>>>>>>>node2$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>>>node2$ ./mxm_perftest node1 -t send_lat
>>>>>>>>[1432576262.367523] [node158:99366:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>
>>>>>>>>Monday, May 25, 2015, 20:31 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>>scif is an OFA device from Intel.
>>>>>>>>>can you please select export MXM_IB_PORTS=mlx4_0:1 explicitly and retry
>>>>>>>>>
>>>>>>>>>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>>>Hi, Mike,
>>>>>>>>>>that is what I have:
>>>>>>>>>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>>>>>>>>>+ the Intel compiler paths
>>>>>>>>>>
>>>>>>>>>>$ echo $OPAL_PREFIX
>>>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>>>>>>>>>
>>>>>>>>>>I don't use LD_PRELOAD.
>>>>>>>>>>
>>>>>>>>>>In the attached file (ompi_info.out) you will find the output of the "ompi_info -l 9" command.
>>>>>>>>>>
>>>>>>>>>>P.S.
>>>>>>>>>>node1 $ ./mxm_perftest
>>>>>>>>>>node2 $ ./mxm_perftest node1 -t send_lat
>>>>>>>>>>[1432568685.067067] [node151:87372:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. (I don't have knem)
>>>>>>>>>>[1432568685.069699] [node151:87372:0] ib_dev.c:531 MXM WARN skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device (???)
>>>>>>>>>>Failed to create endpoint: No such device
>>>>>>>>>>
>>>>>>>>>>$ ibv_devinfo
>>>>>>>>>>hca_id: mlx4_0
>>>>>>>>>>        transport:        InfiniBand (0)
>>>>>>>>>>        fw_ver:           2.10.600
>>>>>>>>>>        node_guid:        0002:c903:00a1:13b0
>>>>>>>>>>        sys_image_guid:   0002:c903:00a1:13b3
>>>>>>>>>>        vendor_id:        0x02c9
>>>>>>>>>>        vendor_part_id:   4099
>>>>>>>>>>        hw_ver:           0x0
>>>>>>>>>>        board_id:         MT_1090120019
>>>>>>>>>>        phys_port_cnt:    2
>>>>>>>>>>                port:   1
>>>>>>>>>>                        state:       PORT_ACTIVE (4)
>>>>>>>>>>                        max_mtu:     4096 (5)
>>>>>>>>>>                        active_mtu:  4096 (5)
>>>>>>>>>>                        sm_lid:      1
>>>>>>>>>>                        port_lid:    83
>>>>>>>>>>                        port_lmc:    0x00
>>>>>>>>>>
>>>>>>>>>>                port:   2
>>>>>>>>>>                        state:       PORT_DOWN (1)
>>>>>>>>>>                        max_mtu:     4096 (5)
>>>>>>>>>>                        active_mtu:  4096 (5)
>>>>>>>>>>                        sm_lid:      0
>>>>>>>>>>                        port_lid:    0
>>>>>>>>>>                        port_lmc:    0x00
>>>>>>>>>>
>>>>>>>>>>Best regards,
>>>>>>>>>>Timur.
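Given that ibv_devinfo above reports port 1 of mlx4_0 as PORT_ACTIVE and port 2 as PORT_DOWN, a short sanity check before pinning MXM to a port might look like this; a sketch using standard verbs tools:

# confirm which mlx4_0 port is up, then point MXM at it explicitly
ibv_devinfo -d mlx4_0 | grep -e '^hca_id' -e 'port:' -e 'state:'
export MXM_IB_PORTS=mlx4_0:1   # port 1 is PORT_ACTIVE; port 2 is PORT_DOWN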
>>>>>>>>>>
>>>>>>>>>>Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>>>>>>>>>>>Hi Timur,
>>>>>>>>>>>seems that the yalla component was not found in your OMPI tree.
>>>>>>>>>>>can it be that your mpirun is not from hpcx? Can you please check LD_LIBRARY_PATH, PATH, LD_PRELOAD and OPAL_PREFIX, that they are pointing to the right mpirun?
>>>>>>>>>>>
>>>>>>>>>>>Also, could you please check that yalla is present in the "ompi_info -l 9" output?
>>>>>>>>>>>
>>>>>>>>>>>Thanks
>>>>>>>>>>>
>>>>>>>>>>>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>>>>>I can password-less ssh to all nodes:
>>>>>>>>>>>>base$ ssh node1
>>>>>>>>>>>>node1$ ssh node2
>>>>>>>>>>>>Last login: Mon May 25 18:41:23
>>>>>>>>>>>>node2$ ssh node3
>>>>>>>>>>>>Last login: Mon May 25 16:25:01
>>>>>>>>>>>>node3$ ssh node4
>>>>>>>>>>>>Last login: Mon May 25 16:27:04
>>>>>>>>>>>>node4$
>>>>>>>>>>>>
>>>>>>>>>>>>Is this correct?
>>>>>>>>>>>>
>>>>>>>>>>>>In ompi-1.9 I do not have the no-tree-spawn problem.
>>>>>>>>>>>>
>>>>>>>>>>>>Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>I can’t speak to the mxm problem, but the no-tree-spawn issue indicates that you don’t have password-less ssh authorized between the compute nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>>>>>>>>>Hello!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>>>>>>>>>>>OFED-1.5.4.1;
>>>>>>>>>>>>>>CentOS release 6.2;
>>>>>>>>>>>>>>InfiniBand 4x FDR
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>I have two problems:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>1. I cannot use mxm:
>>>>>>>>>>>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>A requested component was not found, or was unable to be opened. This
>>>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>>>that the component requires are unable to be found/loaded). Note that
>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Host: node14
>>>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>>>likely to abort. There are many reasons that a parallel process can
>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>>>problems. This failure appears to be an internal failure; here's some
>>>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>>>developer):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>>>[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>>>[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>>>[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>Primary job terminated normally, but 1 process returned
>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>>>>>>>>the job to be terminated. The first process to do so was:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Process name: [[9372,1],2]
>>>>>>>>>>>>>>  Exit code: 1
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>>>[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>>>[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>1.b) $mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>A requested component was not found, or was unable to be opened. This
>>>>>>>>>>>>>>means that this component is either not installed or is unable to be
>>>>>>>>>>>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>>>>>>>>>>>that the component requires are unable to be found/loaded). Note that
>>>>>>>>>>>>>>Open MPI stopped checking at the first component that it did not find.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Host: node5
>>>>>>>>>>>>>>Framework: pml
>>>>>>>>>>>>>>Component: yalla
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>>>[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>>>>>>>likely to abort. There are many reasons that a parallel process can
>>>>>>>>>>>>>>fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>>>>>>>>problems. This failure appears to be an internal failure; here's some
>>>>>>>>>>>>>>additional information (which may only be relevant to an Open MPI
>>>>>>>>>>>>>>developer):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  mca_pml_base_open() failed
>>>>>>>>>>>>>>  --> Returned "Not found" (-13) instead of "Success" (0)
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>Primary job terminated normally, but 1 process returned
>>>>>>>>>>>>>>a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>>>>>>>-------------------------------------------------------
>>>>>>>>>>>>>>*** An error occurred in MPI_Init
>>>>>>>>>>>>>>*** on a NULL communicator
>>>>>>>>>>>>>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>>>>>>>>*** and potentially your MPI job)
>>>>>>>>>>>>>>[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages,
>>>>>>>>>>>>>>and not able to guarantee that all other processes were killed!
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>>>>>>>>the job to be terminated. The first process to do so was:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Process name: [[9619,1],0]
>>>>>>>>>>>>>>  Exit code: 1
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
>>>>>>>>>>>>>>[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line:
>>>>>>>>>>>>>>$mpirun -host node5,node14,node28,node29 -np 4 ./hello
>>>>>>>>>>>>>>sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
>>>>>>>>>>>>>>sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>ORTE was unable to reliably start one or more daemons.
>>>>>>>>>>>>>>This usually is caused by:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* not finding the required libraries and/or binaries on
>>>>>>>>>>>>>>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>>>>>>>>>>  settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* lack of authority to execute on one or more specified nodes.
>>>>>>>>>>>>>>  Please verify your allocation and authorities.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>>>>>>>>>>>  Please check with your sys admin to determine the correct location to use.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* compilation of the orted with dynamic libraries when static are required
>>>>>>>>>>>>>>  (e.g., on Cray). Please check your configure cmd line and consider using
>>>>>>>>>>>>>>  one of the contrib/platform definitions for your system type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>* an inability to create a connection back to mpirun due to a
>>>>>>>>>>>>>>  lack of common network interfaces and/or no route found between
>>>>>>>>>>>>>>  them. Please check network connectivity (including firewalls
>>>>>>>>>>>>>>  and network routing requirements).
>>>>>>>>>>>>>>--------------------------------------------------------------------------
>>>>>>>>>>>>>>mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Thank you for your comments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Best regards,
>>>>>>>>>>>>>>Timur.
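Ralph's hypothesis above, that tree spawn fails because compute-to-compute ssh is not password-less, can be tested non-interactively. A sketch using the node names from this thread; BatchMode makes ssh fail fast instead of prompting:

# tree spawn needs node5 to be able to launch orted on node14 without a prompt
ssh -o BatchMode=yes node5 "ssh -o BatchMode=yes node14 true" \
  && echo "node5 -> node14: password-less ssh OK"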
>>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>>users mailing list
>>>>>>>>>>>>>>us...@open-mpi.org
>>>>>>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>>Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
>>>>>>>>>>>>
>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>users mailing list
>>>>>>>>>>>>us...@open-mpi.org
>>>>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26922.php
>
>--
>Kind Regards,
>M.
Attachments:
cm_mxm.out (binary data)
yalla.out (binary data)