Hello!

My environment:
  - Open MPI v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2
  - OFED-1.5.4.1
  - CentOS release 6.2
  - InfiniBand 4x FDR
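For completeness, the Open MPI environment on the login node is set up like this (reconstructed from the orted command line quoted under problem 2 below; I have not re-checked it beyond that):

  $ export OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
  $ export PATH=$OPAL_PREFIX/bin:$PATH
  $ export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$LD_LIBRARY_PATH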
I have two problems.

1. I cannot use MXM.

1.a)

$ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node14
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
[node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated.
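(Checks I can run next, assuming the ompi_info from the same HPC-X tree is the first one in PATH on both the login node and the compute nodes; I have not yet captured their output:)

  $ which ompi_info                                    # should point into the HPC-X tree
  $ ompi_info | grep -i -e mxm -e yalla                # are the mxm/yalla components built at all?
  $ ssh node14 'ompi_info | grep -i -e mxm -e yalla'   # assumes PATH is also set for non-interactive ssh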
The first process to do so was:

  Process name: [[9372,1],2]
  Exit code:    1
--------------------------------------------------------------------------
[login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
[login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure

1.b)

$ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node5
Framework: pml
Component: yalla
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated.  The first process to do so was:

  Process name: [[9619,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
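Since the help text mentions shared libraries, I can also check whether the yalla plugin itself resolves its MXM dependency on a compute node, along these lines (this assumes the component file is installed as lib/openmpi/mca_pml_yalla.so under OPAL_PREFIX, which I have not verified):

  $ ssh node14 'ldd /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib/openmpi/mca_pml_yalla.so'
  # any "not found" line here (e.g. for libmxm) would explain why the
  # component cannot be opened on that node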
2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun command line; without it the launch fails:

$ mpirun -host node5,node14,node28,node29 -np 4 ./hello
sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on one or more
  nodes.  Please check your PATH and LD_LIBRARY_PATH settings, or
  configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
  (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray).  Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them.  Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

Thank you for your comments.

Best regards,
Timur
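P.S. If I understand correctly, with tree spawn the orted on one compute node has to start the orted on the next node over ssh, so node-to-node ssh and a remote shell that accepts the quoted command above are required. A sketch of what I plan to check (not yet verified on our cluster):

  $ ssh node5 ssh node14 hostname   # tree spawn needs passwordless node-to-node ssh, not only login-to-node
  $ ssh node5 'echo $SHELL'         # the sh syntax error above makes me suspect the remote shell setup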