We have a couple of clusters with QLogic InfiniPath/Intel TrueScale networking. While testing a kernel upgrade we found that the TrueScale drivers no longer build against recent RHEL kernels. Intel told us that the Omni-Path drivers will work with TrueScale adapters, so we installed those. Basic functionality appears fine; however, we are having trouble getting Open MPI to work.
With our existing builds of Open MPI 1.10, jobs receive lots of signal 11s and crash (output attached). If we modify LD_LIBRARY_PATH to point to the directory containing the compatibility library provided as part of the Omni-Path drivers, they instead complain about not finding /dev/hfi1_0, which exists on our cluster with actual Omni-Path hardware but not on the clusters with TrueScale (output also attached).

We had a similar issue with Intel MPI, but there it was possible to get it working by passing a -psm option to mpirun. That, combined with the mention of PSM2 in the output complaining about /dev/hfi1_0, makes us think Open MPI is trying to run with PSM2 rather than the original PSM, and failing because PSM2 isn't supported by TrueScale. We hoped there would be an MCA parameter, or combination of parameters, that would resolve this, but while Googling has turned up a few settings that look like they should force the use of PSM over PSM2, none of them seem to make a difference. Any suggestions?

William
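For reference, this is the sort of MCA selection we have been trying, pieced together from list posts (so the exact component names here are our assumption rather than anything Intel documented):

```shell
# Attempted workaround (sketch): force the cm PML with the PSM (not PSM2) MTL.
# "psm" and "psm2" are the Open MPI 1.10 MTL component names as we understand
# them; whether this combination is sufficient on TrueScale is exactly our
# question.

# First, list which MTL components this Open MPI build actually contains:
ompi_info | grep -i mtl

# Then explicitly select the cm PML and the psm MTL:
mpirun --mca pml cm --mca mtl psm ./mpi_pi
```

As noted above, variations on this have not made any visible difference for us so far.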
mpi_pi:16465 terminated with signal 11 at PC=2b213094aa0e SP=7ffc6d5ba5e0. Backtrace:
mpi_pi:16470 terminated with signal 11 at PC=2ae8d364fa0e SP=7ffce1c62ee0. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ae8d364fa0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ae8d4026c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:16463 terminated with signal 11 at PC=2b368a310a0e SP=7ffd71d817e0. Backtrace:
mpi_pi:16466 terminated with signal 11 at PC=2b1a36c91a0e SP=7ffdbf472be0. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b1a36c91a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1a37668c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:16468 terminated with signal 11 at PC=2ab4a84fba0e SP=7ffe40d69660. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ab4a84fba0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab4a8ed2c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b213094aa0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b2131321c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:16472 terminated with signal 11 at PC=2b373d729a0e SP=7ffce87428e0. Backtrace:
mpi_pi:16464 terminated with signal 11 at PC=2b0253fe4a0e SP=7ffdb96f12e0. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b0253fe4a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b368a310a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b368ace7c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b373d729a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b373e100c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b02549bbc05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:19144 terminated with signal 11 at PC=2ad2bd9aba0e SP=7ffdd91828e0. Backtrace:
mpi_pi:16462 terminated with signal 11 at PC=2ac24f9e5a0e SP=7ffcea97b160. Backtrace:
mpi_pi:19148 terminated with signal 11 at PC=2b413cc4ca0e SP=7ffce3d51ee0. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b413cc4ca0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522]
mpi_pi:16469 terminated with signal 11 at PC=2ae1e8fdda0e SP=7fffa67fe2e0. Backtrace:
mpi_pi:16471 terminated with signal 11 at PC=2ac89c0b5a0e SP=7ffe1157ba60. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ac24f9e5a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac2503bcc05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ad2bd9aba0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ae1e8fdda0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ac89c0b5a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac89ca8cc05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:16461 terminated with signal 11 at PC=2b36e76fea0e SP=7ffcfafc8ce0. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b36e76fea0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b36e80d5c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:16467 terminated with signal 11 at PC=2b8f727bba0e SP=7fff92cb4360. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b8f727bba0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b8f73192c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:19150 terminated with signal 11 at PC=2b0532a9da0e SP=7ffceffbba60. Backtrace:
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ae1e99b4c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b413d623c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ad2be382c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:19152 terminated with signal 11 at PC=2b354780ea0e SP=7fff79407660. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b354780ea0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b35481e5c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:19145 terminated with signal 11 at PC=2ac42f835a0e SP=7ffd315cf0e0. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b0532a9da0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b0533474c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:19146 terminated with signal 11 at PC=2ab123095a0e SP=7ffe36abc3e0. Backtrace:
mpi_pi:19149 terminated with signal 11 at PC=2ab36bfaea0e SP=7ffe6c72cce0. Backtrace:
mpi_pi:19153 terminated with signal 11 at PC=2b65d0427a0e SP=7ffef6fb2960. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ab36bfaea0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab36c985c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ac42f835a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac43020cc05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ab123095a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab123a6cc05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b65d0427a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b65d0dfec05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
mpi_pi:19154 terminated with signal 11 at PC=2b0150df2a0e SP=7ffc3ed9bce0. Backtrace:
mpi_pi:19147 terminated with signal 11 at PC=2ba41ff9ca0e SP=7ffe444e30e0. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b0150df2a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b01517c9c05]
mpi_pi:19151 terminated with signal 11 at PC=2ad302bb4a0e SP=7ffebb06d5e0. Backtrace:
mpi_pi:19156 terminated with signal 11 at PC=2b77dba1ca0e SP=7ffe52d8d0e0. Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ba41ff9ca0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ba420973c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b77dba1ca0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b77dc3f3c05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ad302bb4a0e] /home/ccaawih/openmpi_pi/mpi_pi[0x401522] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ad30358bc05] /home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[9318,1],9]
  Exit code:    1
--------------------------------------------------------------------------
+ date
Fri 19 Jan 15:16:52 GMT 2018
node-x02f-024.18227hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18228hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18229hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18231hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18226hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18230hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18233hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18232hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21044hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21046hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21049hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18235hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21045hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21047hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18234hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21051hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21048hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21050hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21053hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18236hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21052hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18237hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21054hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21055hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18231hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18226PSM2 no hfi units are available (err=23)
node-x02f-024.18227PSM2 no hfi units are available (err=23)
node-x02f-024.18227hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18228PSM2 no hfi units are available (err=23)
node-x02f-024.18228hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18229PSM2 no hfi units are available (err=23)
node-x02f-024.18226hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18230PSM2 no hfi units are available (err=23)
node-x02f-024.18229hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18231PSM2 no hfi units are available (err=23)
node-x02f-021.21047hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18230hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21044PSM2 no hfi units are available (err=23)
node-x02f-021.21045PSM2 no hfi units are available (err=23)
node-x02f-021.21046PSM2 no hfi units are available (err=23)
node-x02f-021.21047PSM2 no hfi units are available (err=23)
node-x02f-021.21049PSM2 no hfi units are available (err=23)
node-x02f-021.21049hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21044hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21045hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21046hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18235hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18235PSM2 no hfi units are available (err=23)
node-x02f-021.21048PSM2 no hfi units are available (err=23)
node-x02f-021.21050PSM2 no hfi units are available (err=23)
node-x02f-021.21051PSM2 no hfi units are available (err=23)
node-x02f-021.21053PSM2 no hfi units are available (err=23)
node-x02f-021.21051hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21048hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21053hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21050hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18233hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18233PSM2 no hfi units are available (err=23)
node-x02f-024.18232hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18232PSM2 no hfi units are available (err=23)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
node-x02f-024.18234hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18234PSM2 no hfi units are available (err=23)
node-x02f-024.18236hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-024.18236PSM2 no hfi units are available (err=23)
node-x02f-024.18237PSM2 no hfi units are available (err=23)
node-x02f-024.18237hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21052hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21052PSM2 no hfi units are available (err=23)
node-x02f-021.21054PSM2 no hfi units are available (err=23)
node-x02f-021.21054hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21055hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
node-x02f-021.21055PSM2 no hfi units are available (err=23)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.
This failure appears to be an internal failure; here's some additional
information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21047] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21048] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21050] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21051] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21049] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21054] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21055] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21052] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21044] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21045] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21046] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-021:21053] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18228] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18229] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18230] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18235] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18234] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18227] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18237] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18226] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18233] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18236] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18231] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-x02f-024:18232] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[51773,1],14]
  Exit code:    1
--------------------------------------------------------------------------
[node-x02f-021:21038] 47 more processes have sent help message help-mtl-psm.txt / unable to open endpoint
[node-x02f-021:21038] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
+ date
Fri 19 Jan 15:24:53 GMT 2018
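In case it helps, this is how we have been comparing the PSM device nodes on the two fabrics (the ipath/qib names are what we see on our TrueScale nodes; treat them as our assumption about what the drivers create rather than anything authoritative):

```shell
# Diagnostic sketch: compare device nodes on a TrueScale node vs an
# Omni-Path node. On TrueScale we expect /dev/ipath* (qib driver);
# /dev/hfi1_0 only exists on the real Omni-Path cluster, which matches
# the PSM2 "no hfi units" errors in the output above.
ls -l /dev/ipath* /dev/hfi1* 2>/dev/null

# And confirm which fabric driver is actually loaded:
lsmod | grep -E 'qib|hfi1'
```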
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users