We have a couple of clusters with QLogic InfiniPath / Intel True Scale
networking.  While testing a kernel upgrade we found that the True Scale
drivers no longer build against recent RHEL kernels.  Intel told us
that the Omni-Path drivers work with True Scale adapters, so we
installed those.  Basic functionality appears fine; however, we are
having trouble getting Open MPI to work.

Using our existing builds of Open MPI 1.10, jobs receive lots of signal
11 (segmentation fault) and crash (output attached).

If we modify LD_LIBRARY_PATH to point at the directory containing the
compatibility library provided as part of the Omni-Path drivers, it instead
produces complaints about not finding /dev/hfi1_0, which exists on our
cluster with actual Omni-Path hardware but not on the clusters with
True Scale (output also attached).
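For reference, the change was along these lines (the install path shown is illustrative; the actual location depends on where the Omni-Path driver package puts its PSM compatibility library on a given system):

```shell
# Prepend the Omni-Path PSM compatibility library directory so it is
# picked up ahead of the original True Scale PSM library.
# NOTE: /usr/lib64/psm2-compat is an assumed example path, not the
# verified location on our systems.
export LD_LIBRARY_PATH=/usr/lib64/psm2-compat:$LD_LIBRARY_PATH
```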

We had a similar issue with Intel MPI, but there it was possible to get
it to work by passing a -psm option to mpirun.  That, combined with the
mention of PSM2 in the output complaining about /dev/hfi1_0, makes
us think Open MPI is trying to run with PSM2 rather than the original
PSM, and failing because PSM2 isn't supported by True Scale.

We hoped there would be an MCA parameter, or combination of parameters,
that would resolve this issue, but while Googling has turned up a few
things that look like they should force the use of PSM over PSM2, none of
them seems to make a difference.
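The sorts of invocations we tried were along these lines (standard Open MPI MCA selection syntax; neither changed the behaviour for us):

```shell
# Explicitly request the cm PML with the PSM MTL, rather than
# letting Open MPI pick PSM2.
mpirun --mca pml cm --mca mtl psm ./mpi_pi

# Alternatively, exclude the PSM2 MTL so only PSM remains a candidate.
mpirun --mca mtl ^psm2 ./mpi_pi
```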

Any suggestions?

William
mpi_pi:16465 terminated with signal 11 at PC=2b213094aa0e SP=7ffc6d5ba5e0.  
Backtrace:

mpi_pi:16470 terminated with signal 11 at PC=2ae8d364fa0e SP=7ffce1c62ee0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ae8d364fa0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ae8d4026c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:16463 terminated with signal 11 at PC=2b368a310a0e SP=7ffd71d817e0.  
Backtrace:

mpi_pi:16466 terminated with signal 11 at PC=2b1a36c91a0e SP=7ffdbf472be0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b1a36c91a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1a37668c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:16468 terminated with signal 11 at PC=2ab4a84fba0e SP=7ffe40d69660.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ab4a84fba0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab4a8ed2c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b213094aa0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b2131321c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:16472 terminated with signal 11 at PC=2b373d729a0e SP=7ffce87428e0.  
Backtrace:

mpi_pi:16464 terminated with signal 11 at PC=2b0253fe4a0e SP=7ffdb96f12e0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b0253fe4a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b368a310a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b368ace7c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b373d729a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b373e100c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b02549bbc05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:19144 terminated with signal 11 at PC=2ad2bd9aba0e SP=7ffdd91828e0.  
Backtrace:

mpi_pi:16462 terminated with signal 11 at PC=2ac24f9e5a0e SP=7ffcea97b160.  
Backtrace:

mpi_pi:19148 terminated with signal 11 at PC=2b413cc4ca0e SP=7ffce3d51ee0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b413cc4ca0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]

mpi_pi:16469 terminated with signal 11 at PC=2ae1e8fdda0e SP=7fffa67fe2e0.  
Backtrace:

mpi_pi:16471 terminated with signal 11 at PC=2ac89c0b5a0e SP=7ffe1157ba60.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ac24f9e5a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac2503bcc05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ad2bd9aba0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ae1e8fdda0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ac89c0b5a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac89ca8cc05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:16461 terminated with signal 11 at PC=2b36e76fea0e SP=7ffcfafc8ce0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b36e76fea0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b36e80d5c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:16467 terminated with signal 11 at PC=2b8f727bba0e SP=7fff92cb4360.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b8f727bba0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b8f73192c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:19150 terminated with signal 11 at PC=2b0532a9da0e SP=7ffceffbba60.  
Backtrace:
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ae1e99b4c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b413d623c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ad2be382c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:19152 terminated with signal 11 at PC=2b354780ea0e SP=7fff79407660.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b354780ea0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b35481e5c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:19145 terminated with signal 11 at PC=2ac42f835a0e SP=7ffd315cf0e0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b0532a9da0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b0533474c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:19146 terminated with signal 11 at PC=2ab123095a0e SP=7ffe36abc3e0.  
Backtrace:

mpi_pi:19149 terminated with signal 11 at PC=2ab36bfaea0e SP=7ffe6c72cce0.  
Backtrace:

mpi_pi:19153 terminated with signal 11 at PC=2b65d0427a0e SP=7ffef6fb2960.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ab36bfaea0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab36c985c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ac42f835a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac43020cc05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ab123095a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab123a6cc05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b65d0427a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b65d0dfec05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:19154 terminated with signal 11 at PC=2b0150df2a0e SP=7ffc3ed9bce0.  
Backtrace:

mpi_pi:19147 terminated with signal 11 at PC=2ba41ff9ca0e SP=7ffe444e30e0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b0150df2a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b01517c9c05]

mpi_pi:19151 terminated with signal 11 at PC=2ad302bb4a0e SP=7ffebb06d5e0.  
Backtrace:

mpi_pi:19156 terminated with signal 11 at PC=2b77dba1ca0e SP=7ffe52d8d0e0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ba41ff9ca0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ba420973c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b77dba1ca0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b77dc3f3c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ad302bb4a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ad30358bc05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[9318,1],9]
  Exit code:    1
--------------------------------------------------------------------------
+ date
Fri 19 Jan 15:16:52 GMT 2018

node-x02f-024.18227hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18228hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18229hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18231hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18226hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18230hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18233hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18232hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21044hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21046hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21049hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18235hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21045hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21047hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18234hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21051hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21048hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21050hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21053hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18236hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21052hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18237hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21054hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21055hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18231hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18226PSM2 no hfi units are available (err=23)
node-x02f-024.18227PSM2 no hfi units are available (err=23)
node-x02f-024.18227hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18228PSM2 no hfi units are available (err=23)
node-x02f-024.18228hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18229PSM2 no hfi units are available (err=23)
node-x02f-024.18226hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18230PSM2 no hfi units are available (err=23)
node-x02f-024.18229hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18231PSM2 no hfi units are available (err=23)
node-x02f-021.21047hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18230hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21044PSM2 no hfi units are available (err=23)
node-x02f-021.21045PSM2 no hfi units are available (err=23)
node-x02f-021.21046PSM2 no hfi units are available (err=23)
node-x02f-021.21047PSM2 no hfi units are available (err=23)
node-x02f-021.21049PSM2 no hfi units are available (err=23)
node-x02f-021.21049hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21044hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21045hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21046hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18235hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18235PSM2 no hfi units are available (err=23)
node-x02f-021.21048PSM2 no hfi units are available (err=23)
node-x02f-021.21050PSM2 no hfi units are available (err=23)
node-x02f-021.21051PSM2 no hfi units are available (err=23)
node-x02f-021.21053PSM2 no hfi units are available (err=23)
node-x02f-021.21051hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21048hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21053hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21050hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18233hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18233PSM2 no hfi units are available (err=23)
node-x02f-024.18232hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18232PSM2 no hfi units are available (err=23)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning. 

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
node-x02f-024.18234hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18234PSM2 no hfi units are available (err=23)
node-x02f-024.18236hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-024.18236PSM2 no hfi units are available (err=23)
node-x02f-024.18237PSM2 no hfi units are available (err=23)
node-x02f-024.18237hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21052hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21052PSM2 no hfi units are available (err=23)
node-x02f-021.21054PSM2 no hfi units are available (err=23)
node-x02f-021.21054hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21055hfi_wait_for_device: The /dev/hfi1_0 device failed to appear 
after 15.0 seconds: Connection timed out
node-x02f-021.21055PSM2 no hfi units are available (err=23)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21047] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21048] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21050] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21051] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21049] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21054] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21055] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21052] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21044] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21045] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21046] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-021:21053] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18228] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18229] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18230] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18235] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18234] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18227] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18237] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18226] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18233] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18236] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18231] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-x02f-024:18232] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[51773,1],14]
  Exit code:    1
--------------------------------------------------------------------------
[node-x02f-021:21038] 47 more processes have sent help message help-mtl-psm.txt 
/ unable to open endpoint
[node-x02f-021:21038] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages
+ date
Fri 19 Jan 15:24:53 GMT 2018


_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
