Re: [OMPI users] Startup limited to 128 remote hosts in some situations?

2017-01-18 Thread William Hay
On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:
> As I recall, the problem was that qrsh isn't available on the backend
> compute nodes, and so we can't use a tree for launch. If that isn't true,
> then we can certainly adjust it.
>
qrsh should be available on all nodes of a SoGE cluster but, depending on how
things are set up, may not be findable (i.e. not in the PATH) when you
qrsh -inherit into a node.  A workaround would be to start the backend
processes with qrsh -inherit -v PATH, which will copy the PATH from the master
node to the slave node process, or otherwise pass the location of qrsh from
one node to another.  That of course assumes that qrsh is in the same location
on all nodes.
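For illustration only (node002/node003 are placeholder node names, and
"hostname" just stands in for the real remote command), the idea is roughly:

  qrsh -inherit -v PATH node002 hostname
  qrsh -inherit -v PATH node002 qrsh -inherit -v PATH node003 hostname

The second form is the head node -> slave -> slave case mentioned below.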

I've tested that it is possible to qrsh from the head node of a job to a slave 
node and then on to
another slave node by this method.

William


> > On Jan 17, 2017, at 9:37 AM, Mark Dixon  wrote:
> > 
> > Hi,
> > 
> > While commissioning a new cluster, I wanted to run HPL across the whole 
> > thing using openmpi 2.0.1.
> > 
> > I couldn't get it to start on more than 129 hosts under Son of Gridengine 
> > (128 remote plus the localhost running the mpirun command). openmpi would 
> > sit there, waiting for all the orted's to check in; however, there were 
> > "only" a maximum of 128 qrsh processes, therefore a maximum of 128 orted's, 
> > therefore waiting a long time.
> > 
> > Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job to 
> > launch.
> > 
> > Is this intentional, please?
> > 
> > Doesn't openmpi use a tree-like startup sometimes - any particular reason 
> > it's not using it here?
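For reference, the plm_rsh_num_concurrent workaround described in the quoted
message can be set on the mpirun command line or via the environment; the
value below is just an illustration large enough for the 129-host case, and
the process count is a placeholder:

  mpirun --mca plm_rsh_num_concurrent 256 -np <nprocs> ./xhpl

or, equivalently, export OMPI_MCA_plm_rsh_num_concurrent=256 before calling
mpirun.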



[OMPI users] BLCR + Qlogic infiniband

2012-11-28 Thread William Hay
I'm trying to build openmpi with support for BLCR plus qlogic infiniband
(plus grid engine).  Everything seems to compile OK and checkpoints are
taken but whenever I try to restore a checkpoint I get the following error:
- do_mmap(, 2aaab18c7000, 1000, ...) failed:
ffea
- mmap failed: /dev/ipath
- thaw_threads returned error, aborting. -22
- thaw_threads returned error, aborting. -22
Restart failed: Invalid argument

This occurs whether I specify psm or openib as the btl.

This looks like the sort of thing I would expect to be handled by the blcr
supporting code in openmpi.  So I guess I have a couple of questions.
1) Are Infiniband and BLCR support in openmpi compatible?
2) Are there any special tricks necessary to get them working together?
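In case it is relevant, the checkpoint/restart cycle is being driven with the
standard tools, i.e. something along these lines (the process count and the
application are placeholders):

  mpirun -am ft-enable-cr -np 4 ./a.out
  ompi-checkpoint <PID of the mpirun above>
  ompi-restart <snapshot reference reported by ompi-checkpoint>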


Re: [OMPI users] BLCR + Qlogic infiniband

2012-12-04 Thread William Hay
On 28 November 2012 11:14, William Hay  wrote:

> I'm trying to build openmpi with support for BLCR plus qlogic infiniband
> (plus grid engine).  Everything seems to compile OK and checkpoints are
> taken but whenever I try to restore a checkpoint I get the following error:
> - do_mmap(, 2aaab18c7000, 1000, ...) failed:
> ffea
> - mmap failed: /dev/ipath
> - thaw_threads returned error, aborting. -22
> - thaw_threads returned error, aborting. -22
> Restart failed: Invalid argument
>
> This occurs whether I specify psm or openib as the btl.
>
> This looks like the sort of thing I would expect to be handled by the blcr
> supporting code in openmpi.  So I guess I have a couple of questions.
> 1) Are Infiniband and BLCR support in openmpi compatible?
> 2) Are there any special tricks necessary to get them working together?
>
> A third question occurred to me that may be relevant.  How do I verify
that my openmpi install has blcr support built in?  I would have thought
this would mean that either mpiexec or binaries built with mpicc would have
libcr linked in.  However running ldd doesn't report this in either case.
 I'm setting LD_PRELOAD to point to it but I would have thought openmpi
would need to register a callback with blcr and it would be easier to do
this if the library were linked in rather than trying to detect whether it
has been LD_PRELOADed.  I'm building with the following options:
./configure --prefix=/home/ccaawih/openmpi-blcr --with-openib --without-psm
--with-blcr=/usr --with-blcr-libdir=/usr/lib64 --with-ft=cr
--enable-ft-thread --enable-mpi-threads --with-sge
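For what it's worth, the check I would expect to be able to run (assuming the
checkpoint/restart code is built as loadable components rather than linked
directly into the binaries) is something like:

  ompi_info | grep -i crs
  ompi_info --param crs blcr

but I'm not sure whether that is the right test either.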


[OMPI users] OpenMPI with PSM on True Scale with OmniPath drivers

2018-01-22 Thread William Hay
We have a couple of clusters with QLogic InfiniPath/Intel True Scale
networking.  While testing a kernel upgrade we found that the True Scale
drivers no longer build against recent RHEL kernels.  Intel tells us
that the OmniPath drivers will work for True Scale adapters, so we
installed those.  Basic functionality appears fine; however, we are having
trouble getting OpenMPI to work.

Using our existing builds of OpenMPI 1.10, jobs receive lots of signal
11s and crash (output attached).

If we modify LD_LIBRARY_PATH to point to the directory containing the
compatibility library provided as part of the OmniPath drivers, it instead
produces complaints about not finding /dev/hfi1_0, which exists on our
cluster with actual OmniPath but not on the clusters with True Scale
(output also attached).
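For concreteness, the modification is roughly the following (the psm2-compat
directory is from memory and may differ on other installs; mpi_pi is our test
program):

  export LD_LIBRARY_PATH=/usr/lib64/psm2-compat:$LD_LIBRARY_PATH
  mpirun -np 8 ./mpi_pi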

We had a similar issue with Intel MPI, but there it was possible to get
it to work by passing a -psm option to mpirun.  That, combined with the
mention of PSM2 in the output complaining about /dev/hfi1_0, makes
us think OpenMPI is trying to run with PSM2 rather than the original
PSM and failing because PSM2 isn't supported by True Scale.

We hoped that there would be an MCA parameter or combination of parameters
that would resolve this issue, but while Googling has turned up a few
things that look like they should force the use of PSM over PSM2, none of
them seems to make a difference.
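For example, the sort of settings we have been trying looks like the
following (illustrative only; syntax taken from various posts, with our test
program and an arbitrary process count):

  mpirun --mca pml cm --mca mtl psm -np 8 ./mpi_pi
  mpirun --mca mtl ^psm2 -np 8 ./mpi_pi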

Any suggestions?

William

mpi_pi:16465 terminated with signal 11 at PC=2b213094aa0e SP=7ffc6d5ba5e0.  
Backtrace:

mpi_pi:16470 terminated with signal 11 at PC=2ae8d364fa0e SP=7ffce1c62ee0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ae8d364fa0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ae8d4026c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:16463 terminated with signal 11 at PC=2b368a310a0e SP=7ffd71d817e0.  
Backtrace:

mpi_pi:16466 terminated with signal 11 at PC=2b1a36c91a0e SP=7ffdbf472be0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b1a36c91a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1a37668c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:16468 terminated with signal 11 at PC=2ab4a84fba0e SP=7ffe40d69660.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ab4a84fba0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab4a8ed2c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b213094aa0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b2131321c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:16472 terminated with signal 11 at PC=2b373d729a0e SP=7ffce87428e0.  
Backtrace:

mpi_pi:16464 terminated with signal 11 at PC=2b0253fe4a0e SP=7ffdb96f12e0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b0253fe4a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b368a310a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b368ace7c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b373d729a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b373e100c05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b02549bbc05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]

mpi_pi:19144 terminated with signal 11 at PC=2ad2bd9aba0e SP=7ffdd91828e0.  
Backtrace:

mpi_pi:16462 terminated with signal 11 at PC=2ac24f9e5a0e SP=7ffcea97b160.  
Backtrace:

mpi_pi:19148 terminated with signal 11 at PC=2b413cc4ca0e SP=7ffce3d51ee0.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2b413cc4ca0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]

mpi_pi:16469 terminated with signal 11 at PC=2ae1e8fdda0e SP=7fffa67fe2e0.  
Backtrace:

mpi_pi:16471 terminated with signal 11 at PC=2ac89c0b5a0e SP=7ffe1157ba60.  
Backtrace:
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ac24f9e5a0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac2503bcc05]
/home/ccaawih/openmpi_pi/mpi_pi[0x4013f9]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ad2bd9aba0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ae1e8fdda0e]
/home/ccaawih/openmpi_pi/mpi_pi[0x401522]
/shared/ucl/apps/openmpi/1.10.1/no-verbs/gnu-4.9.2/lib/libmpi.so.12(PMPI_Comm_size+0x3e)[0x2ac89c0b5a0e]
/home/cc