Re: [OMPI users] scatter/gather, tcp, 3 nodes, homogeneous, # RAM

2016-06-15 Thread Gilles Gouaillardet
Here is the idea on how to get the number of tasks per node:

    MPI_Comm intranode_comm;
    int tasks_per_local_node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &intranode_comm);
    MPI_Comm_size(intranode_comm, &tasks_per_local_node);
    MPI_Comm_free(&intranode_comm);
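For reference, a compilable version of the same idea (a minimal sketch; error checking omitted):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm intranode_comm;
        int tasks_per_local_node;

        MPI_Init(&argc, &argv);
        /* Group the ranks that share a node into one communicator each */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &intranode_comm);
        /* The size of that communicator is the task count on this node */
        MPI_Comm_size(intranode_comm, &tasks_per_local_node);
        printf("tasks on my node: %d\n", tasks_per_local_node);
        MPI_Comm_free(&intranode_comm);
        MPI_Finalize();
        return 0;
    }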

Re: [OMPI users] scatter/gather, tcp, 3 nodes, homogeneous, # RAM

2016-06-15 Thread MM
On 14 June 2016 at 13:56, Gilles Gouaillardet wrote:
On Tuesday, June 14, 2016, MM wrote:
> Hello,
> I have the following 3 1-socket nodes:
>
> node1: 4GB RAM 2-core: rank 0 rank 1
> node2: 4GB RAM 4-core: rank 2 rank 3 rank 4 rank 5
> node3: 8GB RAM 4-core: rank 6 rank 7 rank 8 rank 9
>

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-15 Thread Jason Maldonis
Hi Gilles, I would like to be able to run on anywhere from 1-16 nodes. Let me explain our (mpi/parallelism) situation briefly for more context: We have a "master" job that needs MPI functionality. This master job is written in python (we use mpi4py). The master job then makes spawn calls out to
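For readers unfamiliar with the pattern being described, a minimal sketch of a dynamic spawn in C (the worker binary name and count here are hypothetical; the poster does this from Python via mpi4py):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm workers;
        int errcodes[4];

        MPI_Init(&argc, &argv);
        /* Launch 4 copies of a (hypothetical) worker executable; the
           intercommunicator "workers" connects the master and children */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &workers, errcodes);
        MPI_Comm_disconnect(&workers);
        MPI_Finalize();
        return 0;
    }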

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Gus Correa
On 06/15/2016 02:35 PM, Sasso, John (GE Power, Non-GE) wrote:
Chuck,
The per-process limits appear fine, including those for the resource mgr daemons:
Limit              Soft Limit  Hard Limit  Units
Max address space  unlimited   unlimited   bytes

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Nathan Hjelm
ibv_devinfo -v

-Nathan

On Jun 15, 2016, at 12:43 PM, "Sasso, John (GE Power, Non-GE)" wrote:
QUESTION: Since the error said the system may have run out of queue pairs, how do I determine the # of queue pairs the IB HCA can support?
-Original Message-
From: users [mailto:users-boun.
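For reference, the device attribute to look for in the verbose output is max_qp (max_qp_wr, the per-QP work-request depth, is usually listed alongside it), e.g.:

    ibv_devinfo -v | grep -i max_qp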

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Nathan Hjelm
You ran out of queue pairs. There is no way around this for larger all-to-all transfers when using the openib btl and SRQ: you need O(cores^2) QPs to fully connect with SRQ or PP QPs. I recommend using XRC instead by adding:
btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512
to your o
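For reference, two common ways to apply such an MCA setting (a sketch; the install prefix varies by site):

    # system-wide, in <prefix>/etc/openmpi-mca-params.conf
    btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512

    # or per run, on the mpirun command line
    mpirun --mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ...

The leading X in each spec selects XRC receive queues, which are shared per node rather than per process, so the QP count no longer grows with the square of the core count.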

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
QUESTION: Since the error said the system may have run out of queue pairs, how do I determine the # of queue pairs the IB HCA can support?

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Sasso, John (GE Power, Non-GE)
Sent: Wednesday, June 15, 2016 2:35

[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
Chuck,
The per-process limits appear fine, including those for the resource mgr daemons:
Limit               Soft Limit  Hard Limit  Units
Max address space   unlimited   unlimited   bytes
Max core file size  0           0

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Gus Correa
Hi John
1) For diagnostics, you could check the actual "per process" limits on the nodes while that big job is running:
cat /proc/$PID/limits
2) If you're using a resource manager to launch the job, the resource manager daemon/daemons (local to the nodes) may have to set the memlock and oth
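Along the same lines, a minimal sketch that reports the memlock limit from inside the job itself, which is where a resource-manager-launched process can differ from a login shell (getrlimit exposes the same information as /proc/$PID/limits):

    #include <stdio.h>
    #include <sys/resource.h>
    #include <mpi.h>

    /* Print this rank's memlock soft limit; RLIM_INFINITY means "unlimited" */
    int main(int argc, char **argv)
    {
        int rank;
        struct rlimit rl;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        getrlimit(RLIMIT_MEMLOCK, &rl);
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("rank %d: memlock soft limit = unlimited\n", rank);
        else
            printf("rank %d: memlock soft limit = %llu bytes\n",
                   rank, (unsigned long long)rl.rlim_cur);
        MPI_Finalize();
        return 0;
    }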

Re: [OMPI users] Client-Server Shared Memory Transport

2016-06-15 Thread Ralph Castain
Oh sure - just not shared memory

> On Jun 15, 2016, at 8:29 AM, Louis Williams wrote:
>
> Ralph, thanks for the quick reply. Is cross-job fast transport like
> InfiniBand supported?
>
> Louis
>
> On Tue, Jun 14, 2016 at 3:53 PM Ralph Castain wrote:
> Nope - we d

Re: [OMPI users] Client-Server Shared Memory Transport

2016-06-15 Thread Louis Williams
Ralph, thanks for the quick reply. Is cross-job fast transport like InfiniBand supported?

Louis

On Tue, Jun 14, 2016 at 3:53 PM Ralph Castain wrote:
> Nope - we don’t currently support cross-job shared memory operations.
> Nathan has talked about doing so for vader, but not at this time.
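For context, a minimal sketch of the MPI client/server pieces this thread is about; how the port string gets from server to client (a file, or MPI_Publish_name/MPI_Lookup_name) is left out:

    #include <mpi.h>

    /* Server side: open a port and wait for one client to connect */
    void serve(void)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm client;
        MPI_Open_port(MPI_INFO_NULL, port);
        /* ...hand "port" to the client out of band... */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
        MPI_Comm_disconnect(&client);
        MPI_Close_port(port);
    }

    /* Client side: connect using the server's port string */
    void attach(const char *port)
    {
        MPI_Comm server;
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
        MPI_Comm_disconnect(&server);
    }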

[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
In doing testing with IMB, I find that running a 4200+ core case with the IMB test Alltoall, and message lengths of 16..1024 bytes (as per the -msglog 4:10 IMB option), it fails with:
--
A process failed to create a queue pair.
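An invocation of roughly this shape reproduces the case described (a sketch; the actual hostfile and binary path are assumed):

    mpirun -np 4200 ./IMB-MPI1 -msglog 4:10 Alltoall

Here -msglog 4:10 runs message sizes 2^4 through 2^10 bytes, i.e. the 16..1024-byte range above.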

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Llolsten Kaonga
Hello Mehmet, When we do OS installs, our lab usually just downloads the latest stable version of Open MPI. We try not to move versions of Open MPI we may already have lying around - mostly because we don't trust our book-keeping abilities. We have not had any trouble using this approach but

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Llolsten Kaonga
Hello Sreenidhi,
In our testing, we cannot use Mellanox OFED for compliance reasons. So, we use regular OFED. We test both Mellanox and Intel DUTs (NICs, switches, gateways, etc). I thank you.
--
Llolsten

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Sreenidhi Bha

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Peter Kjellström
On Wed, 15 Jun 2016 15:00:05 +0530 Sreenidhi Bharathkar Ramesh wrote:
> hi Mehmet / Llolsten / Peter,
>
> Just curious to know what is the NIC or fabric you are using in your
> respective clusters.
>
> If it is Mellanox, is it not better to use the MLNX_OFED ?

We run both Mellanox ConnectX3 ba

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-15 Thread Gilles Gouaillardet
Jason,
How many nodes are you running on? Since you have an IB network, IB is used for intra-node communication between tasks that are not part of the same Open MPI job (read: spawn group). I can make a simple patch to use tcp instead of IB for this intra-node communication. Let me know if you are
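Until such a patch exists, one way to take IB out of the picture for testing (a workaround sketch, not necessarily the fix Gilles is proposing) is to restrict the BTLs at launch:

    mpirun --mca btl tcp,sm,self ...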

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Sreenidhi Bharathkar Ramesh
hi Mehmet / Llolsten / Peter,
Just curious to know what NIC or fabric you are using in your respective clusters. If it is Mellanox, is it not better to use MLNX_OFED? This information may help us build our cluster. Hence, asking.
Thanks,
- Sreenidhi.

On Wed, Jun 15, 2016 at 1:17 PM

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Peter Kjellström
On Tue, 14 Jun 2016 13:18:33 -0400 "Llolsten Kaonga" wrote:
> Hello Grigory,
>
> I am not sure what Redhat does exactly but when you install the OS,
> there is always an InfiniBand Support module during the installation
> process. We never check/install that module when we do OS
> installations

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Peter Kjellström
On Tue, 14 Jun 2016 16:20:42 + Grigory Shamov wrote:
> On 2016-06-14, 3:42 AM, "users on behalf of Peter Kjellström" wrote:
>
> >On Mon, 13 Jun 2016 19:04:59 -0400
> >Mehmet Belgin wrote:
> >
> >> Greetings!
> >>
> >> We have not upgraded our OFED stack for a very long time, and still