I have some updates on these issues and some test results as well.

We upgraded OpenMPI to the latest version, 1.8.2, but when submitting jobs via
the SGE orte parallel environment we received errors whenever more slots are
requested than there are actual cores on the first node allocated to the job.

The 'btl tcp,self' switch passed to mpirun made a significant difference in
performance (quick HPL results below).

Even with the oversubscribe option, the memory mapping errors still persist. On
32-core nodes, with HPL compiled against openmpi/1.8.2, it reliably starts
failing once 20 cores are allocated. Note that I tested with 'btl tcp,self'
defined, and it slows a quick solve down by roughly a factor of 2; the
difference on a larger solve would probably be more dramatic:
- Quick HPL, 16 cores, with SM: ~19 GFlops
- Quick HPL, 16 cores, without SM: ~10 GFlops
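
For reference, the two timings above came from invocations along these lines
(the binary name and interface are placeholders; only the --mca switches are
as discussed in this thread):

# without shared memory (TCP loopback only)
mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -np 16 ./xhpl

# with shared memory (default BTL selection)
mpirun -np 16 ./xhpl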

Unfortunately, a recompiled HPL did not work, but it did give us more
information (error below). I'm still trying a couple of things.

A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        csclprd3-0-7
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
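
If I'm reading that last part correctly, the override would be appended to the
binding directive on the mpirun command line, roughly like this (core count
and binary are illustrative, and oversubscribing physical cores is presumably
not what we want for HPL anyway):

mpirun --bind-to core:overload-allowed -np 32 ./xhpl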

When using the SGE make parallel environment to submit jobs, everything worked
perfectly. I noticed that with the make PE, the number of slots allocated to
the job from each node corresponded to the number of CPUs, disregarding any
additional cores within a CPU and any hyperthreading cores.

Here are the definitions of the two parallel environments tested (orte always
fails when more slots are requested than there are CPU cores on the first node
allocated to the job by SGE):

[root@csclprd3 ~]# qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

[root@csclprd3 ~]# qconf -sp make
pe_name            make
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE
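
For context, the jobs were submitted roughly like this (script name and slot
count are placeholders; the actual job script isn't shown here), with the
script calling mpirun without a hostfile and relying on SGE tight integration
to pick up the allocated slots:

qsub -pe orte 32 hpl_job.sh
qsub -pe make 32 hpl_job.sh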

Although everything seems to work with the make PE, I'd still like to know
why, because on a much older version of OpenMPI, running on older versions of
CentOS, SGE and ROCKS, using all physical cores as well as all hyperthreads
was never a problem (even on NUMA nodes).

What is the recommended SGE parallel environment definition for
OpenMPI 1.8.2?

I apologize for the length of this, but I thought it best to provide more
information rather than less.

Thank you in advance,

-Bill Lane

________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres) 
[jsquy...@cisco.com]
Sent: Friday, August 08, 2014 5:25 AM
To: Open MPI User's List
Subject: Re: [OMPI users] Mpirun 1.5.4  problems when request > 28 slots

On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:

> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in 
> addition to
> the requirement to include the --mca btl_tcp_if_include eth0 switch). I 
> believe
> the "--mca btl tcp,self" switch limits inter-process communication within a 
> node to using the TCP
> loopback rather than shared memory.

Correct.  You will not be using shared memory for MPI communication at all -- 
just TCP.

> I should also point out that all of the nodes
> on this cluster feature NUMA architecture.
>
> Will using the "--mca btl tcp,self" switch to mpirun result in any degraded 
> performance
> issues over using shared memory?

Generally yes, but it depends on your application.  If your application does 
very little MPI communication, then the difference between shared memory and 
TCP is likely negligible.

I'd strongly suggest two things:

- Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
- Run your program through a memory-checking debugger such as Valgrind

Seg faults like you initially described can be caused by errors in your MPI 
application itself -- the fact that using TCP only (and not shared memory) 
avoids the segvs does not mean that the issue is actually fixed; it may well 
mean that the error is still there, but is happening in a case that doesn't 
seem to cause enough damage to cause a segv.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

