The "--bind-to none" switch didn't help; I'm still getting the same errors.

The only NUMA package installed on the nodes in this CentOS 6.2 cluster is the
following:

numactl-2.0.7-3.el6.x86_64

This package is described as: "numactl.x86_64 : Library for tuning for Non
Uniform Memory Access machines".

Since many of these systems are NUMA systems (with separate memory address
spaces for the sockets), could it be that the correct NUMA libraries aren't
installed?

Here are some of the other NUMA packages available for CentOS 6.x:

yum search numa | less

                Loaded plugins: fastestmirror
                Loading mirror speeds from cached hostfile
                ============================== N/S Matched: numa ===============================
                numactl-devel.i686 : Development package for building Applications that use numa
                numactl-devel.x86_64 : Development package for building Applications that use numa
                numad.x86_64 : NUMA user daemon
                numactl.i686 : Library for tuning for Non Uniform Memory Access machines
                numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
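
If it turns out that hwloc/NUMA support is what's missing, this is roughly how I
plan to check (only a sketch; -devel packages normally matter when compiling
rather than at run time, so this may well not be the culprit):

    # does the Open MPI 1.8.2 build report hwloc support?
    ompi_info | grep -i hwloc

    # what NUMA topology does the kernel see on a compute node?
    numactl --hardware

    # install the development headers on the nodes, in case a rebuild is needed
    yum install numactl-devel.x86_64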

-Bill Lane
________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Reuti [re...@staff.uni-marburg.de]
Sent: Thursday, August 28, 2014 3:27 AM
To: Open MPI Users
Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

On 28.08.2014, at 10:09, Lane, William wrote:

> I have some updates on these issues and some test results as well.
>
> We upgraded OpenMPI to the latest version, 1.8.2, but when submitting jobs via
> the SGE orte parallel environment we received errors whenever more slots are
> requested than there are actual cores on the first node allocated to the job.

Does "-bind-to none" help? The binding is switched on by default in Open MPI 
1.8 onwards.
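
For illustration only (the binary name and slot count are placeholders), turning
binding off and checking what was actually applied would look something like:

    mpirun --bind-to none --report-bindings -np 32 ./xhpl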


> The "btl tcp,self" switch passed to mpirun made a significant difference in
> performance, as shown below:
>
> Even with the oversubscribe option, the memory-mapping errors still persist.
> On 32-core nodes, with an HPL build compiled against openmpi/1.8.2, it reliably
> starts failing once 20 cores are allocated. Note that I tested with "btl
> tcp,self" defined, and on a quick solve it slows the solve down by roughly a
> factor of 2; the results on a larger solve would probably be more dramatic:
> - Quick HPL, 16 cores with SM: ~19 GFlops
> - Quick HPL, 16 cores without SM: ~10 GFlops
>
> Unfortunately, a recompiled HPL did not work, but it did give us more
> information (error below). Still trying a couple of things.
>
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>   Bind to:     CORE
>   Node:        csclprd3-0-7
>   #processes:  2
>   #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
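>
> A hedged sketch of what that override (and the related option that lets Open
> MPI treat hyperthreads as CPUs) might look like - the binary name and slot
> count are placeholders:
>
>   mpirun --bind-to core:overload-allowed -np 32 ./xhpl
>   mpirun --use-hwthread-cpus -np 32 ./xhpl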
>
> When using the SGE make parallel environment to submit jobs, everything worked
> perfectly. I noticed that when using the make PE, the number of slots allocated
> to the job from each node corresponded to the number of CPUs and disregarded
> any additional cores within a CPU as well as any hyperthreading cores.

For SGE, the hyperthreading cores count as normal cores. In principle it's
possible to have an RQS defined in SGE (`qconf -srqsl`) which limits the number
of cores for the "make" PE, or (better) to limit the slots in each exechost
definition to the physically installed cores (this is what I usually set up -
leaving hyperthreading switched on may give the kernel processes some headroom
this way).
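
As a rough illustration only (the host name is taken from your output, and the
slot count of 16 is just a guess at the number of physical cores - adjust both):

    # cap the slots on one exechost at its physical core count
    qconf -mattr exechost complex_values slots=16 csclprd3-0-7

    # or, as a resource quota set (qconf -arqs), something along these lines:
    {
       name         physical_cores_only
       enabled      TRUE
       limit        hosts {*} to slots=16
    }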


> Here are the definitions of the two parallel environments tested (with orte
> always failing when more slots are requested than there are CPU cores on the
> first node allocated to the job by SGE):
>
> [root@csclprd3 ~]# qconf -sp orte
> pe_name            orte
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
> qsort_args         NONE
>
> [root@csclprd3 ~]# qconf -sp make
> pe_name            make
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
> qsort_args         NONE
>
> Although everything seems to work with the make PE, I'd still like to know why,
> because on a much older version of openMPI running on an older version of
> CentOS, SGE and ROCKS, using all physical cores as well as all hyperthreads was
> never a problem (even on NUMA nodes).
>
> What is the recommended SGE parallel environment definition for
> OpenMPI 1.8.2?

Whether you prefer $fill_up or $round_robin is up to you - would you rather have
all your processes on the smallest number of machines, or spread around the
cluster? If there is a lot of communication it may be better to use fewer
machines, but if each process does heavy I/O to the local scratch disk,
spreading them around may be the preferred choice. It makes no difference to
Open MPI either way, as the generated $PE_HOSTFILE contains just the list of
granted slots. Allocating in $fill_up style will of course fill the first node,
including the hyperthreading cores, before moving on to the next machine
(`man sge_pe`).
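
For reference, the $PE_HOSTFILE is just a plain list of "host  slots  queue
processor-range" lines; the values below are purely illustrative (the queue name
and slot counts are made up):

    csclprd3-0-7 32 all.q@csclprd3-0-7 UNDEFINED
    csclprd3-0-8 8 all.q@csclprd3-0-8 UNDEFINED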

-- Reuti


> I apologize for the length of this, but I thought it best to provide more
> information rather than less.
>
> Thank you in advance,
>
> -Bill Lane
>
> ________________________________________
> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres) [jsquy...@cisco.com]
> Sent: Friday, August 08, 2014 5:25 AM
> To: Open MPI User's List
> Subject: Re: [OMPI users] Mpirun 1.5.4  problems when request > 28 slots
>
> On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:
>
>> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in 
>> addition to
>> the requirement to include the --mca btl_tcp_if_include eth0 switch). I 
>> believe
>> the "--mca btl tcp,self" switch limits inter-process communication within a 
>> node to using the TCP
>> loopback rather than shared memory.
>
> Correct.  You will not be using shared memory for MPI communication at all -- 
> just TCP.
>
>> I should also point out that all of the nodes on this cluster feature a NUMA
>> architecture.
>>
>> Will using the "--mca btl tcp,self" switch to mpirun result in any degraded
>> performance compared to using shared memory?
>
> Generally yes, but it depends on your application.  If your application does 
> very little MPI communication, then the difference between shared memory and 
> TCP is likely negligible.
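>
> For example (process count and binary name are placeholders), the two variants
> being compared would look roughly like:
>
>   # TCP only, even between ranks on the same node
>   mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -np 16 ./xhpl
>
>   # allow shared memory within a node, TCP between nodes
>   mpirun --mca btl sm,tcp,self -np 16 ./xhpl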
>
> I'd strongly suggest two things:
>
> - Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
> - Run your program through a memory-checking debugger such as Valgrind
>
> Seg faults like the ones you initially described can be caused by errors in
> your MPI application itself -- the fact that using TCP only (and not shared
> memory) avoids the segvs does not mean that the issue is actually fixed; it may
> well mean that the error is still there, but is happening in a case that
> doesn't do enough damage to cause a segv.
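>
> A minimal way to do the latter (the process count and binary name are
> placeholders) is to launch each rank under Valgrind:
>
>   mpirun -np 4 valgrind ./my_mpi_app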
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
