Big topic and actually the subject of much recent discussion. Here are a few comments:

1) "Optimally" depends on what you're doing. A big issue is making sure each MPI process gets as much memory bandwidth (and cache and other shared resources) as possible. This would argue that processes *should* be spread over as many sockets as possible. And, indeed, some MPIs default to this behavior. It depends on lots of things, including how much of the machine you're using.

2) Currently (1.3.2), there is rankfile support. This is probably a little more gruesome than you'd hope for. E.g., if you have multiple jobs, you need to custom-tailor the rankfile for each. Another heavy hammer might be to write scripts that, depending on the job, process rank, and so on, launch each MPI process using numactl. I'm not convinced you want to go that route, but at some level it offers the ability to do what you're asking for.
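
To sketch what I mean (hypothetical host name "node0", file names, and app name; do check the slot syntax against the FAQ for your version), a rankfile packing one 4-process job onto socket 0 would look roughly like:

    rank 0=node0 slot=0:0
    rank 1=node0 slot=0:1
    rank 2=node0 slot=0:2
    rank 3=node0 slot=0:3

launched with something like

    mpirun -np 4 -rf rankfile.job1 ./my_app

and the numactl route would be a small wrapper script along these lines (a sketch only; it assumes OMPI exports OMPI_COMM_WORLD_RANK to the launched processes, 4 cores per socket, and NUMA node numbers that match socket numbers):

    #!/bin/sh
    # numa_wrap.sh: bind this rank's CPUs and memory to "its" socket
    SOCKET=$(( ${OMPI_COMM_WORLD_RANK:-0} / 4 ))
    exec numactl --cpunodebind=$SOCKET --membind=$SOCKET "$@"

    mpirun -np 8 ./numa_wrap.sh ./my_app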

3) Soon (1.3.4?, or use the trunk) there should be richer support, including bind-to-socket, bind-to-core, etc. I happen to like bind-to-socket; it sounds like you prefer bind-to-core. Ralph's putbacks should make each of us happy. If multiple jobs are being launched, though, you may still find the functionality doesn't go as far as you'd like.
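
The exact option names may shift before this lands, but the usage should be along the lines of:

    # spread ranks across sockets and bind each rank to a whole socket
    mpirun -np 8 --bysocket --bind-to-socket ./my_app

    # pack ranks onto consecutive cores and bind each rank to one core
    mpirun -np 8 --bycore --bind-to-core ./my_app

(--report-bindings, or whatever it ends up being called, should show what actually got bound where.)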

4) The default placement behavior may depend on the OS, the BIOS (which numbers the cores), etc.
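
A quick way to see how the cores were numbered relative to the physical sockets on a given box:

    # processor = logical CPU number, physical id = socket it sits in
    grep -E 'processor|physical id|core id' /proc/cpuinfo

    # or, one line per core: which package (socket) it belongs to
    cat /sys/devices/system/cpu/cpu*/topology/physical_package_id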

Caveat: this note is hastily written with fuzzy knowledge of the status of all the subissues. Just a quick message to start what I think will in any case be a long e-mail thread.

A. Austen wrote:

Hello all.

I apologize if this has been addressed in the FAQ or on the mailing
list, but I spent a fair amount of time searching both and found no
direct answers.

I use OpenMPI, currently version 1.3.2, on an 8-way quad-core AMD
Opteron machine.  So 32 cores in total.  The computer runs a modern 2.6
family Linux kernel.  I don't at the present time use a resource manager
like SLURM, since there is at most one other user and we don't step on
each other's toes.

What I find is that when I launch MPI jobs, I don't see the processes
packed optimally onto the cores.  I think OMPI should try to place jobs
in such a way that the tasks fill up all four cores of one socket, then
as many cores as necessary on the next socket, and so on.

So for example, if I want to run 6 jobs, each of which needs 4
cores, I can see as I start the jobs up that the processes for each
job get distributed without regard to NUMA optimality -- 2 of them
might end up on socket A, 1 on socket B, and the fourth on socket C.
Since I have dynamic clocking enabled, I can check this by looking at
/proc/cpuinfo (seeing what the clock speeds are on each core when the
system is otherwise quiescent), or by using top with the per-processor
display turned on.
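
(For instance, something like this -- "my_app" being whatever the job's
executable is called -- shows in the PSR column the CPU each process
last ran on:

    ps -eo pid,psr,comm | grep my_app
)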

Obviously, in terms of maximizing performance, this is bad.  Once I
get up to, say, 5 of these 4-core jobs, I can see computational
throughput degrade heavily.  My hypothesis is that there is heavy
contention on the HyperTransport links.

I saw the processor and memory affinity options, but those seem to
address a different problem -- namely, keeping jobs pinned to specific
resources.  I want that too, but it's not the same issue as the one I
described above.

So, I guess I have several questions:

1. Is there any way to have OpenMPI automatically tell Linux via its
affinity and NUMA-related APIs that the OMPI jobs should be scheduled in
such a way that they fill the cores on particular sockets, and try to
use adjacent sockets?

2. I think a rankfile may be the way for me to address this issue, but
do I need a different rankfile for each job?  The FAQ shows the ability
to wildcard the "core" number/ID field.  Is there a way to wildcard the
socket field but not the core field -- that is, to tell OMPI that I
don't care which socket it chooses, but that the job should always be
mapped onto the cores of a single socket?  That wouldn't make sense for
a job using more cores than a socket has, but it would be useful for
jobs that fit on one socket.  For a job needing, say, more than 4
processes on a quad-core machine, it probably makes sense to tell OMPI
explicitly which sockets to use as well, to keep the number of
processor hops to a minimum.

3. If my understanding is correct and a rankfile will help me solve
this problem, can I safely turn on processor and memory affinity such
that the different OMPI jobs I launch by hand will not vie for the
same processor cores and memory regions?

Thank you.

