Big topic and actually the subject of much recent discussion. Here are a few comments:

1) "Optimally" depends on what you're doing. A big issue is making sure each MPI process gets as much memory bandwidth (and cache and other shared resources) as possible. This would argue that processes *should* be spread over as many sockets as possible. And, indeed, some MPIs default to this behavior. It depends on lots of things, including how much of the machine you're using.

2) Currently (1.3.2), there is rankfile support. This is probably a little more gruesome than you'd hope for. E.g., if you have multiple jobs, you need to custom-tailor the rankfile for each. Another heavy hammer might be to write scripts that, depending on the job, process rank, and so on, launch each MPI process using numactl. I'm not convinced you want to go that route, but at some level it offers the ability to do what you're asking for.
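
To sketch what I mean (hypothetical host name "node0", file names, and app name; do check the slot syntax against the FAQ for your version), a rankfile packing one 4-process job onto socket 0 would look roughly like:

    rank 0=node0 slot=0:0
    rank 1=node0 slot=0:1
    rank 2=node0 slot=0:2
    rank 3=node0 slot=0:3

launched with something like

    mpirun -np 4 -rf rankfile.job1 ./my_app

and the numactl route would be a small wrapper script along these lines (a sketch only; it assumes OMPI exports OMPI_COMM_WORLD_RANK to the launched processes, 4 cores per socket, and NUMA node numbers that match socket numbers):

    #!/bin/sh
    # numa_wrap.sh: bind this rank's CPUs and memory to "its" socket
    SOCKET=$(( ${OMPI_COMM_WORLD_RANK:-0} / 4 ))
    exec numactl --cpunodebind=$SOCKET --membind=$SOCKET "$@"

    mpirun -np 8 ./numa_wrap.sh ./my_app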

3) Soon (1.3.4?, or use the trunk) there should be richer support, including bind-to-socket, bind-to-core, etc. I happen to like bind-to-socket; it sounds like you prefer bind-to-core. Ralph's putbacks should make each of us happy. If multiple jobs are being launched, though, you may still find the functionality doesn't go as far as you'd like.
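
The exact option names may shift before this lands, but the usage should be along the lines of:

    # spread ranks across sockets and bind each rank to a whole socket
    mpirun -np 8 --bysocket --bind-to-socket ./my_app

    # pack ranks onto consecutive cores and bind each rank to one core
    mpirun -np 8 --bycore --bind-to-core ./my_app

(--report-bindings, or whatever it ends up being called, should show what actually got bound where.)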

4) The default placement behavior may depend on the OS, the BIOS (which numbers the cores), etc.
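
A quick way to see how the cores were numbered relative to the physical sockets on a given box:

    # processor = logical CPU number, physical id = socket it sits in
    grep -E 'processor|physical id|core id' /proc/cpuinfo

    # or, one line per core: which package (socket) it belongs to
    cat /sys/devices/system/cpu/cpu*/topology/physical_package_id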

Caveat: this note is hastily written with fuzzy knowledge of the status of all the subissues. Just a quick message to start what I think will in any case be a long e-mail thread.

A. Austen wrote:

Hello all.

I apologize if this has been addressed in the FAQ or on the mailing
list, but I spent a fair amount of time searching both and found no
direct answers.

I use OpenMPI, currently version 1.3.2, on an 8-way quad-core AMD
Opteron machine.  So 32 cores in total.  The computer runs a modern 2.6
family Linux kernel.  I don't at the present time use a resource manager
like SLURM, since there is at most one other user and we don't step on
each other's toes.

What I find is that when I launch MPI jobs, I don't see the processes
packed optimally onto the cores.  I think OMPI should try to place jobs
in such a way that the tasks fill up all four cores of one socket, then
as many cores as necessary on the next socket, and so on.

So for example, if I want to run 6 jobs, each of which needs 4
cores, I can see as I start the jobs up that the processes for each
job get distributed without regard to NUMA optimality -- 2 of them
might end up on socket A, 1 on socket B, and the fourth on socket C.
Since I have dynamic clocking enabled, I can check this by looking at
/proc/cpuinfo (seeing what the clock speeds are on each core when the
system is otherwise quiescent), or by using top with the per-processor
display turned on.
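
(For instance, something like this -- "my_app" being whatever the job's
executable is called -- shows in the PSR column the CPU each process
last ran on:

    ps -eo pid,psr,comm | grep my_app
)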

Obviously, in terms of maximizing performance, this is bad.  Once I
get up to, say, 5 of these 4-core jobs, I can see computational
throughput degrade heavily.  My hypothesis is that there is heavy
contention on the HyperTransport links.

I saw the processor and memory affinity options, but those seem to
address a different problem -- namely, keeping jobs pinned to specific
resources.  I want that too, but it's not the same issue as the one I
described above.

So, I guess I have several questions:

1. Is there any way to have OpenMPI automatically tell Linux via its
affinity and NUMA-related APIs that the OMPI jobs should be scheduled in
such a way that they fill the cores on particular sockets, and try to
use adjacent sockets?

2. I think a rankfile may be the way for me to address this issue, but
do I need a different rankfile for each job?  The FAQ shows the ability
to wildcard the "core" number/ID field.  Is there a way to wildcard the
socket field but not the core field -- that is, to tell OMPI that I
don't care which socket it chooses, but that the job should always be
mapped onto the cores of a single socket?  That wouldn't make sense for
a job using more cores than a socket has, but it would be useful for
jobs that fit on one socket.  For a job needing, say, more than 4
processes on a quad-core machine, it probably makes sense to tell OMPI
explicitly which sockets to use as well, to keep the number of
processor hops to a minimum.

3. If my understanding is correct and a rankfile will help me solve
this problem, can I safely turn on processor and memory affinity such
that the different OMPI jobs I launch by hand will not vie for the
same processor cores and memory regions?

Thank you.

