There are a few issues involved here:

- Brian was pointing out that AMD machines are NUMA (and Intel may well go NUMA someday -- unless something quite unexpected happens in computer architecture, UMA simply does not scale to hundreds of cores). So each core is *not* created equal, mainly in terms of locality to resources. If MPI allocates resources local to core X and you end up pinning yourself to core Y, and X and Y are not local to each other, you've just killed your performance because of the latency hit to reach the MPI- (or other-) allocated resources.

- If you're going to use Linux's sched_setaffinity(), beware that this function has changed signatures multiple times over the history of Linux (there are at least three versions that I'm aware of). Shameless plug: try the Portable Linux Processor Affinity (PLPA) micro-library, which provides a simple, consistent interface to Linux processor affinity regardless of your versions of the Linux kernel and glibc (http://www.open-mpi.org/software/plpa/). The library has nothing to do with MPI and can be used in any application that wants to use processor affinity.

- There's also the issue that some clusters -- particularly those set up with high-core-count hosts -- may well be configured to allow multiple MPI jobs to land on the same host. In that case, how does an MPI app know which cores to bind itself to? If every MPI job binds itself starting at core 0 and counting upward, multiple MPI jobs landing on the same host becomes a disaster.

- There's also the issue that the BIOS determines how sockets and cores map to Linux virtual processor IDs. Linux virtual processor 0 is always socket 0, core 0. But what is Linux virtual processor 1? Is it socket 0, core 1, or socket 1, core 0? This is quite complicated to figure out, and it can have large implications (particularly in NUMA environments).



On Nov 29, 2006, at 1:08 AM, Durga Choudhury wrote:

Brian

But does it matter which core the process gets bound to? They are all identical, and as long as the task is parallelized in equal chunks (that's the key part), it should not matter. The last time I had to do this, the problem involved real-time processing of a very large radar image. My approach was to spawn *ONE* MPI process per blade and 12 threads (to utilize the 12 processors). Inside the entry point of each pthread, I called sched_setaffinity(). Then I set the scheduling algorithm to real-time with a very high task priority to avoid preemption. It turns out the last two steps did not buy me much, because ours was a lean, embedded architecture designed to run real-time applications anyway, but I definitely got a speedup from the task distribution.

It sure would be very nice for Open MPI to have this feature; no question about that. All I am saying is: if a user wants it today, a reasonable workaround is available, so he/she does not need to wait.

This is my $0.01's worth, since I am probably a lot less experienced.

Durga


On 11/29/06, Brian W. Barrett <bbarr...@lanl.gov> wrote:

It would be difficult to do well without some MPI help, in my
opinion.  You certainly could use the Linux processor affinity API
directly in the MPI application.  But how would the process know
which core to bind to?  It could wait until after MPI_INIT and call
MPI_COMM_RANK, but MPI implementations allocate many of their
resources during MPI_INIT, so there's a high potential for those
resources (i.e., memory) to end up associated with a different processor
than the one the process gets pinned to.  That isn't a big deal on Intel
machines, but is a major issue for AMD processors.

Just my $0.02, anyway.

Brian

On Nov 28, 2006, at 6:09 PM, Durga Choudhury wrote:

> Jeff (and everybody else)
>
> First of all, pardon me if this is a stupid comment; I am learning
> the nuts and bolts of parallel programming. My comment is as
> follows:
>
> Why can't this be done *outside* openMPI, by calling Linux's
> processor affinity APIs directly? I work with a blade-server kind
> of architecture, where each blade has 12 CPUs. I use pthreads within
> each blade and MPI to talk across blades. I use the Linux system
> calls to attach a thread to a specific CPU and it seems to work
> fine. The only drawback is that it makes the code unportable to a
> different OS. But even if you implemented paffinity within Open MPI,
> the code would become unportable to a different implementation of
> MPI -- which, as it stands, it is not.
>
> Hope this helps to the original poster.
>
> Durga
>
>
> On 11/28/06, Jeff Squyres <jsquy...@cisco.com> wrote:
>
> There is not,
> right now.  However, this is mainly because back when I
> implemented the processor affinity stuff in OMPI (well over a year
> ago), no one had any opinions on exactly what interface to expose to
> the user.  :-)
>
> So right now there's only this lame control:
>
>       http://www.open-mpi.org/faq/?category=tuning#using-paffinity
>
> I am not opposed to implementing more flexible processor affinity
> controls, but the Big Discussion over the past few months is exactly
> how to expose it to the end user.  There have been several formats
> proposed (e.g., mpirun command line parameters, magic MPI attributes,
> MCA parameters, etc.), but nothing that has been "good" and "right".
> So here's the time to chime in -- anyone have any opinions on this?
>
>
>
> On Nov 25, 2006, at 9:31 AM, shap...@isp.nsc.ru wrote:
>
> > Hello,
> > I can't figure out whether there is a way with Open MPI to bind all
> > threads on a given node to a specified subset of CPUs.
> > For example, on a multi-socket, multi-core machine, I want to use
> > only a single core on each CPU.
> > Thank You.
> >
> > Best Regards,
> > Alexander Shaposhnikov
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
>
>
>
>
> --
> Devil wanted omnipresence;
> He therefore created communists.

--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory





--
Devil wanted omnipresence;
He therefore created communists.


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
