Hi Jeff

Thanks for jumping in!  :)
And for your clarifications too, of course.

How does the efficiency of loopback
(let's say, over TCP and over IB) compare with "sm"?
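(For a concrete comparison, I suppose one could pin each run to a
single transport explicitly. A sketch, assuming OpenMPI 1.3.x mpirun
syntax and some latency benchmark; I use OSU's osu_latency here only
as an example name:)

```shell
# Shared memory only ("self" is needed for a rank talking to itself):
mpirun -np 2 --mca btl sm,self ./osu_latency

# TCP only, which on a single standalone node goes through loopback:
mpirun -np 2 --mca btl tcp,self ./osu_latency
```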

FYI, I do NOT see the problem reported by Matthew et al.
on our AMD Opteron Shanghai dual-socket quad-core.
It runs a rather outdated
CentOS kernel (2.6.18-92.1.22.el5), with gcc 4.1.2
and OpenMPI 1.3.2.
(I've been too lazy to upgrade; it is a production machine.)

I could run all three OpenMPI test programs (hello_c, ring_c,
and connectivity_c) on all 8 cores of a single node
WITH "sm" turned ON, with no problem whatsoever.
(I also had IB turned on, but I can run again
with sm only if you think this can make a difference.)
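(For reference, the sm-only runs would look something like this; a
sketch, assuming the compiled test programs sit in the current
directory:)

```shell
# Restrict OpenMPI to shared memory plus "self", excluding IB and TCP:
mpirun -np 8 --mca btl sm,self ./hello_c
mpirun -np 8 --mca btl sm,self ./ring_c
mpirun -np 8 --mca btl sm,self ./connectivity_c
```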

Moreover, all works fine if I oversubscribe up to 256 processes
on one node.
Beyond that I get a segmentation fault (not a hang) sometimes,
but not always.
I understand that extreme oversubscription is a no-no.
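(For what it's worth, the oversubscribed runs were along these lines.
A sketch; setting mpi_yield_when_idle explicitly is optional, since
OpenMPI switches to degraded/yielding mode on its own when it detects
oversubscription:)

```shell
# 256 processes on one 8-core node; yield-when-idle eases the thrashing:
mpirun -np 256 --mca mpi_yield_when_idle 1 ./connectivity_c
```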

Moreover, in the screenshots that Matthew posted, the cores
were at 100% CPU utilization on the simple connectivity_c test
(although that was when he had "sm" turned on, on Nehalem).
On my platform I don't get anything more than 3% or so.

Matthew: What levels of CPU utilization do you see now?

My two speculative cents.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Jeff Squyres wrote:
> On Dec 10, 2009, at 5:01 PM, Gus Correa wrote:
>
>> A couple of questions to the OpenMPI pros:
>> If shared memory ("sm") is turned off on a standalone computer,
>> which mechanism is used for MPI communication?
>> TCP via loopback port?  Other?
>
> Whatever device supports node-local loopback.  TCP is one; some
> OpenFabrics devices do, too.
>
>> Why wouldn't shared memory work right on Nehalem?
>> (That is probably distressing for Mark, Matthew, and other Nehalem owners.)
>
> To be clear, we don't know that this is a Nehalem-specific problem.  We
> actually thought it was an AMD-specific problem, but these results are
> interesting.  We've had a notoriously difficult time reproducing the
> problem reliably, which is why it hasn't been fixed yet.  :-(
>
> The best luck so far in reproducing the problem has been with GCC 4.4.x
> (at Sun).  I've been trying for a few days to install GCC 4.4 on my
> machines without much luck yet.  Still working on it...
