Hello,
I have a Mac with two quad-core Nehalem chips (8 cores). The sysctl
command shows 16 CPUs (apparently with hyperthreading). I have a finite
element code that runs in parallel using Open MPI. On this single
machine, running with -np 8 takes about two-thirds of the time that
running with -np 16 does. The program is very well optimized for
parallel processing, so I strongly suspect that hyperthreading is not
helping. It fairly aggressively uses 100% of each CPU it runs on, so I
don't think hyperthreading gets much of a chance to split the CPU
activity.
hardware engineer. I make sure that I don't ask for more processors
than there are physical cores and that seems to work.
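For what it's worth, you can compare physical and logical core counts on the Mac directly with sysctl (a minimal sketch; the solver name below is just a placeholder for your own executable):

  # physical cores vs. logical (hyperthreaded) cpus
  sysctl hw.physicalcpu hw.logicalcpu
  # launch one rank per physical core, e.g.
  mpirun -np 8 ./my_fe_solver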
Doug Reeder
On May 4, 2010, at 7:06 PM, Gus Correa wrote:
Hi Ralph
Thank you so much for your help.
You are right, paffinity is turned off (default):
**************
/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/ompi_info --param opal all | grep paffinity
  MCA opal: parameter "opal_paffinity_alone" (current value: "0",
            data source: default value, synonyms:
            mpi_paffinity_alone, mpi_paffinity_alone)
**************
I will try your suggestion to turn off HT tomorrow,
and report back here.
Douglas Guptill kindly sent a recipe to turn HT off via BIOS settings.
Cheers,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Ralph Castain wrote:
On May 4, 2010, at 4:51 PM, Gus Correa wrote:
Hi Ralph
Ralph Castain wrote:
One possibility is that the sm btl might not like that you have
hyperthreading enabled.
I remember that hyperthreading was discussed months ago,
in the previous incarnation of this problem/thread/discussion on
"Nehalem vs. Open MPI".
(It sounds like one of those supreme court cases ... )
I don't really administer that machine,
or any machine with hyperthreading,
so I am not very familiar with the HT nitty-gritty.
How do I turn off hyperthreading?
Is it a BIOS or a Linux thing?
I may try that.
I believe it can be turned off via an admin-level cmd, but I'm not certain about it.
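For what it's worth, on Linux the usual options seem to be the BIOS setting or taking the hyperthread siblings offline at runtime through sysfs (a sketch, assuming a kernel with CPU hotplug and root access; check the sibling list for your own topology before picking which cpus to offline):

  # show which logical cpus share a physical core with cpu0
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
  # take a hyperthread sibling (say cpu8) offline
  echo 0 > /sys/devices/system/cpu/cpu8/online
  # bring it back later
  echo 1 > /sys/devices/system/cpu/cpu8/online

Unlike the BIOS setting, this only lasts until the next reboot.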
Another thing to check: do you have any paffinity settings turned on
(e.g., mpi_paffinity_alone)?
I didn't turn on or off any paffinity setting explicitly,
either in the command line or in the mca config file.
All that I did on the tests was to turn off "sm",
or just use the default settings.
I wonder if paffinity is on by default, is it?
Should I turn it off?
It is off by default - I mention it because sometimes people have
it set in the default MCA param file and don't realize it is on.
Sounds okay here, though.
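As a concrete illustration, the kind of line to look for would be something like this (the file locations are the standard Open MPI ones; adjust the prefix to your install):

  # system-wide: $prefix/etc/openmpi-mca-params.conf
  # per-user:    ~/.openmpi/mca-params.conf
  # a line like this would quietly turn paffinity on:
  mpi_paffinity_alone = 1

If neither file has such a line, the default (0) applies.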
Our paffinity system doesn't handle hyperthreading at this time.
OK, so *if* paffinity is on by default (Is it?),
and hyperthreading is also on, as it is now,
I must turn off one of them, maybe both, right?
I may go combinatorial about this tomorrow.
Can't do it today.
Darn locked office door!
I would say don't worry about the paffinity right now - sounds like
it is off. You can always check, though, by running
"ompi_info --param opal all" and looking at the setting of the
opal_paffinity_alone variable.
I'm just suspicious of the HT since you have a quad-core machine,
and the limit where things work seems to be 4...
It may be.
If you tell me how to turn off HT (I'll google around for it meanwhile),
I will do it tomorrow, if I get a chance to hard reboot that pesky
machine now locked behind a door.
Yeah, I'm beginning to believe it is the HT that is causing the
problem...
Thanks again for your help.
Gus
On May 4, 2010, at 3:44 PM, Gus Correa wrote:
Hi Jeff
Sure, I will certainly try v1.4.2.
I am downloading it right now.
As of this morning, when I first downloaded,
the web site still had 1.4.1.
Maybe I should have refreshed the web page on my browser.
I will tell you how it goes.
Gus
Jeff Squyres wrote:
Gus -- Can you try v1.4.2 which was just released today?
On May 4, 2010, at 4:18 PM, Gus Correa wrote:
Hi Ralph
Thank you very much.
The "-mca btl ^sm" workaround seems to have solved the problem,
at least for the little hello_c.c test.
I just ran it fine up to 128 processes.
I confess I am puzzled by this workaround.
* Why should we turn off "sm" in a standalone machine,
where everything is supposed to operate via shared memory?
* Do I incur a performance penalty by not using "sm"?
* What other mechanism is actually used by OpenMPI for process
communication in this case?
It seems to be using tcp, because when I try -np 256 I get this error:
[spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
of network connections a process can open was reached in file
../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can be open
This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------
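For reference, either route mentioned in the message should work; a sketch (the exact descriptor limit needed depends on how many connections -np 256 opens, so the number below is only a guess):

  # check, then raise, the per-process open file descriptor limit (bash)
  ulimit -n
  ulimit -n 4096
  # or let Open MPI try to raise the limit itself
  mpirun -mca opal_set_max_sys_limits 1 -np 256 ./a.out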
Anyway, no big deal, because we don't intend to oversubscribe the
processors on real jobs anyway (and the error message itself suggests
a workaround to increase np, if needed).
Many thanks,
Gus Correa
Ralph Castain wrote:
I would certainly try -mca btl ^sm and see if that solves the problem.
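For completeness, the full command line would look something like this (using the hello_c binary from the earlier tests as a stand-in):

  # exclude the shared-memory BTL; the remaining BTLs (e.g. tcp, self) are used instead
  mpirun -mca btl ^sm -np 16 ./hello_c
  # or, equivalently, name only the transports you do want
  mpirun -mca btl tcp,self -np 16 ./hello_c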
On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
Gus Correa wrote:
Dear Open MPI experts
I need your help to get Open MPI right on a standalone
machine with Nehalem processors.
How to tweak the MCA parameters to avoid problems with Nehalem
(and perhaps AMD processors also), where MPI programs hang,
was discussed here before.
However, I lost track of the details of how to work around the problem,
and of whether it has already been fully fixed.
Yes, perhaps the problem you're seeing is not what you
remember being discussed.
Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043.
It's presumably fixed.
I am now facing the problem directly on a single Nehalem box.
I installed OpenMPI 1.4.1 from source,
and compiled the test hello_c.c with mpicc.
Then I tried to run it with:
1) mpirun -np 4 a.out
It ran OK (but seemed to be slow).
2) mpirun -np 16 a.out
It hung, and brought the machine to a halt.
Any words of wisdom are appreciated.
More info:
* OpenMPI 1.4.1 installed from source (tarball from your site).
* Compilers are gcc/g++/gfortran 4.4.3-4.
* OS is Fedora Core 12.
* The machine is a Dell box with Intel Xeon 5540 (quad core) processors
on a two-way motherboard and 48GB of RAM.
* /proc/cpuinfo indicates that hyperthreading is turned on.
(I can see 16 "processors".)
**
What should I do?
Use -mca btl ^sm ?
Use -mca btl_sm_num_fifos=some_number ? (Which number? See the sketch below.)
Use both?
Do something else?
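A sketch of the second option, in case it helps (treat the fifo count as an assumption to experiment with, e.g. one per local rank, not as a recommendation):

  # keep the sm BTL but give it more fifos
  mpirun -np 16 -mca btl_sm_num_fifos 16 ./a.out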