Hello,
I have a Mac with two quad-core Nehalem chips (8 cores). The sysctl
command shows 16 CPUs (apparently with hyperthreading). I have a finite
element code that runs in parallel using Open MPI. On this single
machine, running with -np 8 takes about two-thirds of the time that
running with -np 16 does. The program is very well optimized for
parallel processing, so I strongly suspect that hyperthreading is not
helping. It fairly aggressively uses 100% of each CPU it runs on, so I
don't think hyperthreading gets much of a chance to split the CPU
activity.
hardware engineer. I make sure that I don't ask for more processors
than there are physical cores and that seems to work.
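For what it's worth, you can compare physical and logical core counts on the Mac directly with sysctl (a minimal sketch; the solver name below is just a placeholder for your own executable):

  # physical cores vs. logical (hyperthreaded) cpus
  sysctl hw.physicalcpu hw.logicalcpu
  # launch one rank per physical core, e.g.
  mpirun -np 8 ./my_fe_solver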
Doug Reeder
On May 4, 2010, at 7:06 PM, Gus Correa wrote:
Hi Ralph
Thank you so much for your help.
You are right, paffinity is turned off (default):
**************
/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/ompi_info --param opal all | grep paffinity
  MCA opal: parameter "opal_paffinity_alone" (current value: "0",
            data source: default value, synonyms:
            mpi_paffinity_alone, mpi_paffinity_alone)
**************
I will try your suggestion to turn off HT tomorrow,
and report back here.
Douglas Guptill kindly sent a recipe to turn HT off via BIOS settings.
Cheers,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Ralph Castain wrote:
On May 4, 2010, at 4:51 PM, Gus Correa wrote:
Hi Ralph
Ralph Castain wrote:
One possibility is that the sm btl might not like that you have
hyperthreading enabled.
I remember that hyperthreading was discussed months ago,
in the previous incarnation of this problem/thread/discussion on
"Nehalem vs. Open MPI".
(It sounds like one of those supreme court cases ... )
I don't really administer that machine,
or any machine with hyperthreading,
so I am not very familiar with the HT nitty-gritty.
How do I turn off hyperthreading?
Is it a BIOS or a Linux thing?
I may try that.
I believe it can be turned off via an admin-level cmd, but I'm not certain about it.
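For what it's worth, on Linux the usual options seem to be the BIOS setting or taking the hyperthread siblings offline at runtime through sysfs (a sketch, assuming a kernel with CPU hotplug and root access; check the sibling list for your own topology before picking which cpus to offline):

  # show which logical cpus share a physical core with cpu0
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
  # take a hyperthread sibling (say cpu8) offline
  echo 0 > /sys/devices/system/cpu/cpu8/online
  # bring it back later
  echo 1 > /sys/devices/system/cpu/cpu8/online

Unlike the BIOS setting, this only lasts until the next reboot.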
Another thing to check: do you have any paffinity settings turned on
(e.g., mpi_paffinity_alone)?
I didn't turn on or off any paffinity setting explicitly,
either in the command line or in the mca config file.
All that I did on the tests was to turn off "sm",
or just use the default settings.
I wonder if paffinity is on by default, is it?
Should I turn it off?
It is off by default - I mention it because sometimes people have
it set in the default MCA param file and don't realize it is on.
Sounds okay here, though.
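As a concrete illustration, the kind of line to look for would be something like this (the file locations are the standard Open MPI ones; adjust the prefix to your install):

  # system-wide: $prefix/etc/openmpi-mca-params.conf
  # per-user:    ~/.openmpi/mca-params.conf
  # a line like this would quietly turn paffinity on:
  mpi_paffinity_alone = 1

If neither file has such a line, the default (0) applies.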
Our paffinity system doesn't handle hyperthreading at this time.
OK, so *if* paffinity is on by default (Is it?),
and hyperthreading is also on, as it is now,
I must turn off one of them, maybe both, right?
I may go combinatorial about this tomorrow.
Can't do it today.
Darn locked office door!
I would say don't worry about the paffinity right now - sounds like
it is off. You can always check, though, by running
"ompi_info --param opal all" and looking at the setting of the
opal_paffinity_alone variable.
I'm just suspicious of the HT since you have a quad-core machine,
and the limit where things work seems to be 4...
It may be.
If you tell me how to turn off HT (I'll google around for it meanwhile),
I will do it tomorrow, if I get a chance to hard reboot that pesky
machine now locked behind a door.
Yeah, I'm beginning to believe it is the HT that is causing the
problem...
Thanks again for your help.
Gus
On May 4, 2010, at 3:44 PM, Gus Correa wrote:
Hi Jeff
Sure, I will certainly try v1.4.2.
I am downloading it right now.
As of this morning, when I first downloaded,
the web site still had 1.4.1.
Maybe I should have refreshed the web page on my browser.
I will tell you how it goes.
Gus
Jeff Squyres wrote:
Gus -- Can you try v1.4.2 which was just released today?
On May 4, 2010, at 4:18 PM, Gus Correa wrote:
Hi Ralph
Thank you very much.
The "-mca btl ^sm" workaround seems to have solved the problem,
at least for the little hello_c.c test.
I just ran it fine up to 128 processes.
I confess I am puzzled by this workaround.
* Why should we turn off "sm" in a standalone machine,
where everything is supposed to operate via shared memory?
* Do I incur a performance penalty by not using "sm"?
* What other mechanism is actually used by OpenMPI for process
communication in this case?
It seems to be using tcp, because when I try -np 256 I get this error:
[spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
of network connections a process can open was reached in file
../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can be open
This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------
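For reference, either route mentioned in the message should work; a sketch (the exact descriptor limit needed depends on how many connections -np 256 opens, so the number below is only a guess):

  # check, then raise, the per-process open file descriptor limit (bash)
  ulimit -n
  ulimit -n 4096
  # or let Open MPI try to raise the limit itself
  mpirun -mca opal_set_max_sys_limits 1 -np 256 ./a.out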
Anyway, no big deal, because we don't intend to oversubscribe the
processors on real jobs anyway (and the error message itself suggests
a workaround to increase np, if needed).
Many thanks,
Gus Correa
Ralph Castain wrote:
I would certainly try -mca btl ^sm and see if that solves the problem.
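For completeness, the full command line would look something like this (using the hello_c binary from the earlier tests as a stand-in):

  # exclude the shared-memory BTL; the remaining BTLs (e.g. tcp, self) are used instead
  mpirun -mca btl ^sm -np 16 ./hello_c
  # or, equivalently, name only the transports you do want
  mpirun -mca btl tcp,self -np 16 ./hello_c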
On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
Gus Correa wrote:
Dear Open MPI experts
I need your help to get Open MPI right on a standalone
machine with Nehalem processors.
How to tweak the MCA parameters to avoid problems with Nehalem
(and perhaps AMD processors also), where MPI programs hang,
was discussed here before.
However, I lost track of the details of how to work around the problem,
and of whether it has already been fully fixed.
Yes, perhaps the problem you're seeing is not what you
remember being discussed.
Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043.
It's presumably fixed.
I am now facing the problem directly on a single Nehalem box.
I installed OpenMPI 1.4.1 from source,
and compiled the test hello_c.c with mpicc.
Then I tried to run it with:
1) mpirun -np 4 a.out
It ran OK (but seemed to be slow).
2) mpirun -np 16 a.out
It hung, and brought the machine to a halt.
Any words of wisdom are appreciated.
More info:
* OpenMPI 1.4.1 installed from source (tarball from your site).
* Compilers are gcc/g++/gfortran 4.4.3-4.
* OS is Fedora Core 12.
* The machine is a Dell box with Intel Xeon 5540 (quad core) processors
on a two-way motherboard and 48GB of RAM.
* /proc/cpuinfo indicates that hyperthreading is turned on.
(I can see 16 "processors".)
**
What should I do?
Use -mca btl ^sm ?
Use -mca btl_sm_num_fifos=some_number ? (Which number? See the sketch below.)
Use both?
Do something else?
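A sketch of the second option, in case it helps (treat the fifo count as an assumption to experiment with, e.g. one per local rank, not as a recommendation):

  # keep the sm BTL but give it more fifos
  mpirun -np 16 -mca btl_sm_num_fifos 16 ./a.out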