Hi Ralph

Thank you.
Yes, I will give the user/researcher that TCP solution for now,
because he needs to start running his model with Open MPI.
He bought a brand-new super-duper machine, with two-way Nehalem,
48GB RAM, etc., and so far he hasn't been able to do any work,
which is frustrating.

I googled around and found a few references to those kernel
errors reported by dmesg: "*BAD*gran_size".
Actually, I don't really know if these kernel messages are
the reason for the problems I had with Open MPI.
In any case, the issue traces back
to "mtrr" (http://en.wikipedia.org/wiki/Memory_Type_Range_Registers):

dmesg | grep mtrr
mtrr_cleanup: can not find optimal value
please specify mtrr_gran_size/mtrr_chunk_size
mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
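For what it is worth, the current MTRR layout can also be inspected
directly (a read-only check; /proc/mtrr should exist on any x86_64
kernel built with MTRR support):

cat /proc/mtrr
# one line per register: base address, size, and caching type
# (write-back, write-combining, uncachable); the d0000000 region in the
# dmesg line above is presumably the graphics aperture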

Eventually it traces back to X, graphics cards, and how they use memory.

There are some diagnostics and solutions proposed here:

http://osdir.com/ml/fedora-list/2009-04/msg01308.html

1) Don't start X.  I tried runlevel 3, but the same problems happened.

2) Install the NVidia driver if your graphics card is NVidia.  It is
NVidia.  I downloaded the driver, but it won't install in FC-12,
at least not out of the box.

3) Add the kernel parameter "enable_mtrr_cleanup" to grub.conf.
I tried it, but no dice: nothing changed.  (A sketch of what I mean
is just below.)
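To be concrete about item 3, this is the kind of kernel line I mean in
/boot/grub/grub.conf.  A sketch only: the vmlinuz name and root= argument
are guesses pieced together from the uname and df output further down,
and the 64K/16M pair is just one of the combinations that the dmesg
table below reports as losing 0G of coverage:

kernel /vmlinuz-2.6.32.11-99.fc12.x86_64 ro root=/dev/mapper/vg_spinoza-lv_root enable_mtrr_cleanup mtrr_gran_size=64K mtrr_chunk_size=16M

(So far I only tried the plain enable_mtrr_cleanup part, without
forcing mtrr_gran_size/mtrr_chunk_size.)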

I stopped here because I am not sure I will get anywhere down this road,
or even if the OpenMPI problem on Nehalem is related to mtrr.

I am afraid this may be something screwed up really deep in the
FC-12 kernel.
(I am using 2.6.32.11-99.fc12.x86_64 #1 SMP x86_64 GNU/Linux.)

All the web pointers I found refer to Fedora 9, 10, 11, and 12.
I didn't find any references to those kernel errors on other
Linux distros.
On our CentOS cluster there are no such kernel error messages.

OTOH, Jeff says he has Open MPI running with HT turned on,
"sm" turned on, etc on his Nehalem system.
Would Jeff be willing to disclose which
Linux distribution and kernel he uses on those boxen, perchance? :)
(Even in an off list email, if preferred, perhaps.)

If anybody else has Open MPI working with hyperthreading and "sm"
on a Nehalem box, I would appreciate any information about the
Linux distro and kernel version being used.
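For anyone willing to share, something like this should report both
in one go (assuming the usual /etc/*release files exist on your distro):

uname -r ; cat /etc/*release
# prints the running kernel version and the distribution name/version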

I don't administer the machine, but I would have stronger reasons
to convince the sys admin to switch from Fedora to
something else (CentOS, Ubuntu, Debian, whatever)
and see if it fixes the problem.
In the mid term, this would be a better solution
than running Open MPI over TCP.
At least it may be worth trying.

Thank you,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Ralph Castain wrote:
I saw similar issues in my former life when we encountered a Linux "glitch" in 
the way it handled proximity for shared memory - caused lockups under certain conditions. 
Turned out the problem was fixed in a later kernel version.

Afraid I can't remember the versions involved any more, though....

Unless speed is a critical issue, I'd fall back to using TCP for now, maybe 
have someone over there look at a different kernel rev later.


On May 5, 2010, at 11:30 AM, Gus Correa wrote:

Hi Jeff, Ralph, list.

Sorry for the long email, and the delay to answer.
I had to test MPI and reboot the machine several times
to address the questions.
Answers to all of them are inline below.

Jeff Squyres wrote:
I'd actually be a little surprised if HT was the problem.  I run with HT 
enabled on my nehalem boxen all the time.  It's pretty surprising that Open MPI 
is causing a hard lockup of your system; user-level processes shouldn't be able 
to do that.
I hope I can do the same here!  :)

Notes:
1. With HT enabled, as you noted, Linux will just see 2x as many cores as you 
actually have.  Depending on your desired workload, this may or may not help 
you.  But that shouldn't affect the correctness of running your MPI application.
I agree and that is what I seek.
Correctness first, performance later.
I want OpenMPI to work correctly, with or without hyperthreading,
and preferably using the "sm" BTL.
In that order: let's see what is possible, what works, and what performs better.

***

Reporting the most recent experiments with v1.4.2,
1) hyperthreading turned ON,
2) then HT turned OFF, on the BIOS.

In both cases I tried
A) "-mca btl ^sm" and
B) without it.

(Just in case, I checked and /proc/cpuinfo reports a number of cores
consistent with the BIOS setting for HT.)
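In case it helps somebody reproduce the check, this is roughly what
I looked at:

grep -c '^processor' /proc/cpuinfo              # logical CPUs Linux sees
grep -E 'cpu cores|siblings' /proc/cpuinfo | sort -u
# with HT ON, "siblings" is twice "cpu cores"; with HT OFF they match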

Details below, but first off,
my conclusion is that HT OFF or ON makes *NO difference*.
The problem seems to be with the "sm" btl.
When "sm" is on (default) OpenMPI breaks (at least on this computer).

################################
1) With hyperthreading turned ON:
################################

A) with -mca btl ^sm (i.e. "sm" OFF):
Ran fine with 4, 8, ..., 128 processes and failed with 256,
due to system limit on the number of open TCP connections,
as reported before with 1.4.1.
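Concretely, a case A run looked like this (the mpiexec path is the one
from the transcripts below; -np went from 4 up to 256):

/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -mca btl ^sm -np 128 a.out
# "^sm" excludes the shared-memory BTL, so traffic goes over tcp/self instead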

B) withOUT any -mca parameters (i.e. "sm" ON):
Ran fine with 4,...,32, but failed with 64 processes,
with the same segfault and syslog error messages I reported
before for both 1.4.1 and 1.4.2.
(see below)

Of course np=64 is oversubscribing, but this is just a "hello world"
lightweight test.
Moreover, in the previous experiments with both 1.4.1 and 1.4.2
the failures happened even earlier, with np = 16, which is exactly
the number of (virtual) processors with hyperthreading turned on,
i.e., with no oversubscription.

The machine returns the prompt, but hangs right after.

Could the failures be traced to some funny glitch in the
Fedora Core 12 (2.6.32.11-99.fc12.x86_64) SMP kernel?

[gus@spinoza ~]$ uname -a
Linux spinoza.ldeo.columbia.edu 2.6.32.11-99.fc12.x86_64 #1 SMP Mon Apr 5 19:59:38 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux


********
ERROR messages:

/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -np 64 a.out

Message from syslogd@spinoza at May  4 22:28:15 ...
kernel:------------[ cut here ]------------

Message from syslogd@spinoza at May  4 22:28:15 ...
kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd@spinoza at May  4 22:28:15 ...
kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id

Message from syslogd@spinoza at May  4 22:28:15 ...
kernel:Stack:
--------------------------------------------------------------------------
mpiexec noticed that process rank 63 with PID 6587 on node 
spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Message from syslogd@spinoza at May  4 22:28:15 ...
kernel:Call Trace:

Message from syslogd@spinoza at May  4 22:28:15 ...
kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 
e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 
48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01

************
################################
2) Now with hyperthreading OFF:
################################

A) with -mca btl ^sm (i.e. "sm" OFF):
Ran fine with 4, 8, ..., 128 processes and failed with 256,
due to system limit on the number of open TCP connections,
as reported before with 1.4.1.
This is exactly the same result as with HT ON.

B) withOUT any -mca parameters (i.e. "sm" ON):
Ran fine with 4,...,32, but failed with 64 processes,
with the same syslog messages, but hung before showing
the Open MPI segfault message (see below).
So, again, very similar behavior to what I saw with HT ON.

-------------------------------------------------------
My conclusion is that HT OFF or ON makes NO difference.
The problem seems to be with the "sm" btl.
-------------------------------------------------------

***********
ERROR MESSAGES

[root@spinoza examples]# /opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -np 64 a.out

Message from syslogd@spinoza at May  5 12:04:05 ...
kernel:------------[ cut here ]------------

Message from syslogd@spinoza at May  5 12:04:05 ...
kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd@spinoza at May  5 12:04:05 ...
kernel:last sysfs file: /sys/devices/system/cpu/cpu7/topology/physical_package_id

Message from syslogd@spinoza at May  5 12:04:05 ...
kernel:Stack:

Message from syslogd@spinoza at May  5 12:04:05 ...
kernel:Call Trace:




***********
2. To confirm: yes, TCP will be quite a bit slower than sm (but again, that depends on how much MPI traffic you're sending).
Thank you, the clarification is really important.
I suppose then that "sm" is preferred, if I can get it to work right.

The main goal is to run yet another atmospheric model on this machine.
It is a typical domain decomposition problem,
with a bunch of 2D arrays being exchanged
across domain boundaries at each time step.
This is the MPI traffic.
There are probably some collectives too,
but I haven't checked out the code.

3. Yes, you can disable the 2nd thread on each core via Linux, but you need 
root-level access to do it.
I have root-level access.
However, so far I only learned the BIOS way, which requires a reboot.

Doing it in Linux would be more convenient, avoiding reboots,
I suppose.
How do I do it in Linux?
Should I overwrite something in /proc, or somewhere else?
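Would it be something along these lines?  (A sketch only; I am assuming
cpus 8-15 are the HT siblings of cores 0-7, which thread_siblings_list
should confirm first.)

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. "0,8"
for n in $(seq 8 15); do echo 0 > /sys/devices/system/cpu/cpu$n/online; done
# echo 1 > .../online brings a logical cpu back; needs root, no reboot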

Some questions:
- is the /tmp directory on your local disk?
Yes.
And there is plenty of room in the / filesystem and the
/tmp directory:

[root@spinoza ~]# ll -d /tmp
drwxrwxrwt 22 root root 4096 2010-05-05 12:36 /tmp

[root@spinoza ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_spinoza-lv_root
                     1.8T  504G  1.2T  30% /
tmpfs                  24G     0   24G   0% /dev/shm
/dev/sda1             194M   40M  144M  22% /boot


FYI, this is a standalone workstation.
MPI is not being used over any network, private or local.
It is all "inside the box".

- are there any revealing messages in /var/log/messages (or equivalent) about 
failures when the machine hangs?
Parsing kernel messages is not my favorite hobby, nor really my league.
In any case, as far as my search could tell, there are just standard
kernel messages in /var/log/messages (e.g. ntpd synchronization, etc.),
up to the point where the system hangs when the hello_c program fails.
Then the log starts again with the boot process.
This behavior was repeated time and again over my several
attempts to run OpenMPI programs with the "sm" btl on.
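In case anybody wants me to look for something more specific, this is
the kind of search I can run over the standard Fedora logs:

grep -i -E 'mtrr|invalid opcode|segfault|oops' /var/log/messages*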

***

However, I am suspicious of these kernel messages during boot.
Are they telling me of a memory misconfiguration, perhaps?
What do the "*BAD*gran_size: ..." messages mean?

Does anybody out there with a sane, functional Nehalem system
get these funny "*BAD*gran_size: ..." lines
in " dmesg | more" output, or in /var/log/messages during boot?
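A quick way to check, if anybody is willing, would be something like:

dmesg | grep -E 'BAD.*gran_size|mtrr_cleanup'

Below is the relevant chunk of what I get here.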

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
total RAM covered: 49144M
gran_size: 64K  chunk_size: 64K         num_reg: 8      lose cover RAM: 45G
gran_size: 64K  chunk_size: 128K        num_reg: 8      lose cover RAM: 45G
gran_size: 64K  chunk_size: 256K        num_reg: 8      lose cover RAM: 45G
gran_size: 64K  chunk_size: 512K        num_reg: 8      lose cover RAM: 45G
gran_size: 64K  chunk_size: 1M  num_reg: 8      lose cover RAM: 45G
gran_size: 64K  chunk_size: 2M  num_reg: 8      lose cover RAM: 45G
gran_size: 64K  chunk_size: 4M  num_reg: 8      lose cover RAM: 45G
gran_size: 64K  chunk_size: 8M  num_reg: 8      lose cover RAM: 45G
gran_size: 64K  chunk_size: 16M         num_reg: 8      lose cover RAM: 0G
gran_size: 64K  chunk_size: 32M         num_reg: 8      lose cover RAM: 0G
gran_size: 64K  chunk_size: 64M         num_reg: 8      lose cover RAM: 0G
gran_size: 64K  chunk_size: 128M        num_reg: 8      lose cover RAM: 0G
gran_size: 64K  chunk_size: 256M        num_reg: 8      lose cover RAM: 0G
gran_size: 64K  chunk_size: 512M        num_reg: 8      lose cover RAM: 0G
gran_size: 64K  chunk_size: 1G  num_reg: 8      lose cover RAM: 0G
*BAD*gran_size: 64K     chunk_size: 2G  num_reg: 8      lose cover RAM: -1G
gran_size: 128K         chunk_size: 128K        num_reg: 8      lose cover RAM: 45G
gran_size: 128K         chunk_size: 256K        num_reg: 8      lose cover RAM: 45G
gran_size: 128K         chunk_size: 512K        num_reg: 8      lose cover RAM: 45G
gran_size: 128K         chunk_size: 1M  num_reg: 8      lose cover RAM: 45G
gran_size: 128K         chunk_size: 2M  num_reg: 8      lose cover RAM: 45G
gran_size: 128K         chunk_size: 4M  num_reg: 8      lose cover RAM: 45G
gran_size: 128K         chunk_size: 8M  num_reg: 8      lose cover RAM: 45G
gran_size: 128K         chunk_size: 16M         num_reg: 8      lose cover RAM: 0G
gran_size: 128K         chunk_size: 32M         num_reg: 8      lose cover RAM: 0G
gran_size: 128K         chunk_size: 64M         num_reg: 8      lose cover RAM: 0G
gran_size: 128K         chunk_size: 128M        num_reg: 8      lose cover RAM: 0G
gran_size: 128K         chunk_size: 256M        num_reg: 8      lose cover RAM: 0G
gran_size: 128K         chunk_size: 512M        num_reg: 8      lose cover RAM: 0G
gran_size: 128K         chunk_size: 1G  num_reg: 8      lose cover RAM: 0G
*BAD*gran_size: 128K    chunk_size: 2G  num_reg: 8      lose cover RAM: -1G


... and it goes on and on ... then stops with


*BAD*gran_size: 512M    chunk_size: 2G  num_reg: 8      lose cover RAM: -520M
gran_size: 1G   chunk_size: 1G  num_reg: 6      lose cover RAM: 1016M
gran_size: 1G   chunk_size: 2G  num_reg: 7      lose cover RAM: 1016M
gran_size: 2G   chunk_size: 2G  num_reg: 5      lose cover RAM: 2040M
mtrr_cleanup: can not find optimal value
please specify mtrr_gran_size/mtrr_chunk_size

...

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

I know about the finicky memory configuration details
required by Nehalem, but I didn't put this system together,
nor have I opened the box yet to see what is inside.

Kernel experts and Nehalem Pros:

If something sounds suspicious, please tell me, and I will
check if the memory modules are the right ones and correctly
distributed on the slots.
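One way I can check without opening the box, I suppose, is dmidecode
(assuming it is installed; needs root):

dmidecode --type memory | grep -E 'Locator|Size|Speed'
# lists each DIMM slot, whether it is populated, and the module size/speed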

**

Thank you very much,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


On May 4, 2010, at 8:35 PM, Gus Correa wrote:
Hi Douglas

Yes, very helpful indeed!

The machine here is a two-way quad-core, and /proc/cpuinfo shows 16
processors, twice as many as the physical cores,
just like you see on yours.
So, HT is turned on for sure.

The security guard opened the office door for me,
and I could reboot that machine.
It's called Spinoza.  Maybe that's why it is locked.
Now the door is locked again, so I will have to wait until tomorrow
to play around with the BIOS settings.

I will remember the BIOS double negative that you pointed out:
"When Disabled only one thread per core is enabled"
Ain't that English funny?
So far, I can't get no satisfaction.
Hence, let's see if Ralph's suggestion works.
Never get no hyperthreading turned on,
and you ain't have no problems with Open MPI.  :)

Many thanks!
Have a great Halifax Spring time!

Cheers,
Gus

Douglas Guptill wrote:
On Tue, May 04, 2010 at 05:34:40PM -0600, Ralph Castain wrote:
On May 4, 2010, at 4:51 PM, Gus Correa wrote:

Hi Ralph

Ralph Castain wrote:
One possibility is that the sm btl might not like that you have hyperthreading 
enabled.
I remember that hyperthreading was discussed months ago,
in the previous incarnation of this problem/thread/discussion on "Nehalem vs. Open 
MPI".
(It sounds like one of those supreme court cases ... )

I don't really administer that machine,
or any machine with hyperthreading,
so I am not much familiar to the HT nitty-gritty.
How do I turn off hyperthreading?
Is it a BIOS or a Linux thing?
I may try that.
I believe it can be turned off via an admin-level cmd, but I'm not certain 
about it
The challenge was too great to resist, so I yielded, and rebooted my
Nehalem (Core i7 920 @ 2.67 GHz) to confirm my thoughts on the issue.

Entering the BIOS setup by pressing "DEL", and "right-arrowing" over
to "Advanced", then "down arrow" to "CPU configuration", I found a
setting called "Intel (R) HT Technology".  The help dialogue says
"When Disabled only one thread per core is enabled".

Mine is "Enabled", and I see 8 cpus.  The Core i7, to my
understanding, is a 4 core chip.

Hope that helps,
Douglas.