Hi Jeff

Sorry, same problem with v1.4.2.
Without any MCA parameters set (i.e. withOUT -mca btl ^sm),
hello_c.c runs OK for np = 4 and 8.
(It is slower, though, than 1.4.1 with "sm" turned off,
as Ralph suggested an hour ago.)

Nevertheless, when I try np=16 it segfaults,
with the syslog messages below.
After that the machine goes south:
I can ping it, but not ssh to it.
This is the same behavior I saw and reported with 1.4.1.

I can't run anything else today: the machine hung again,
needs a hard reboot, and is locked in an office that
I don't have the keys to.  :)

Anyway, I can live with -mca btl ^sm.
That way it seems to use tcp, right?
I hope the performance impact is not too large.
I will try it on 1.4.2 tomorrow, if somebody opens that office.
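If it works out, I may put the setting in my MCA parameter file so I
don't have to type it every time. If I read the docs right, a line
like this in $HOME/.openmpi/mca-params.conf should be equivalent to
the command-line flag:

  # disable the shared-memory BTL; Open MPI then falls back to tcp/self
  btl = ^sm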

Regards,
Gus
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpirun -np 4 a.out

Hello, world, I am 0 of 4
Hello, world, I am 1 of 4
Hello, world, I am 2 of 4
Hello, world, I am 3 of 4

/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpirun -np 8 a.out

Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 3 of 8
Hello, world, I am 4 of 8
Hello, world, I am 5 of 8
Hello, world, I am 6 of 8
Hello, world, I am 7 of 8

/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpirun -np 16 a.out

--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 14716 on node spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Message from syslogd@spinoza at May  4 18:02:56 ...
 kernel:------------[ cut here ]------------

Message from syslogd@spinoza at May  4 18:02:56 ...
 kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd@spinoza at May  4 18:02:56 ...
kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id

Message from syslogd@spinoza at May  4 18:02:56 ...
 kernel:Stack:

Message from syslogd@spinoza at May  4 18:02:56 ...
 kernel:Call Trace:

Message from syslogd@spinoza at May  4 18:02:56 ...
kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01


Gus Correa wrote:
Hi Jeff

Sure, I will certainly try v1.4.2.
I am downloading it right now.
As of this morning, when I first downloaded it,
the web site still had 1.4.1.
Maybe I should have refreshed the page in my browser.

I will let you know how it goes.

Gus

Jeff Squyres wrote:
Gus -- Can you try v1.4.2 which was just released today?

On May 4, 2010, at 4:18 PM, Gus Correa wrote:

Hi Ralph

Thank you very much.
The "-mca btl ^sm" workaround seems to have solved the problem,
at least for the little hello_c.c test.
I just ran it fine up to 128 processes.
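For the archives, the runs were along these lines (combining the
workaround flag with the process count):

  mpirun -mca btl ^sm -np 128 a.out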

I confess I am puzzled by this workaround.
* Why should we turn off "sm" on a standalone machine,
where everything is supposed to operate via shared memory?
* Do I incur a performance penalty by not using "sm"?
* What other mechanism does Open MPI actually use for process
communication in this case?

It seems to be using tcp, because when I try -np 256 I get this error:

[spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
of network connections a process can open was reached in file
../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can
be open
This can be resolved by setting the mca parameter
opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------

Anyway, no big deal, because we don't intend to oversubscribe the
processors on real jobs (and the error message itself suggests a
workaround for increasing np, if needed).
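For completeness, the workaround the message suggests would look
roughly like this (the descriptor limit of 4096 is just an
illustrative number):

  ulimit -n 4096     # raise the per-process open file descriptor limit (bash)
  mpirun -mca opal_set_max_sys_limits 1 -np 256 a.out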

Many thanks,
Gus Correa

Ralph Castain wrote:
I would certainly try it with -mca btl ^sm and see if that solves the problem.

On May 4, 2010, at 2:38 PM, Eugene Loh wrote:

Gus Correa wrote:

Dear Open MPI experts

I need your help to get Open MPI right on a standalone
machine with Nehalem processors.

How to tweak the MCA parameters to avoid problems
on Nehalem (and perhaps also AMD) processors,
where MPI programs hang, was discussed here before.

However, I lost track of the details: how to work around the problem,
and whether it has already been fully fixed.
Yes, perhaps the problem you're seeing is not what you remember being discussed.

Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 . It's presumably fixed.

I am now facing the problem directly on a single Nehalem box.

I installed OpenMPI 1.4.1 from source,
and compiled the test hello_c.c with mpicc.
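(For reference, hello_c.c is essentially the minimal MPI program
below, a sketch along the lines of the example shipped in the
Open MPI tarball:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();

    return 0;
}
)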
Then I tried to run it with:

1) mpirun -np 4 a.out
It ran OK (but seemed to be slow).

2) mpirun -np 16 a.out
It hung, and brought the machine to a halt.

Any words of wisdom are appreciated.

More info:

* OpenMPI 1.4.1 installed from source (tarball from your site).
* Compilers are gcc/g++/gfortran 4.4.3-4.
* OS is Fedora Core 12.
* The machine is a Dell box with Intel Xeon 5540 (quad core)
processors on a two-way motherboard and 48GB of RAM.
* /proc/cpuinfo indicates that hyperthreading is turned on.
(I can see 16 "processors".)


What should I do?

Use -mca btl ^sm ?
Use -mca btl_sm_num_fifos=some_number ? (Which number?)
Use both?
Do something else?
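For the record, the first two options would be invoked roughly like
this (the fifo count of 4 is only an illustrative value):

  mpirun -mca btl ^sm -np 16 a.out
  mpirun -mca btl_sm_num_fifos 4 -np 16 a.out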