Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI

2013-12-02 Thread Rolf vandeVaart
Thanks for the report.  CUDA-aware Open MPI does not currently support doing 
reduction operations on GPU memory.
Is this a feature you would be interested in?
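
For what it's worth, the usual workaround today is to stage the buffers through
host memory yourself before and after the collective.  A minimal sketch (buffer
names and types are illustrative, assuming a CUDA build; this is not Open MPI
code):

/* Hedged sketch of a host-staged reduction: copy the device buffer to the
 * host, reduce there, and copy the result back on the root rank. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

void reduce_device_buffer(double *d_send, double *d_recv, int count,
                          int root, MPI_Comm comm)
{
    double *h_send = (double *) malloc(count * sizeof(double));
    double *h_recv = (double *) malloc(count * sizeof(double));
    int rank;

    MPI_Comm_rank(comm, &rank);

    /* Device -> host before the collective. */
    cudaMemcpy(h_send, d_send, count * sizeof(double), cudaMemcpyDeviceToHost);

    MPI_Reduce(h_send, h_recv, count, MPI_DOUBLE, MPI_SUM, root, comm);

    /* The reduced result is only significant on the root rank. */
    if (rank == root)
        cudaMemcpy(d_recv, h_recv, count * sizeof(double), cudaMemcpyHostToDevice);

    free(h_send);
    free(h_recv);
}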

Rolf

>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Zaspel
>Sent: Friday, November 29, 2013 11:24 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>
>Hi users list,
>
>I would like to report a bug in the CUDA-aware OpenMPI 1.7.3
>implementation. I'm using CUDA 5.0 and Ubuntu 12.04.
>
>Attached, you will find an example code file, to reproduce the bug.
>The point is that MPI_Reduce with normal CPU memory fully works but the
>use of GPU memory leads to a segfault. (GPU memory is used when defining
>USE_GPU).
>
>The segfault looks like this:
>
>[peak64g-36:25527] *** Process received signal *** [peak64g-36:25527]
>Signal: Segmentation fault (11) [peak64g-36:25527] Signal code: Invalid
>permissions (2) [peak64g-36:25527] Failing at address: 0x600100200 [peak64g-
>36:25527] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
>[0x7ff2abdb24a0]
>[peak64g-36:25527] [ 1]
>/data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(+0x7d410)
>[0x7ff2ac4b9410] [peak64g-36:25527] [ 2]
>/data/zaspel/openmpi-
>1.7.3_build/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_
>basic_linear+0x371)
>[0x7ff2a5987531]
>[peak64g-36:25527] [ 3]
>/data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(MPI_Reduce+0x135)
>[0x7ff2ac499d55]
>[peak64g-36:25527] [ 4] /home/zaspel/testMPI/test_reduction() [0x400ca0]
>[peak64g-36:25527] [ 5]
>/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7ff2abd9d76d]
>[peak64g-36:25527] [ 6] /home/zaspel/testMPI/test_reduction() [0x400af9]
>[peak64g-36:25527] *** End of error message ***
>--------------------------------------------------------------------------
>mpirun noticed that process rank 0 with PID 25527 on node peak64g-36 exited
>on signal 11 (Segmentation fault).
>--------------------------------------------------------------------------
>
>Best regards,
>
>Peter
---------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may
contain confidential information.  Any unauthorized review, use, disclosure or
distribution is prohibited.  If you are not the intended recipient, please
contact the sender by reply email and destroy all copies of the original message.
---------------------------------------------------------------------------------


Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI

2013-12-02 Thread Peter Zaspel

Hi Rolf,

OK, I didn't know that. Sorry.

Yes, it would be a pretty important feature in cases where you are doing
reduction operations on many, many entries in parallel: each individual
reduction is not very complex or time-consuming, but potentially hundreds of
thousands of reductions are done at the same time. This is definitely a point
where a CUDA-aware implementation could give some performance improvements.
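
As an illustrative sketch (not the attached test code), this is the kind of
pattern we would like to run directly on device buffers: one element-wise
reduce over a large device array is effectively that many small, independent
reductions at once.

/* Illustrative sketch only: with the current 1.7.3 behaviour this is exactly
 * what segfaults, since MPI_Reduce is handed device pointers. */
#include <mpi.h>
#include <cuda_runtime.h>

void reduce_many_entries(MPI_Comm comm, int n)  /* n ~ hundreds of thousands */
{
    double *d_in, *d_out;

    cudaMalloc((void **) &d_in,  n * sizeof(double));
    cudaMalloc((void **) &d_out, n * sizeof(double));
    /* ... fill d_in on the GPU ... */

    /* Each of the n entries is reduced independently across all ranks. */
    MPI_Reduce(d_in, d_out, n, MPI_DOUBLE, MPI_SUM, 0, comm);

    cudaFree(d_in);
    cudaFree(d_out);
}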

I'm curious: rather complex operations like allgatherv are CUDA-aware, but a
reduction is not. Is there a reason for this? And is there documentation on
which MPI calls are CUDA-aware and which are not?

Best regards

Peter



On 12/02/2013 02:18 PM, Rolf vandeVaart wrote:
> Thanks for the report.  CUDA-aware Open MPI does not currently support doing 
> reduction operations on GPU memory.
> Is this a feature you would be interested in?
> 
> Rolf
> 
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Zaspel
>> Sent: Friday, November 29, 2013 11:24 AM
>> To: us...@open-mpi.org
>> Subject: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>>
>> Hi users list,
>>
>> I would like to report a bug in the CUDA-aware OpenMPI 1.7.3
>> implementation. I'm using CUDA 5.0 and Ubuntu 12.04.
>>
>> Attached, you will find an example code file, to reproduce the bug.
>> The point is that MPI_Reduce with normal CPU memory fully works but the
>> use of GPU memory leads to a segfault. (GPU memory is used when defining
>> USE_GPU).
>>
>> The segfault looks like this:
>>
>> [peak64g-36:25527] *** Process received signal *** [peak64g-36:25527]
>> Signal: Segmentation fault (11) [peak64g-36:25527] Signal code: Invalid
>> permissions (2) [peak64g-36:25527] Failing at address: 0x600100200 [peak64g-
>> 36:25527] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
>> [0x7ff2abdb24a0]
>> [peak64g-36:25527] [ 1]
>> /data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(+0x7d410)
>> [0x7ff2ac4b9410] [peak64g-36:25527] [ 2]
>> /data/zaspel/openmpi-
>> 1.7.3_build/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_
>> basic_linear+0x371)
>> [0x7ff2a5987531]
>> [peak64g-36:25527] [ 3]
>> /data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(MPI_Reduce+0x135)
>> [0x7ff2ac499d55]
>> [peak64g-36:25527] [ 4] /home/zaspel/testMPI/test_reduction() [0x400ca0]
>> [peak64g-36:25527] [ 5]
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7ff2abd9d76d]
>> [peak64g-36:25527] [ 6] /home/zaspel/testMPI/test_reduction() [0x400af9]
>> [peak64g-36:25527] *** End of error message ***
>> --
>> mpirun noticed that process rank 0 with PID 25527 on node peak64g-36 exited
>> on signal 11 (Segmentation fault).
>> --
>>
>> Best regards,
>>
>> Peter
> 


-- 
Dipl.-Inform. Peter Zaspel
Institut fuer Numerische Simulation, Universitaet Bonn
Wegelerstr.6, 53115 Bonn, Germany
tel: +49 228 73-2748   mailto:zas...@ins.uni-bonn.de
fax: +49 228 73-7527   http://wissrech.ins.uni-bonn.de/people/zaspel.html



Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI

2013-12-02 Thread Rolf vandeVaart
Hi Peter:
The reason behind not having the reduction support (I believe) was just the 
complexity of adding it to the code.  I will at least submit a ticket so we can 
look at it again.

Here is a link to the FAQ, which lists the APIs that are CUDA-aware:
http://www.open-mpi.org/faq/?category=running#mpi-cuda-support

Regards,
Rolf

>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Zaspel
>Sent: Monday, December 02, 2013 8:29 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>
>Hi Rolf,
>
>OK, I didn't know that. Sorry.
>
>Yes, it would be a pretty important feature in cases when you are doing
>reduction operations on many, many entries in parallel. Therefore, each
>reduction is not very complex or time-consuming but potentially hundreds of
>thousands reductions are done at the same time. This is definitely a point
>where a CUDA-aware implementation can give some performance
>improvements.
>
>I'm curious: Rather complex operations like allgatherv are CUDA-aware, but a
>reduction is not. Is there a reasoning for this? Is there some documentation,
>which MPI calls are CUDA-aware and which not?
>
>Best regards
>
>Peter
>
>
>
>On 12/02/2013 02:18 PM, Rolf vandeVaart wrote:
>> Thanks for the report.  CUDA-aware Open MPI does not currently support
>doing reduction operations on GPU memory.
>> Is this a feature you would be interested in?
>>
>> Rolf
>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter
>>> Zaspel
>>> Sent: Friday, November 29, 2013 11:24 AM
>>> To: us...@open-mpi.org
>>> Subject: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>>>
>>> Hi users list,
>>>
>>> I would like to report a bug in the CUDA-aware OpenMPI 1.7.3
>>> implementation. I'm using CUDA 5.0 and Ubuntu 12.04.
>>>
>>> Attached, you will find an example code file, to reproduce the bug.
>>> The point is that MPI_Reduce with normal CPU memory fully works but
>>> the use of GPU memory leads to a segfault. (GPU memory is used when
>>> defining USE_GPU).
>>>
>>> The segfault looks like this:
>>>
>>> [peak64g-36:25527] *** Process received signal *** [peak64g-36:25527]
>>> Signal: Segmentation fault (11) [peak64g-36:25527] Signal code:
>>> Invalid permissions (2) [peak64g-36:25527] Failing at address:
>>> 0x600100200 [peak64g- 36:25527] [ 0]
>>> /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
>>> [0x7ff2abdb24a0]
>>> [peak64g-36:25527] [ 1]
>>> /data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(+0x7d410)
>>> [0x7ff2ac4b9410] [peak64g-36:25527] [ 2]
>>> /data/zaspel/openmpi-
>>>
>1.7.3_build/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intr
>>> a_
>>> basic_linear+0x371)
>>> [0x7ff2a5987531]
>>> [peak64g-36:25527] [ 3]
>>> /data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(MPI_Reduce+0x135)
>>> [0x7ff2ac499d55]
>>> [peak64g-36:25527] [ 4] /home/zaspel/testMPI/test_reduction()
>>> [0x400ca0] [peak64g-36:25527] [ 5]
>>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
>>> [0x7ff2abd9d76d] [peak64g-36:25527] [ 6]
>>> /home/zaspel/testMPI/test_reduction() [0x400af9] [peak64g-36:25527]
>>> *** End of error message ***
>>> -
>>> - mpirun noticed that process rank 0 with PID 25527 on node
>>> peak64g-36 exited on signal 11 (Segmentation fault).
>>> -
>>> -
>>>
>>> Best regards,
>>>
>>> Peter
>>
>
>
>--
>Dipl.-Inform. Peter Zaspel
>Institut fuer Numerische Simulation, Universitaet Bonn Wegelerstr.6, 53115
>Bonn, Germany
>tel: +49 228 73-2748   mailto:zas...@ins.uni-bonn.de
>fax: +49 228 73-7527   http://wissrech.ins.uni-bonn.de/people/zaspel.html
>


Re: [OMPI users] configure: error: Could not run a simple Fortran program. Aborting.

2013-12-02 Thread Jeff Squyres (jsquyres)
It looks like your Fortran compiler installation is borked.  Have you tested 
with the same test program that configure used?

   program main

   end

Put that in a simple "conftest.f" file, and try the same invocation line that 
configure used:

/usr/local/bin/gfortran -o conftest conftest.f

Does that work?

If that works and does not yield the same error that configure saw, then 
perhaps there is some environment variable(s) that are/were present when you 
run configure that are not present when you try the test manually...?
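
If it helps, here is a tiny sketch (illustrative only, not something configure
itself runs) that prints a few variables commonly involved -- PATH, FC, FCFLAGS,
LD_LIBRARY_PATH, DYLD_LIBRARY_PATH -- so you can compare your interactive shell
with the environment configure saw:

/* Sketch: dump the variables most likely to differ between the two runs. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *vars[] = { "PATH", "FC", "FCFLAGS",
                           "LD_LIBRARY_PATH", "DYLD_LIBRARY_PATH" };
    for (int i = 0; i < 5; i++) {
        const char *val = getenv(vars[i]);
        printf("%s=%s\n", vars[i], val ? val : "(unset)");
    }
    return 0;
}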


On Dec 1, 2013, at 8:51 AM, Raiden Hasegawa  wrote:

> Hi All, new to the list here.  I'm running into an error while trying to 
> configure:
> 
> shell$ ./configure --prefix=/usr/local/Cellar/open-mpi/1.7.3 
> --disable-silent-rules --enable-ipv6
> 
> Here is a blurb from the config.log (which I have attached as well):
> 
> configure:29606: checking if Fortran compiler works
> configure:29635: /usr/local/bin/gfortran -o conftest conftest.f  >&5
> Undefined symbols for architecture x86_64:
>   "__gfortran_set_options", referenced from:
>   _main in cccSAmNO.o
> ld: symbol(s) not found for architecture x86_64
> collect2: error: ld returned 1 exit status
> configure:29635: $? = 1
> configure: program exited with status 1
> configure: failed program was:
> |   program main
> |
> |   end
> configure:29651: result: no
> configure:29665: error: Could not run a simple Fortran program.  Aborting.
> 
> I have tested my gfortran compiler on some simple "Hello World" programs and 
> it works just fine.  I am having trouble diagnosing the problem and any and 
> all help would be appreciated.  Here are my specs:
> 
> mac os x 10.8.4
> gcc and gfortran 4.8.2 (both installed using homebrew)
> open-mpi 1.7.3
> 
> Best,
> 
> Raiden


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] open-mpi on Mac OS 10.9 (Mavericks)

2013-12-02 Thread Jeff Squyres (jsquyres)
Karl --

Can you force the use of just the shared memory transport -- i.e., disable the 
TCP BTL?  For example:

mpirun -np 2 --mca btl sm,self hello_c

If that also hangs, can you attach a debugger and see *where* it is hanging 
inside MPI_Init()?  (In OMPI, MPI::Init() simply invokes MPI_Init())
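
(For reference, hello_c is just a minimal MPI hello-world; here is a sketch
along those lines, not the exact file shipped in the examples directory:)

/* Minimal hello-world sketch for testing the sm,self BTLs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}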


On Nov 27, 2013, at 2:56 PM, "Meredith, Karl"  
wrote:

> /opt/trunk/apple-only/bin/ompi_info --param oob tcp --level 9
> MCA oob: parameter "oob_tcp_verbose" (current value: "0", 
> data source: default, level: 9 dev/all, type: int)
>  Verbose level for the OOB tcp component
> MCA oob: parameter "oob_tcp_peer_limit" (current value: "-1", 
> data source: default, level: 9 dev/all, type: int)
>  Maximum number of peer connections to simultaneously 
> maintain (-1 = infinite)
> MCA oob: parameter "oob_tcp_peer_retries" (current value: 
> "60", data source: default, level: 9 dev/all, type: int)
>  Number of times to try shutting down a connection 
> before giving up
> MCA oob: parameter "oob_tcp_debug" (current value: "0", data 
> source: default, level: 9 dev/all, type: int)
>  Enable (1) / disable (0) debugging output for this 
> component
> MCA oob: parameter "oob_tcp_sndbuf" (current value: "131072", 
> data source: default, level: 9 dev/all, type: int)
>  TCP socket send buffering size (in bytes)
> MCA oob: parameter "oob_tcp_rcvbuf" (current value: "131072", 
> data source: default, level: 9 dev/all, type: int)
>  TCP socket receive buffering size (in bytes)
> MCA oob: parameter "oob_tcp_if_include" (current value: "", 
> data source: default, level: 9 dev/all, type: string, synonyms: 
> oob_tcp_include)
>  Comma-delimited list of devices and/or CIDR notation 
> of networks to use for Open MPI bootstrap communication (e.g., 
> "eth0,192.168.0.0/16").  Mutually exclusive with oob_tcp_if_exclude.
> MCA oob: parameter "oob_tcp_if_exclude" (current value: "", 
> data source: default, level: 9 dev/all, type: string, synonyms: 
> oob_tcp_exclude)
>  Comma-delimited list of devices and/or CIDR notation 
> of networks to NOT use for Open MPI bootstrap communication -- all devices 
> not matching these specifications will be used (e.g., "eth0,192.168.0.0/16"). 
>  If set to a non-default value, it is mutually exclusive with 
> oob_tcp_if_include.
> MCA oob: parameter "oob_tcp_connect_sleep" (current value: 
> "1", data source: default, level: 9 dev/all, type: int)
>  Enable (1) / disable (0) random sleep for connection 
> wireup.
> MCA oob: parameter "oob_tcp_listen_mode" (current value: 
> "event", data source: default, level: 9 dev/all, type: int)
>  Mode for HNP to accept incoming connections: event, 
> listen_thread.
>  Valid values: 0:"event", 1:"listen_thread"
> MCA oob: parameter "oob_tcp_listen_thread_max_queue" (current 
> value: "10", data source: default, level: 9 dev/all, type: int)
>  High water mark for queued accepted socket list 
> size.  Used only when listen_mode is listen_thread.
> MCA oob: parameter "oob_tcp_listen_thread_wait_time" (current 
> value: "10", data source: default, level: 9 dev/all, type: int)
>  Time in milliseconds to wait before actively 
> checking for new connections when listen_mode is listen_thread.
> MCA oob: parameter "oob_tcp_static_ports" (current value: "", 
> data source: default, level: 9 dev/all, type: string)
>  Static ports for daemons and procs (IPv4)
> MCA oob: parameter "oob_tcp_dynamic_ports" (current value: 
> "", data source: default, level: 9 dev/all, type: string)
>  Range of ports to be dynamically used by daemons and 
> procs (IPv4)
> MCA oob: parameter "oob_tcp_disable_family" (current value: 
> "none", data source: default, level: 9 dev/all, type: int)
>  Disable IPv4 (4) or IPv6 (6)
>  Valid values: 0:"none", 4:"IPv4", 6:"IPv6"
> 
> /opt/trunk/apple-only/bin/ompi_info --param btl tcp --level 9
> MCA btl: parameter "btl_tcp_links" (current value: "1", data 
> source: default, level: 4 tuner/basic, type: unsigned)
> MCA btl: parameter "btl_tcp_if_include" (current value: "", 
> data source: default, level: 1 user/basic, type: string)
>  Comma-delimited list of devices and/or CIDR notation 
> of networks to use for MPI communication (e.g., "eth0,192.168.0.0/16").  
> Mutually exclusive with btl_tcp_if_exclude.
> MCA btl: parameter "btl_tcp_if_excl

Re: [OMPI users] [EXTERNAL] Re: open-mpi on Mac OS 10.9 (Mavericks)

2013-12-02 Thread Jeff Squyres (jsquyres)
Ah -- sorry, I missed this mail before I replied to the other thread (OS X Mail 
threaded them separately somehow...).

Sorry to ask you to dive deeper, but can you find out where in orte_ess.init() 
it's failing?  orte_ess.init is actually a function pointer; it's a jump-off 
point into a dlopen'ed plugin.
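
(In case the "function pointer into a plugin" part is unclear, here is a generic
sketch of that pattern -- not Open MPI's actual code -- which is why the crash
site depends on whichever ess component was selected at runtime:)

/* Generic dlopen/dlsym sketch; link with -ldl on Linux.  Not OMPI internals. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*init_fn)(void);

int call_plugin_init(const char *path, const char *symbol)
{
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    init_fn init = (init_fn) dlsym(handle, symbol);
    if (init == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    /* The stored pointer is the equivalent of the orte_ess.init() jump-off. */
    return init();
}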


On Nov 25, 2013, at 11:53 AM, "Meredith, Karl"  
wrote:

> Digging a little deeper by running the code in the lldb debugger, I found 
> that the stall occurs in a call to init_orte from ompi_mpi_init.c:
>   356 /* Setup ORTE - note that we are an MPI process  */
>   357 if (ORTE_SUCCESS != (ret = orte_init(NULL, NULL, ORTE_PROC_MPI))) {
>   358 error = "ompi_mpi_init: orte_init failed";
>   359 goto error;
>   360 }
> 
> The code never returns from orte_init.
> 
> It gets stuck in orte_ess.init() called from orte_init.c:
>   126 /* initialize the RTE for this environment */
>   127 if (ORTE_SUCCESS != (ret = orte_ess.init())) {
> 
> When I step through this orte_ess_init in the lldb debugger, I actually get 
> some output from the code (no output if not using the debugger and stepping 
> through):
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  ompi_mpi_init: orte_init failed
>  --> Returned "Unable to start a daemon on the local node" (-128) instead of 
> "Success" (0)
> 
> 
> 
> Karl
> 
> 
> 
> On Nov 25, 2013, at 9:20 AM, Meredith, Karl  
> wrote:
> 
>> Here’s the back trace from lldb:
>> $ )ps -elf | grep  hello
>> 1042653210 45231 45230 4006   0  31  0  2448976   2148 -  S+ 
>>  0 ttys0020:00.01 hello_cxx 9:07AM
>> 1042653210 45232 45230 4006   0  31  0  2457168   2156 -  S+ 
>>  0 ttys0020:00.04 hello_cxx 9:07AM
>> 
>> (meredithk@meredithk-mac)-(09:15 AM Mon Nov 
>> 25)-(~/tools/openmpi-1.6.5/examples)
>> $ )lldb -p 45231
>> Attaching to process with:
>>   process attach -p 45231
>> Process 45231 stopped
>> Executable module set to 
>> "/Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx".
>> Architecture set to: x86_64-apple-macosx.
>> (lldb) bt
>> * thread #1: tid = 0x168535, 0x7fff8c1859aa 
>> libsystem_kernel.dylib`select$DARWIN_EXTSN + 10, queue = 
>> 'com.apple.main-thread, stop reason = signal SIGSTOP
>>   frame #0: 0x7fff8c1859aa libsystem_kernel.dylib`select$DARWIN_EXTSN + 
>> 10
>>   frame #1: 0x000106b73ea0 
>> libmpi.1.dylib`select_dispatch(base=0x7f84c3c0b430, 
>> arg=0x7f84c3c0b3e0, tv=0x7fff5924ca70) + 80 at select.c:174
>>   frame #2: 0x000106b3eb0f 
>> libmpi.1.dylib`opal_event_base_loop(base=0x7f84c3c0b430, flags=5) + 415 
>> at event.c:838
>> 
>> Both processors are at this state.
>> 
>> Here’s the output from otool -L ./hello_cxx:
>> 
>> $ )otool -L ./hello_cxx
>> ./hello_cxx:
>>  /Users/meredithk/tools/openmpi/lib/libmpi_cxx.1.dylib (compatibility 
>> version 2.0.0, current version 2.2.0)
>>  /Users/meredithk/tools/openmpi/lib/libmpi.1.dylib (compatibility 
>> version 2.0.0, current version 2.8.0)
>>  /opt/local/lib/libgcc/libstdc++.6.dylib (compatibility version 7.0.0, 
>> current version 7.18.0)
>>  /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
>> version 1197.1.1)
>>  /opt/local/lib/libgcc/libgcc_s.1.dylib (compatibility version 1.0.0, 
>> current version 1.0.0)
>> 
>> 
>> On Nov 25, 2013, at 9:14 AM, George Bosilca  wrote:
>> 
>>> Mac OS X 1.9 dropped support for gdb. Please report the output of lldb 
>>> instead.
>>> 
>>> Also, can you run “otool -L ./hello_cxx” and report the output.
>>> 
>>> Thanks,
>>>  George.
>>> 
>>> 
>>> On Nov 25, 2013, at 15:09 , Meredith, Karl  
>>> wrote:
>>> 
 I do have DYLD_LIBRARY_PATH set to the same paths as LD_LIBRARY_PATH.  
 This does not resolve the problem.  The code still hangs on MPI::Init().
 
 Another thing I tried is I recompiled openmpi with the debug flags 
 activated:
 ./configure --prefix=$HOME/tools/openmpi --enable-debug
 make
 make install
 
 Then, I attached to the running process using gdb.  I tried to do a back 
 trace and see where it was hanging up at, but all I got was this:
 Attaching to process 45231
 Reading symbols from 
 /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx...Reading symbols 
 from 
 /Users/meredithk/tools/openmpi-1.6.5/examples/hello_cxx.dSYM/Contents/Resources/DWARF/hello_cxx...done.
 done.
 0x7fff8c1859aa in ?? ()
 (gdb) bt
 #0  0x7fff8c1859aa in ?? ()
 #1  0x000106b73ea0 in ?? ()
 #2  0x706d6e65706f2f2f 

Re: [OMPI users] configure: error: Could not run a simple Fortran program. Aborting.

2013-12-02 Thread Raiden Hasegawa
Thanks, Jeff.  The compiler does in fact work when running the troublesome
line in ./configure. I haven't set either FC, FCFLAGS nor do I have
LD_LIBRARY_PATH set in my .bashrc.  Do you have any thoughts on what
environmental variable may trip this up?

Raiden


On Mon, Dec 2, 2013 at 11:23 AM, Jeff Squyres (jsquyres)  wrote:

> It looks like your Fortran compiler installation is borked.  Have you
> tested with the same test program that configure used?
>
>program main
>
>end
>
> Put that in a simple "conftest.f" file, and try the same invocation line
> that configure used:
>
> /usr/local/bin/gfortran -o conftest conftest.f
>
> Does that work?
>
> If that works and does not yield the same error that configure saw, then
> perhaps there is some environment variable(s) that are/were present when
> you run configure that are not present when you try the test manually...?
>
>
> On Dec 1, 2013, at 8:51 AM, Raiden Hasegawa 
> wrote:
>
> > Hi All, new to the list here.  I'm running into an error while trying to
> configure:
> >
> > shell$ ./configure --prefix=/usr/local/Cellar/open-mpi/1.7.3
> --disable-silent-rules --enable-ipv6
> >
> > Here is a blurb from the config.log (which I have attached as well):
> >
> > configure:29606: checking if Fortran compiler works
> > configure:29635: /usr/local/bin/gfortran -o conftestconftest.f  >&5
> > Undefined symbols for architecture x86_64:
> >   "__gfortran_set_options", referenced from:
> >   _main in cccSAmNO.o
> > ld: symbol(s) not found for architecture x86_64
> > collect2: error: ld returned 1 exit status
> > configure:29635: $? = 1
> > configure: program exited with status 1
> > configure: failed program was:
> > |   program main
> > |
> > |   end
> > configure:29651: result: no
> > configure:29665: error: Could not run a simple Fortran program.
>  Aborting.
> >
> > I have tested my gfortran compiler on some simple "Hello World" programs
> and it works just fine.  I am having trouble diagnosing the problem and any
> and all help would be appreciated.  Here are my specs:
> >
> > mac os x 10.8.4
> > gcc and gfortran 4.8.2 (both installed using homebrew)
> > open-mpi 1.7.3
> >
> > Best,
> >
> > Raiden
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


Re: [OMPI users] configure: error: Could not run a simple Fortran program. Aborting.

2013-12-02 Thread Jeff Squyres (jsquyres)
On Dec 2, 2013, at 3:00 PM, Raiden Hasegawa  wrote:

> Thanks, Jeff.  The compiler does in fact work when running the troublesome 
> line in ./configure.

Errr... I'm not sure how to parse that.  The config.log you cited shows that 
the compiler does *not* work in configure:

-
configure:29606: checking if Fortran compiler works
configure:29635: /usr/local/bin/gfortran -o conftest conftest.f  >&5
Undefined symbols for architecture x86_64:
  "__gfortran_set_options", referenced from:
  _main in cccSAmNO.o
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
configure:29635: $? = 1
configure: program exited with status 1
configure: failed program was:
|   program main
|
|   end
configure:29651: result: no
configure:29665: error: Could not run a simple Fortran program.  Aborting.
-

Did you typo and mean that the compiler does work when outside of configure, 
and fails when it is inside configure?

> I haven't set either FC, FCFLAGS nor do I have LD_LIBRARY_PATH set in my 
> .bashrc.  Do you have any thoughts on what environmental variable may trip 
> this up?

Do you have DYLD_LIBRARY_PATH set?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] configure: error: Could not run a simple Fortran program. Aborting.

2013-12-02 Thread Raiden Hasegawa
Yes, what I meant is that when running:

/usr/local/bin/gfortran -o conftest conftest.f

outside of configure it does work.

I don't think I have DYLD_LIBRARY_PATH set, but I will check when I get
back to my home computer.


On Mon, Dec 2, 2013 at 3:47 PM, Jeff Squyres (jsquyres)
wrote:

> On Dec 2, 2013, at 3:00 PM, Raiden Hasegawa 
> wrote:
>
> > Thanks, Jeff.  The compiler does in fact work when running the
> troublesome line in ./configure.
>
> Errr... I'm not sure how to parse that.  The config.log you cited shows
> that the compiler does *not* work in configure:
>
> -
> configure:29606: checking if Fortran compiler works
> configure:29635: /usr/local/bin/gfortran -o conftestconftest.f  >&5
> Undefined symbols for architecture x86_64:
>   "__gfortran_set_options", referenced from:
>   _main in cccSAmNO.o
> ld: symbol(s) not found for architecture x86_64
> collect2: error: ld returned 1 exit status
> configure:29635: $? = 1
> configure: program exited with status 1
> configure: failed program was:
> |   program main
> |
> |   end
> configure:29651: result: no
> configure:29665: error: Could not run a simple Fortran program.  Aborting.
> -
>
> Did you typo and mean that the compiler does work when outside of
> configure, and fails when it is inside configure?
>
> > I haven't set either FC, FCFLAGS nor do I have LD_LIBRARY_PATH set in my
> .bashrc.  Do you have any thoughts on what environmental variable may trip
> this up?
>
> Do you have DYLD_LIBRARY_PATH set?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


Re: [OMPI users] configure: error: Could not run a simple Fortran program. Aborting.

2013-12-02 Thread Jeff Squyres (jsquyres)
I did notice that you have an oddity:

- I see /usr/local/opt/gfortran/bin in your PATH (line 41 in config.log)
- I see that configure is invoking /usr/local/bin/gfortran (line 7630 and 
elsewhere in config.log)

That implies that you have 2 different gfortrans installed on your machine, one 
of which may be faulty, or may accidentally be referring to the libraries of 
the other (therefore resulting in Badness).



On Dec 2, 2013, at 3:52 PM, Raiden Hasegawa  wrote:

> Yes, what I meant is that when running:
> 
> /usr/local/bin/gfortran -o conftestconftest.f 
> 
> outside of configure it does work.
> 
> I don't think I have DYLD_LIBRARY_PATH set, but I will check when I get back 
> to my home computer.
> 
> 
> On Mon, Dec 2, 2013 at 3:47 PM, Jeff Squyres (jsquyres)  
> wrote:
> On Dec 2, 2013, at 3:00 PM, Raiden Hasegawa  wrote:
> 
> > Thanks, Jeff.  The compiler does in fact work when running the troublesome 
> > line in ./configure.
> 
> Errr... I'm not sure how to parse that.  The config.log you cited shows that 
> the compiler does *not* work in configure:
> 
> -
> configure:29606: checking if Fortran compiler works
> configure:29635: /usr/local/bin/gfortran -o conftestconftest.f  >&5
> Undefined symbols for architecture x86_64:
>   "__gfortran_set_options", referenced from:
>   _main in cccSAmNO.o
> ld: symbol(s) not found for architecture x86_64
> collect2: error: ld returned 1 exit status
> configure:29635: $? = 1
> configure: program exited with status 1
> configure: failed program was:
> |   program main
> |
> |   end
> configure:29651: result: no
> configure:29665: error: Could not run a simple Fortran program.  Aborting.
> -
> 
> Did you typo and mean that the compiler does work when outside of configure, 
> and fails when it is inside configure?
> 
> > I haven't set either FC, FCFLAGS nor do I have LD_LIBRARY_PATH set in my 
> > .bashrc.  Do you have any thoughts on what environmental variable may trip 
> > this up?
> 
> Do you have DYLD_LIBRARY_PATH set?
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] [EXTERNAL] Re: (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?

2013-12-02 Thread Nathan Hjelm
Ack, forgot about that. There is a bug in 1.7.3 that breaks one of LANL's 
default
settings. Just change the line in 
contrib/platform/lanl/cray_xe6/optimized-common

from:

enable_orte_static_ports=no

to:

enable_orte_static_ports=yes


That should work.

-Nathan

On Wed, Nov 27, 2013 at 08:05:48PM +, Teranishi, Keita wrote:
> Nathan,
> 
> I got a compile-time error (see below).  I use a script from
> contrib/platform/lanl/cray_xe6 with gcc-4.7.2.  Is there any problem in my
> environment?
> 
> Thanks,
> Keita 
> 
> CC   oob_tcp.lo
> oob_tcp.c:353:7: error: expected identifier or '(' before 'else'
> oob_tcp.c:358:5: warning: data definition has no type or storage class
> [enabled by default]
> oob_tcp.c:358:5: warning: type defaults to 'int' in declaration of
> 'mca_oob_tcp_ipv4_dynamic_ports' [enabled by default]
> oob_tcp.c:358:5: error: conflicting types for
> 'mca_oob_tcp_ipv4_dynamic_ports'
> oob_tcp.c:140:14: note: previous definition of
> 'mca_oob_tcp_ipv4_dynamic_ports' was here
> oob_tcp.c:358:38: warning: initialization makes integer from pointer
> without a cast [enabled by default]
> oob_tcp.c:359:6: error: expected identifier or '(' before 'void'
> oob_tcp.c:367:5: error: expected identifier or '(' before 'if'
> oob_tcp.c:380:7: error: expected identifier or '(' before 'else'
> oob_tcp.c:384:26: error: expected '=', ',', ';', 'asm' or '__attribute__'
> before '.' token
> oob_tcp.c:385:30: error: expected declaration specifiers or '...' before
> string constant
> oob_tcp.c:385:48: error: expected declaration specifiers or '...' before
> 'disable_family_values'
> oob_tcp.c:385:71: error: expected declaration specifiers or '...' before
> '&' token
> oob_tcp.c:386:6: error: expected identifier or '(' before 'void'
> oob_tcp.c:391:5: error: expected identifier or '(' before 'do'
> oob_tcp.c:391:5: error: expected identifier or '(' before 'while'
> oob_tcp.c:448:5: error: expected identifier or '(' before 'return'
> oob_tcp.c:449:1: error: expected identifier or '(' before '}' token
> make[2]: *** [oob_tcp.lo] Error 1
> make[2]: Leaving directory
> `/ufs/home/knteran/openmpi-1.7.3/orte/mca/oob/tcp'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte'
> 
> 
> 
> 
> 
> On 11/26/13 3:54 PM, "Nathan Hjelm"  wrote:
> 
> >Alright, everything is identical to Cielito but it looks like you are
> >getting
> >bad data from alps.
> >
> >I think we changed some of the alps parsing for 1.7.3. Can you give that
> >version a try and let me know if it resolves your issue. If not I can add
> >better debugging to the ras/alps module.
> >
> >-Nathan
> >
> >On Tue, Nov 26, 2013 at 11:50:00PM +, Teranishi, Keita wrote:
> >> Here is what we can see:
> >> 
> >> knteran@mzlogin01e:~> ls -l /opt/cray/xe-sysroot
> >> total 8
> >> drwxr-xr-x 6 bin  bin  4096 2012-02-04 11:05
> >>4.0.36.securitypatch.20111221
> >> drwxr-xr-x 6 bin  bin  4096 2013-01-11 15:17 4.1.40
> >> lrwxrwxrwx 1 root root6 2013-01-11 15:19 default -> 4.1.40
> >> 
> >> Thanks,
> >> Keita
> >> 
> >> 
> >> 
> >> 
> >> On 11/26/13 3:19 PM, "Nathan Hjelm"  wrote:
> >> 
> >> >??? Alps reports that the two nodes each have one slot. What PE release
> >> >are you using. A quick way to find out is ls -l /opt/cray/xe-sysroot on
> >> >the
> >> >external login node (this directory does not exist on the internal
> >>login
> >> >nodes.)
> >> >
> >> >-Nathan
> >> >
> >> >On Tue, Nov 26, 2013 at 11:07:36PM +, Teranishi, Keita wrote:
> >> >> Nathan,
> >> >> 
> >> >> Here it is.
> >> >> 
> >> >> Keita
> >> >> 
> >> >> 
> >> >> 
> >> >> 
> >> >> 
> >> >> On 11/26/13 3:02 PM, "Nathan Hjelm"  wrote:
> >> >> 
> >> >> >Ok, that sheds a little more light on the situation. For some
> >>reason it
> >> >> >sees 2 nodes
> >> >> >apparently with one slot each. One more set out outputs would be
> >> >>helpful.
> >> >> >Please run
> >> >> >with -mca ras_base_verbose 100 . That way I can see what was read
> >>from
> >> >> >alps.
> >> >> >
> >> >> >-Nathan
> >> >> >
> >> >> >On Tue, Nov 26, 2013 at 10:14:11PM +, Teranishi, Keita wrote:
> >> >> >> Nathan,
> >> >> >> 
> >> >> >> I am hoping these files would help you.
> >> >> >> 
> >> >> >> Thanks,
> >> >> >> Keita
> >> >> >> 
> >> >> >> 
> >> >> >> 
> >> >> >> On 11/26/13 1:41 PM, "Nathan Hjelm"  wrote:
> >> >> >> 
> >> >> >> >Well, no hints as to the error there. Looks identical to the
> >>output
> >> >>on
> >> >> >>my
> >> >> >> >XE-6. How
> >> >> >> >about setting -mca rmaps_base_verbose 100 . See what is going on
> >> >>with
> >> >> >>the
> >> >> >> >mapper.
> >> >> >> >
> >> >> >> >-Nathan Hjelm
> >> >> >> >Application Readiness, HPC-5, LANL
> >> >> >> >
> >> >> >> >On Tue, Nov 26, 2013 at 09:33:20PM +, Teranishi, Keita wrote:
> >> >> >> >> Nathan,
> >> >> >> >> 
> >> >> >> >> Please see the attached obtained from two cases (-np 2 and -np
> >>4).
> >> >> >> >> 
> >> >> >> >> Thanks,
> >> >> >> >> 
> >> >> >> 
> >> >> 
> >> 
> 

Re: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple

2013-12-02 Thread Jeff Squyres (jsquyres)
I'm joining this thread late, but I think I know what is going on:

- I am able to replicate the hang with 1.7.3 on Mavericks (with threading 
enabled, etc.)
- I notice that the hang has disappeared at the 1.7.x branch head (also on 
Mavericks)

Meaning: can you try with the latest 1.7.x nightly tarball and verify that the 
problem disappears for you?  See http://www.open-mpi.org/nightly/v1.7/

Ralph recently brought over a major ORTE control message change to the 1.7.x 
branch (after 1.7.3 was released) that -- skipping lots of details -- changes 
how the shared memory bootstrapping works.  Based on the stack traces you sent 
and the ones I was also able to get, I'm thinking that Ralph's big ORTE change 
fixes this issue.



On Nov 25, 2013, at 10:52 PM, Dominique Orban  wrote:

> 
> On 2013-11-25, at 9:02 PM, Ralph Castain  wrote:
> 
>> On Nov 25, 2013, at 5:04 PM, Pierre Jolivet  wrote:
>> 
>>> 
>>> On Nov 24, 2013, at 3:03 PM, Jed Brown  wrote:
>>> 
 Ralph Castain  writes:
 
> Given that we have no idea what Homebrew uses, I don't know how we
> could clarify/respond.
 
>>> 
>>> Ralph, it is pretty easy to know what Homebrew uses, c.f. 
>>> https://github.com/mxcl/homebrew/blob/master/Library/Formula/open-mpi.rb 
>>> (sorry if you meant something else).
>> 
>> Might be a surprise, but I don't track all these guys :-)
>> 
>> Homebrew is new to me
>> 
>>> 
 Pierre provided a link to MacPorts saying that all of the following
 options were needed to properly enable threads.
 
 --enable-event-thread-support --enable-opal-multi-threads 
 --enable-orte-progress-threads --enable-mpi-thread-multiple
 
 If that is indeed the case, and if passing some subset of these options
 results in deadlock, it's not exactly user-friendly.
 
 Maybe --enable-mpi-thread-multiple is enough, in which case MacPorts is
 doing something needlessly complicated and Pierre's link was a red
 herring?
>>> 
>>> That is very likely, though on the other hand, Homebrew is doing something 
>>> pretty straightforward. I just wanted a quick and easy fix back when I had 
>>> the same hanging issue, but there should be a better explanation if 
>>> --enable-mpi-thread-multiple is indeed enough.
>> 
>> It is enough - we set all required things internally
> 
> Is that for sure? My original message originates from a hang in the PETSc 
> tests and I get quite different results depending on whether I compile 
> OpenMPI with --enable-mpi-thread-multiple only or not.
> 
> I recompiled PETSc with debugging enabled against OpenMPI built with the 
> "correct" flags mentioned by Pierre, and this the stack trace I get:
> 
> $ mpirun -n 2 xterm -e gdb ./ex5
> 
>   ^C
>   Program received signal SIGINT, Interrupt.
>   0x7fff991160fa in __psynch_cvwait ()
>  from /usr/lib/system/libsystem_kernel.dylib
>   (gdb) where
>   #0  0x7fff991160fa in __psynch_cvwait ()
>  from /usr/lib/system/libsystem_kernel.dylib
>   #1  0x7fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
> 
> 
>   ^C
>   Program received signal SIGINT, Interrupt.
>   0x7fff991160fa in __psynch_cvwait ()
>  from /usr/lib/system/libsystem_kernel.dylib
>   (gdb) where
>   #0  0x7fff991160fa in __psynch_cvwait ()
>  from /usr/lib/system/libsystem_kernel.dylib
>   #1  0x7fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
> 
> 
> If I recompile PETSc against OpenMPI built with --enable-mpi-thread-multiple 
> only (leaving out the other flags, which Pierre suggested is wrong), I get 
> the following traces:
> 
>   ^C
>   Program received signal SIGINT, Interrupt.
>   0x7fff991160fa in __psynch_cvwait ()
>  from /usr/lib/system/libsystem_kernel.dylib
>   (gdb) where
>   #0  0x7fff991160fa in __psynch_cvwait ()
>  from /usr/lib/system/libsystem_kernel.dylib
>   #1  0x7fff98d6ffb9 in ?? () from /usr/lib/system/libsystem_c.dylib
> 
> 
>   ^C
>   Program received signal SIGINT, Interrupt.
>   0x000101edca28 in mca_common_sm_init ()
>  from /usr/local/Cellar/open-mpi/1.7.3/lib/libmca_common_sm.4.dylib
>   (gdb) where
>   #0  0x000101edca28 in mca_common_sm_init ()
>  from /usr/local/Cellar/open-mpi/1.7.3/lib/libmca_common_sm.4.dylib
>   #1  0x000101ed8a38 in mca_mpool_sm_init ()
>  from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_mpool_sm.so
>   #2  0x000101c383fa in mca_mpool_base_module_create ()
>  from /usr/local/lib/libmpi.1.dylib
>   #3  0x000102933b41 in mca_btl_sm_add_procs ()
>  from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_btl_sm.so
>   #4  0x000102929dfb in mca_bml_r2_add_procs ()
>  from /usr/local/Cellar/open-mpi/1.7.3/lib/openmpi/mca_bml_r2.so
>   #5  0x00010290a59c in mca_pml_ob1_add_procs ()
>  from /usr/local/Cellar/open-m

Re: [OMPI users] [EXTERNAL] Re: (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?

2013-12-02 Thread Teranishi, Keita
Nathan,

It is working!

Thanks,
-----------------------------------
Keita Teranishi
Principal Member of Technical Staff
Scalable Modeling and Analysis Systems
Sandia National Laboratories
Livermore, CA 94551
+1 (925) 294-3738





On 12/2/13 2:28 PM, "Nathan Hjelm"  wrote:

>Ack, forgot about that. There is a bug in 1.7.3 that breaks one of LANL's
>default
>settings. Just change the line in
>contrib/platform/lanl/cray_xe6/optimized-common
>
>from:
>
>enable_orte_static_ports=no
>
>to:
>
>enable_orte_static_ports=yes
>
>
>That should work.
>
>-Nathan
>
>On Wed, Nov 27, 2013 at 08:05:48PM +, Teranishi, Keita wrote:
>> Nathan,
>> 
>> I got a compile-time error (see below).  I use a script from
>> contrib/platform/lanl/cray_xe6 with gcc-4.7.2.  Is there any problem in
>>my
>> environment?
>> 
>> Thanks,
>> Keita 
>> 
>> CC   oob_tcp.lo
>> oob_tcp.c:353:7: error: expected identifier or '(' before 'else'
>> oob_tcp.c:358:5: warning: data definition has no type or storage class
>> [enabled by default]
>> oob_tcp.c:358:5: warning: type defaults to 'int' in declaration of
>> 'mca_oob_tcp_ipv4_dynamic_ports' [enabled by default]
>> oob_tcp.c:358:5: error: conflicting types for
>> 'mca_oob_tcp_ipv4_dynamic_ports'
>> oob_tcp.c:140:14: note: previous definition of
>> 'mca_oob_tcp_ipv4_dynamic_ports' was here
>> oob_tcp.c:358:38: warning: initialization makes integer from pointer
>> without a cast [enabled by default]
>> oob_tcp.c:359:6: error: expected identifier or '(' before 'void'
>> oob_tcp.c:367:5: error: expected identifier or '(' before 'if'
>> oob_tcp.c:380:7: error: expected identifier or '(' before 'else'
>> oob_tcp.c:384:26: error: expected '=', ',', ';', 'asm' or
>>'__attribute__'
>> before '.' token
>> oob_tcp.c:385:30: error: expected declaration specifiers or '...' before
>> string constant
>> oob_tcp.c:385:48: error: expected declaration specifiers or '...' before
>> 'disable_family_values'
>> oob_tcp.c:385:71: error: expected declaration specifiers or '...' before
>> '&' token
>> oob_tcp.c:386:6: error: expected identifier or '(' before 'void'
>> oob_tcp.c:391:5: error: expected identifier or '(' before 'do'
>> oob_tcp.c:391:5: error: expected identifier or '(' before 'while'
>> oob_tcp.c:448:5: error: expected identifier or '(' before 'return'
>> oob_tcp.c:449:1: error: expected identifier or '(' before '}' token
>> make[2]: *** [oob_tcp.lo] Error 1
>> make[2]: Leaving directory
>> `/ufs/home/knteran/openmpi-1.7.3/orte/mca/oob/tcp'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte'
>> 
>> 
>> 
>> 
>> 
>> On 11/26/13 3:54 PM, "Nathan Hjelm"  wrote:
>> 
>> >Alright, everything is identical to Cielito but it looks like you are
>> >getting
>> >bad data from alps.
>> >
>> >I think we changed some of the alps parsing for 1.7.3. Can you give
>>that
>> >version a try and let me know if it resolves your issue. If not I can
>>add
>> >better debugging to the ras/alps module.
>> >
>> >-Nathan
>> >
>> >On Tue, Nov 26, 2013 at 11:50:00PM +, Teranishi, Keita wrote:
>> >> Here is what we can see:
>> >> 
>> >> knteran@mzlogin01e:~> ls -l /opt/cray/xe-sysroot
>> >> total 8
>> >> drwxr-xr-x 6 bin  bin  4096 2012-02-04 11:05
>> >>4.0.36.securitypatch.20111221
>> >> drwxr-xr-x 6 bin  bin  4096 2013-01-11 15:17 4.1.40
>> >> lrwxrwxrwx 1 root root6 2013-01-11 15:19 default -> 4.1.40
>> >> 
>> >> Thanks,
>> >> Keita
>> >> 
>> >> 
>> >> 
>> >> 
>> >> On 11/26/13 3:19 PM, "Nathan Hjelm"  wrote:
>> >> 
>> >> >??? Alps reports that the two nodes each have one slot. What PE
>>release
>> >> >are you using. A quick way to find out is ls -l
>>/opt/cray/xe-sysroot on
>> >> >the
>> >> >external login node (this directory does not exist on the internal
>> >>login
>> >> >nodes.)
>> >> >
>> >> >-Nathan
>> >> >
>> >> >On Tue, Nov 26, 2013 at 11:07:36PM +, Teranishi, Keita wrote:
>> >> >> Nathan,
>> >> >> 
>> >> >> Here it is.
>> >> >> 
>> >> >> Keita
>> >> >> 
>> >> >> 
>> >> >> 
>> >> >> 
>> >> >> 
>> >> >> On 11/26/13 3:02 PM, "Nathan Hjelm"  wrote:
>> >> >> 
>> >> >> >Ok, that sheds a little more light on the situation. For some
>> >>reason it
>> >> >> >sees 2 nodes
>> >> >> >apparently with one slot each. One more set out outputs would be
>> >> >>helpful.
>> >> >> >Please run
>> >> >> >with -mca ras_base_verbose 100 . That way I can see what was read
>> >>from
>> >> >> >alps.
>> >> >> >
>> >> >> >-Nathan
>> >> >> >
>> >> >> >On Tue, Nov 26, 2013 at 10:14:11PM +, Teranishi, Keita wrote:
>> >> >> >> Nathan,
>> >> >> >> 
>> >> >> >> I am hoping these files would help you.
>> >> >> >> 
>> >> >> >> Thanks,
>> >> >> >> Keita
>> >> >> >> 
>> >> >> >> 
>> >> >> >> 
>> >> >> >> On 11/26/13 1:41 PM, "Nathan Hjelm"  wrote:
>> >> >> >> 
>> >> >> >> >Well, no hints as to the error there. Looks identical to the
>> >>output
>> >> >>on
>> >> >> >>my
>> >> >> >> >XE-6. How
>> >> >> >> >about setting -mca rmaps_base_verbose 100 

Re: [OMPI users] [EXTERNAL] Re: (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?

2013-12-02 Thread Ralph Castain
FWIW: that has been fixed with the current head of the 1.7 branch (will be
in 1.7.4 release)



On Mon, Dec 2, 2013 at 2:28 PM, Nathan Hjelm  wrote:

> Ack, forgot about that. There is a bug in 1.7.3 that breaks one of LANL's
> default
> settings. Just change the line in
> contrib/platform/lanl/cray_xe6/optimized-common
>
> from:
>
> enable_orte_static_ports=no
>
> to:
>
> enable_orte_static_ports=yes
>
>
> That should work.
>
> -Nathan
>
> On Wed, Nov 27, 2013 at 08:05:48PM +, Teranishi, Keita wrote:
> > Nathan,
> >
> > I got a compile-time error (see below).  I use a script from
> > contrib/platform/lanl/cray_xe6 with gcc-4.7.2.  Is there any problem in
> my
> > environment?
> >
> > Thanks,
> > Keita
> >
> > CC   oob_tcp.lo
> > oob_tcp.c:353:7: error: expected identifier or '(' before 'else'
> > oob_tcp.c:358:5: warning: data definition has no type or storage class
> > [enabled by default]
> > oob_tcp.c:358:5: warning: type defaults to 'int' in declaration of
> > 'mca_oob_tcp_ipv4_dynamic_ports' [enabled by default]
> > oob_tcp.c:358:5: error: conflicting types for
> > 'mca_oob_tcp_ipv4_dynamic_ports'
> > oob_tcp.c:140:14: note: previous definition of
> > 'mca_oob_tcp_ipv4_dynamic_ports' was here
> > oob_tcp.c:358:38: warning: initialization makes integer from pointer
> > without a cast [enabled by default]
> > oob_tcp.c:359:6: error: expected identifier or '(' before 'void'
> > oob_tcp.c:367:5: error: expected identifier or '(' before 'if'
> > oob_tcp.c:380:7: error: expected identifier or '(' before 'else'
> > oob_tcp.c:384:26: error: expected '=', ',', ';', 'asm' or '__attribute__'
> > before '.' token
> > oob_tcp.c:385:30: error: expected declaration specifiers or '...' before
> > string constant
> > oob_tcp.c:385:48: error: expected declaration specifiers or '...' before
> > 'disable_family_values'
> > oob_tcp.c:385:71: error: expected declaration specifiers or '...' before
> > '&' token
> > oob_tcp.c:386:6: error: expected identifier or '(' before 'void'
> > oob_tcp.c:391:5: error: expected identifier or '(' before 'do'
> > oob_tcp.c:391:5: error: expected identifier or '(' before 'while'
> > oob_tcp.c:448:5: error: expected identifier or '(' before 'return'
> > oob_tcp.c:449:1: error: expected identifier or '(' before '}' token
> > make[2]: *** [oob_tcp.lo] Error 1
> > make[2]: Leaving directory
> > `/ufs/home/knteran/openmpi-1.7.3/orte/mca/oob/tcp'
> > make[1]: *** [all-recursive] Error 1
> > make[1]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte'
> >
> >
> >
> >
> >
> > On 11/26/13 3:54 PM, "Nathan Hjelm"  wrote:
> >
> > >Alright, everything is identical to Cielito but it looks like you are
> > >getting
> > >bad data from alps.
> > >
> > >I think we changed some of the alps parsing for 1.7.3. Can you give that
> > >version a try and let me know if it resolves your issue. If not I can
> add
> > >better debugging to the ras/alps module.
> > >
> > >-Nathan
> > >
> > >On Tue, Nov 26, 2013 at 11:50:00PM +, Teranishi, Keita wrote:
> > >> Here is what we can see:
> > >>
> > >> knteran@mzlogin01e:~> ls -l /opt/cray/xe-sysroot
> > >> total 8
> > >> drwxr-xr-x 6 bin  bin  4096 2012-02-04 11:05
> > >>4.0.36.securitypatch.20111221
> > >> drwxr-xr-x 6 bin  bin  4096 2013-01-11 15:17 4.1.40
> > >> lrwxrwxrwx 1 root root6 2013-01-11 15:19 default -> 4.1.40
> > >>
> > >> Thanks,
> > >> Keita
> > >>
> > >>
> > >>
> > >>
> > >> On 11/26/13 3:19 PM, "Nathan Hjelm"  wrote:
> > >>
> > >> >??? Alps reports that the two nodes each have one slot. What PE
> release
> > >> >are you using. A quick way to find out is ls -l /opt/cray/xe-sysroot
> on
> > >> >the
> > >> >external login node (this directory does not exist on the internal
> > >>login
> > >> >nodes.)
> > >> >
> > >> >-Nathan
> > >> >
> > >> >On Tue, Nov 26, 2013 at 11:07:36PM +, Teranishi, Keita wrote:
> > >> >> Nathan,
> > >> >>
> > >> >> Here it is.
> > >> >>
> > >> >> Keita
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >> On 11/26/13 3:02 PM, "Nathan Hjelm"  wrote:
> > >> >>
> > >> >> >Ok, that sheds a little more light on the situation. For some
> > >>reason it
> > >> >> >sees 2 nodes
> > >> >> >apparently with one slot each. One more set out outputs would be
> > >> >>helpful.
> > >> >> >Please run
> > >> >> >with -mca ras_base_verbose 100 . That way I can see what was read
> > >>from
> > >> >> >alps.
> > >> >> >
> > >> >> >-Nathan
> > >> >> >
> > >> >> >On Tue, Nov 26, 2013 at 10:14:11PM +, Teranishi, Keita wrote:
> > >> >> >> Nathan,
> > >> >> >>
> > >> >> >> I am hoping these files would help you.
> > >> >> >>
> > >> >> >> Thanks,
> > >> >> >> Keita
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> On 11/26/13 1:41 PM, "Nathan Hjelm"  wrote:
> > >> >> >>
> > >> >> >> >Well, no hints as to the error there. Looks identical to the
> > >>output
> > >> >>on
> > >> >> >>my
> > >> >> >> >XE-6. How
> > >> >> >> >about setting -mca rmaps_base_verbose 100 . See what is going
> on
> > >> >>with
> > >> >> >>the
> > >> >>

[OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple -- part II

2013-12-02 Thread Eric Chamberland

Hi,

I'm just opening a new "chapter" with the same subject. ;-)

We are using OpenMPI 1.6.5 (compiled with --enable-thread-multiple) with 
Petsc 3.4.3 (on colosse supercomputer: 
http://www.calculquebec.ca/en/resources/compute-servers/colosse). We 
observed a deadlock with threads within the openib btl.


We successfully bypassed the deadlock by 2 different ways:

#1- launching the code with "--mca btl ^openib"

#2- compiling OpenMPI 1.6.5 *without* the "--enable-thread-multiple" option.
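
(For reference, a minimal sketch -- illustrative, not taken from our application --
of requesting MPI_THREAD_MULTIPLE and checking what the library actually grants:)

/* Request full thread support and verify the level that was provided. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("MPI_THREAD_MULTIPLE not available (provided = %d)\n", provided);

    MPI_Finalize();
    return 0;
}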

When the code hangs, here are some backtraces (on different processes) 
that we got:


#0  0x7fb4a6a03795 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x7fb49db7ea7b in ?? () from /usr/lib64/libmlx4-rdmav2.so
#2  0x7fb4a878d469 in ibv_poll_cq () at
/usr/include/infiniband/verbs.h:884
#3  poll_device () at
../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3563 


#4  progress_one_device () at
../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3694 


#5  btl_openib_component_progress () at
../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719 


#6  0x7fb4a8973d32 in opal_progress () at
../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
#7  0x7fb4a87404f0 in opal_condition_wait (count=25695904,
requests=0x100, statuses=0x7fff9b7f1320) at
../../openmpi-1.6.5/opal/threads/condition.h:92
#8  ompi_request_default_wait_all (count=25695904, requests=0x100,
statuses=0x7fff9b7f1320) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263




#0  0x7f731d1100b8 in pthread_mutex_unlock () from
/lib64/libpthread.so.0
#1  0x7f731ee9b3b7 in opal_mutex_unlock () at
../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:123
#2  progress_one_device () at
../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3688 


#3  btl_openib_component_progress () at
../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719 


#4  0x7f731f081d32 in opal_progress () at
../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
#5  0x7f731ee4e4f0 in opal_condition_wait (count=25649104,
requests=0x0, statuses=0x1875fd0) at
../../openmpi-1.6.5/opal/threads/condition.h:92
#6  ompi_request_default_wait_all (count=25649104, requests=0x0,
statuses=0x1875fd0) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
#7  0x7f731eec2644 in
ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1875fd0,
rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc, op=0x1875fd0,
comm=0x5e80, module=0xca4ec20)
  at
../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
#8  0x7f731eebe2ec in ompi_coll_tuned_allreduce_intra_dec_fixed
(sbuf=0x1875fd0, rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc,
op=0x1875fd0, comm=0x5e80, module=0x159d8330)
  at
../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61 


#9  0x7f731ee5cad9 in PMPI_Allreduce (sendbuf=0x1875fd0,
recvbuf=0x0, count=25649104, datatype=0x7f72ce8f80fc, op=0x1875fd0,
comm=0x5e80) at pallreduce.c:105



#0  opal_progress () at 
../../openmpi-1.6.5/opal/runtime/opal_progress.c:206

#1  0x7f8e3d8844f0 in opal_condition_wait (count=0, requests=0x0,
statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/opal/threads/condition.h:92
#2  ompi_request_default_wait_all (count=0, requests=0x0,
statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
#3  0x7f8e3d8f8644 in
ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x0, rbuf=0x0,
count=1037994528, dtype=0x1, op=0x0, comm=0x60bb, module=0xcb86ce0)
  at
../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_allreduce.c:223
#4  0x7f8e3d8f42ec in ompi_coll_tuned_allreduce_intra_dec_fixed
(sbuf=0x0, rbuf=0x0, count=1037994528, dtype=0x1, op=0x0, comm=0x60bb,
module=0x171d59a0)
  at
../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61 


#5  0x7f8e3d892ad9 in PMPI_Allreduce (sendbuf=0x0, recvbuf=0x0,
count=1037994528, datatype=0x1, op=0x0, comm=0x60bb) at pallreduce.c:105



#0  0x7f7ef7d0b258 in pthread_mutex_lock@plt () from
/software/MPI/openmpi/1.6.5_intel/lib/libmpi.so.1
#1  0x7f7ef7d72377 in opal_mutex_lock () at
../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:109
#2  progress_one_device () at
../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3650 


#3  btl_openib_component_progress () at
../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719 


#4  0x7f7ef7f58d32 in opal_progress () at
../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
#5  0x7f7ef7d254f0 in opal_condition_wait (count=25625488,
requests=0x0, statuses=0x7f7ef8324208) at
../../openmpi-1.6.5/opal/threads/condition.h:92
#6  ompi_request_default_wait_all (count=25625488, requests=0x0,
statuses=0x7f7ef8324208) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
#7  0x7f7ef7d99644 in
ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1870390,
rbuf=0x0, count=-130924024,

Re: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple -- part II

2013-12-02 Thread Ralph Castain
No surprise there - that's known behavior. As has been said, we hope to
extend the thread-multiple support in the 1.9 series.



On Mon, Dec 2, 2013 at 6:33 PM, Eric Chamberland <
eric.chamberl...@giref.ulaval.ca> wrote:

> Hi,
>
> I just open a new "chapter" with the same subject. ;-)
>
> We are using OpenMPI 1.6.5 (compiled with --enable-thread-multiple) with
> Petsc 3.4.3 (on colosse supercomputer: http://www.calculquebec.ca/en/
> resources/compute-servers/colosse). We observed a deadlock with threads
> within the openib btl.
>
> We successfully bypassed the deadlock by 2 different ways:
>
> #1- launching the code with "--mca btl ^openib"
>
> #2- compiling OpenMPI 1.6.5 *without* the "--enable-thread-multiple"
> option.
>
> When the code hangs, here are some backtraces (on different processes)
> that we got:
>
> #0  0x7fb4a6a03795 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1  0x7fb49db7ea7b in ?? () from /usr/lib64/libmlx4-rdmav2.so
> #2  0x7fb4a878d469 in ibv_poll_cq () at
> /usr/include/infiniband/verbs.h:884
> #3  poll_device () at
> ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3563
>
> #4  progress_one_device () at
> ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3694
>
> #5  btl_openib_component_progress () at
> ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
>
> #6  0x7fb4a8973d32 in opal_progress () at
> ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #7  0x7fb4a87404f0 in opal_condition_wait (count=25695904,
> requests=0x100, statuses=0x7fff9b7f1320) at
> ../../openmpi-1.6.5/opal/threads/condition.h:92
> #8  ompi_request_default_wait_all (count=25695904, requests=0x100,
> statuses=0x7fff9b7f1320) at ../../openmpi-1.6.5/ompi/
> request/req_wait.c:263
>
>
>
>
> #0  0x7f731d1100b8 in pthread_mutex_unlock () from
> /lib64/libpthread.so.0
> #1  0x7f731ee9b3b7 in opal_mutex_unlock () at
> ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:123
> #2  progress_one_device () at
> ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3688
>
> #3  btl_openib_component_progress () at
> ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
>
> #4  0x7f731f081d32 in opal_progress () at
> ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #5  0x7f731ee4e4f0 in opal_condition_wait (count=25649104,
> requests=0x0, statuses=0x1875fd0) at
> ../../openmpi-1.6.5/opal/threads/condition.h:92
> #6  ompi_request_default_wait_all (count=25649104, requests=0x0,
> statuses=0x1875fd0) at ../../openmpi-1.6.5/ompi/request/req_wait.c:263
> #7  0x7f731eec2644 in
> ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x1875fd0,
> rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc, op=0x1875fd0,
> comm=0x5e80, module=0xca4ec20)
>   at
> ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_
> tuned_allreduce.c:223
> #8  0x7f731eebe2ec in ompi_coll_tuned_allreduce_intra_dec_fixed
> (sbuf=0x1875fd0, rbuf=0x0, count=25649104, dtype=0x7f72ce8f80fc,
> op=0x1875fd0, comm=0x5e80, module=0x159d8330)
>   at
> ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
>
> #9  0x7f731ee5cad9 in PMPI_Allreduce (sendbuf=0x1875fd0,
> recvbuf=0x0, count=25649104, datatype=0x7f72ce8f80fc, op=0x1875fd0,
> comm=0x5e80) at pallreduce.c:105
>
>
>
> #0  opal_progress () at ../../openmpi-1.6.5/opal/
> runtime/opal_progress.c:206
> #1  0x7f8e3d8844f0 in opal_condition_wait (count=0, requests=0x0,
> statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/opal/
> threads/condition.h:92
> #2  ompi_request_default_wait_all (count=0, requests=0x0,
> statuses=0x7f8e3dde8a20) at ../../openmpi-1.6.5/ompi/
> request/req_wait.c:263
> #3  0x7f8e3d8f8644 in
> ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x0, rbuf=0x0,
> count=1037994528, dtype=0x1, op=0x0, comm=0x60bb, module=0xcb86ce0)
>   at
> ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_
> tuned_allreduce.c:223
> #4  0x7f8e3d8f42ec in ompi_coll_tuned_allreduce_intra_dec_fixed
> (sbuf=0x0, rbuf=0x0, count=1037994528, dtype=0x1, op=0x0, comm=0x60bb,
> module=0x171d59a0)
>   at
> ../../../../../openmpi-1.6.5/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:61
>
> #5  0x7f8e3d892ad9 in PMPI_Allreduce (sendbuf=0x0, recvbuf=0x0,
> count=1037994528, datatype=0x1, op=0x0, comm=0x60bb) at pallreduce.c:105
>
>
>
> #0  0x7f7ef7d0b258 in pthread_mutex_lock@plt () from
> /software/MPI/openmpi/1.6.5_intel/lib/libmpi.so.1
> #1  0x7f7ef7d72377 in opal_mutex_lock () at
> ../../../../../openmpi-1.6.5/opal/threads/mutex_unix.h:109
> #2  progress_one_device () at
> ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3650
>
> #3  btl_openib_component_progress () at
> ../../../../../openmpi-1.6.5/ompi/mca/btl/openib/btl_openib_component.c:3719
>
> #4  0x7f7ef7f58d32 in opal_progress () at
> ../../openmpi-1.6.5/opal/runtime/opal_progress.c:207
> #5