Brian and George,

I do not know if the stack trace is complete, but I do not see any mx_* functions in it; their presence would point to a crash inside MX caused by multiple threads trying to complete the same request. It does show a failed assert.

Francois, is the stack trace from the MX MTL or BTL? Can you send a small program that reproduces this abort?

Scott


On Jun 11, 2009, at 12:25 PM, Brian Barrett wrote:

Neither the CM PML nor the MX MTL has been examined for thread safety. There's not much code in the CM PML that could cause problems. The MX MTL would likely need some work to ensure the restrictions Scott mentioned are met (currently, there's no such guarantee in the MX MTL).

Brian

On Jun 11, 2009, at 10:21 AM, George Bosilca wrote:

The comment in the FAQ (and on the other thread) is only true for some of the BTLs (TCP, SM, and MX). I don't have the resources to test the other BTLs; it is their developers' responsibility to make the modifications required for thread safety.

In addition, I have to confess that I have never tested the MTL for thread safety. It is a completely different implementation of message passing, meant to map directly onto the capabilities of the underlying network. However, there are clearly a few places where thread safety should be enforced in the MTL layer, and I don't know whether that is the case.

george.

On Jun 11, 2009, at 09:35, Scott Atchley wrote:

Francois,

For threads, the FAQ has:

http://www.open-mpi.org/faq/?category=supported-systems#thread-support

It mentions that thread support is designed in, but lightly tested. It is also possible that the FAQ is out of date and MPI_THREAD_MULTIPLE is fully supported.
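
Whatever the FAQ says, a program can at least verify at run time which thread level the library actually grants. A minimal check (my sketch, not from the FAQ; standard MPI calls only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full multi-threading and check what the library grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got level %d)\n",
                provided);

    MPI_Finalize();
    return 0;
}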

The stack trace below shows:

opal_free()
opal_progress()
MPI_Recv()

I do not know this code, but the problem may lie in the higher-level code that calls the BTLs and/or MTLs; that would be a good place to check whether the TCP BTL is handled differently from the MX BTL/MTL.

MX is thread safe, with the caveat that two threads may not try to complete the same request at the same time. This applies to mx_test(), mx_wait(), mx_test_any(), and mx_wait_any(), where the latter two take match bits and a match mask that could complete a request being tested or waited on by another thread.
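
For what it's worth, here is one pattern that respects that caveat (a minimal sketch of mine, not Open MPI code): each thread encodes its id in the upper match bits and only ever completes requests it posted itself. It assumes the usual MX prototypes from myriexpress.h; the bit layout is illustrative.

#include <myriexpress.h>

#define THREAD_TAG(id) ((uint64_t)(id) << 32)  /* illustrative per-thread match bits */
#define THREAD_MASK    0xFFFFFFFF00000000ULL   /* match only on the thread id */

/* Post and complete a receive that only this thread can touch. */
static void recv_on_my_tag(mx_endpoint_t ep, int my_id, void *buf, uint32_t len)
{
    mx_segment_t seg;
    mx_request_t req;
    mx_status_t  status;
    uint32_t     done = 0;

    seg.segment_ptr    = buf;
    seg.segment_length = len;

    /* Only messages carrying this thread's id can match this request. */
    mx_irecv(ep, &seg, 1, THREAD_TAG(my_id), THREAD_MASK, NULL, &req);

    /* Complete only the request this thread posted; never call
     * mx_test_any()/mx_wait_any() with a mask wide enough to complete
     * a request being tested or waited on by another thread. */
    while (!done)
        mx_test(ep, &req, &status, &done);
}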

Scott

On Jun 11, 2009, at 6:00 AM, François Trahay wrote:

Well, according to George Bosilca (http://www.open-mpi.org/community/lists/users/2005/02/0005.php), threads are supported in Open MPI. The program I am trying to run works with the TCP stack, and the MX driver is thread-safe, so I guess the problem comes from the MX BTL or MTL.

Francois


Scott Atchley wrote:
Hi Francois,

I am not familiar with the internals of the OMPI code. Are you sure, however, that threads are fully supported yet? I was under the impression that thread support was still partial.

Can anyone else comment?

Scott

On Jun 8, 2009, at 8:43 AM, François Trahay wrote:

Hi,
I'm encountering some issues when running a multithreaded program with Open MPI (trunk rev. 21380, configured with --enable-mpi-threads). My program (included in the tar.bz2) uses several pthreads that perform ping-pongs concurrently (thread #1 uses tag #1, thread #2 uses tag #2, etc.). It crashes over MX (either the BTL or the MTL); the backtrace is below, after a rough sketch of the program's structure.
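
The sketch is mine, reconstructed from the description above rather than taken from the attached code; the thread count, message size, and names are illustrative:

#include <mpi.h>
#include <pthread.h>

#define NTHREADS 4     /* illustrative */
#define ITERS    1000  /* illustrative */

static int rank;

/* Each thread runs a ping-pong on its own tag (thread #i uses tag #i). */
static void *pingpong(void *arg)
{
    int  tag  = (int)(long)arg;
    int  peer = 1 - rank;          /* assumes exactly two ranks */
    char buf[64] = { 0 };

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t th[NTHREADS];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, pingpong, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);

    MPI_Finalize();
    return 0;
}

And the backtrace: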

concurrent_ping_v2: pml_cm_recvreq.c:53: mca_pml_cm_recv_request_completion: Assertion `0 == ((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete' failed.
[joe0:01709] *** Process received signal ***
[joe0:01709] *** Process received signal ***
[joe0:01709] Signal: Segmentation fault (11)
[joe0:01709] Signal code: Address not mapped (1)
[joe0:01709] Failing at address: 0x1238949c4
[joe0:01709] Signal: Aborted (6)
[joe0:01709] Signal code:  (-6)
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
[joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
[joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
[joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
[joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
[joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
[joe0:01709] [ 3] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_free+0x4e) [0x7f57238bdc69]
[joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b72f]
[joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
[joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081145a]
[joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
[joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
[joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
[joe0:01709] [10] ./concurrent_ping_v2(client+0x123) [0x401404]
[joe0:01709] [11] /lib/libpthread.so.0 [0x7f57240b6faa]
[joe0:01709] [12] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
[joe0:01709] *** End of error message ***
[joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208120bb]
[joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b80a]
[joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
[joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081147a]
[joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
[joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
[joe0:01709] [10] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
[joe0:01709] [11] ./concurrent_ping_v2(client+0x123) [0x401404]
[joe0:01709] [12] /lib/libpthread.so.0 [0x7f57240b6faa]
[joe0:01709] [13] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
[joe0:01709] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 1709 on node joe0 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------


Any idea?

Francois Trahay

<bug-report.tar.bz2>