Michael,
The MPI standard is quite clear: for correct and portable MPI code, you
are not allowed to use (void*)0 as the buffer argument. Use MPI_BOTTOM
instead.
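For illustration only, a minimal sketch of the change (the variable names
are taken from your call below, everything else is assumed): with a
datatype whose displacements are absolute addresses, the buffer argument
should be the predefined constant MPI_BOTTOM, not a literal null pointer:

    /* portable: MPI_BOTTOM marks the origin of the address space */
    MPI_Send(MPI_BOTTOM, 1, datatype, rank, tag, comm_);

    /* non-portable: a plain null pointer is not guaranteed to be equivalent */
    /* MPI_Send((void *)0, 1, datatype, rank, tag, comm_); */

The standard only guarantees this behavior for the named constant
MPI_BOTTOM, even though some implementations happen to define it as a
null pointer, which is why such code can appear to work elsewhere.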
We have plenty of tests that exercise exactly the behavior you describe in
your email, and they all pass. I will take a look at what happens, but I
need either the code or at least the part that creates the datatype.
Thanks,
george.
On Apr 11, 2007, at 3:54 AM, Michael Gauckler wrote:
Dear Open MPI User's and Developers,
I encountered a problem with Open MPI when porting an application that ran
successfully with LAM MPI and MPICH.
The program produces a segmentation fault (see [1] for the stack trace)
when calling MPI_Send with the following arguments:
MPI_Send((void *)0, 1, datatype, rank, tag, comm_);
The first argument looks wrong at first sight, but it is correct
because the argument "datatype" is an MPI_Datatype that describes the
memory layout of the object to be sent and is zero-based. The other
arguments are as expected: one such object is sent to rank "rank" with
tag "tag" via the communicator "comm_". The MPI_Datatype is constructed
programmatically from the object's member definitions using
MPI_Type_struct. The MPI types involved are solely
MPI_DOUBLE and MPI_UNSIGNED_INT.
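For illustration, a sketch of roughly what such a construction looks like
(the struct, its members and the helper name below are invented; the
actual code is not reproduced here). The member displacements are taken
as absolute addresses, which is what makes the datatype "zero-based":

    #include <mpi.h>

    /* Invented example object; the real members are not shown here. */
    struct entity {
        double       values[4];
        unsigned int id;
    };

    /* Build a struct datatype whose displacements are the absolute
     * addresses of the members of one particular object. */
    MPI_Datatype build_entity_type(struct entity *e)
    {
        int          lens[2]  = { 4, 1 };
        MPI_Datatype types[2] = { MPI_DOUBLE, MPI_UNSIGNED };
        MPI_Aint     disps[2];
        MPI_Datatype dtype;

        MPI_Get_address(e->values, &disps[0]);  /* absolute addresses */
        MPI_Get_address(&e->id,    &disps[1]);

        MPI_Type_create_struct(2, lens, disps, types, &dtype);
        MPI_Type_commit(&dtype);
        return dtype;
    }

    /* Sending one such object then uses MPI_BOTTOM (or, as in our
     * current code, (void*)0) as the buffer:
     *     MPI_Send(MPI_BOTTOM, 1, dtype, rank, tag, comm_);        */

(MPI_Get_address and MPI_Type_create_struct are the MPI-2 names of
MPI_Address and MPI_Type_struct, which our code still uses.)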
I can reproduce the problem with the stable 1.2 release as well as
the 1.2.1a snapshot of Open MPI.
My OS is Linux with kernel 2.6.18 (Debian Etch), running on standard
dual-Xeon hardware with GigE.
I tried to reduce the amount of data sent by excluding some of the
object's members from the transmission. No particular member or type
seems to cause the problem; rather, there appears to be a limit on the
number of members or on the data size that determines whether the call
succeeds. The "datatype" structure describes the type and location of
approx. 2,000,000 numbers. The data itself is approx. 16 MB (2M * 8
bytes/number, assuming doubles), which I would not expect to cause any
problem for an MPI implementation.
Thank you for any hints, ideas, or suggestions as to where the problem could be.
Regards,
Michael
[1]
[head:09133] *** Process received signal ***
[head:09133] Signal: Segmentation fault (11)
[head:09133] Signal code: Address not mapped (1)
[head:09133] Failing at address: 0xb0127475
[head:09133] [ 0] [0xb7f0f440]
[head:09133] [ 1] /usr/lib/libmpi.so.0(ompi_convertor_pack+0x90) [0xb668f9a0]
[head:09133] [ 2] /usr/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_prepare_src+0x210) [0xb56daef0]
[head:09133] [ 3] /usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_exclusive+0x1de) [0xb5726ede]
[head:09133] [ 4] /usr/lib/openmpi/mca_pml_ob1.so [0xb5728238]
[head:09133] [ 5] /usr/lib/openmpi/mca_btl_tcp.so [0xb56ddc65]
[head:09133] [ 6] /usr/lib/libopen-pal.so.0(opal_event_base_loop+0x462) [0xb65bcf12]
[head:09133] [ 7] /usr/lib/libopen-pal.so.0(opal_event_loop+0x29) [0xb65bcfd9]
[head:09133] [ 8] /usr/lib/libopen-pal.so.0(opal_progress+0xc0) [0xb65b7260]
[head:09133] [ 9] /usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x3e5) [0xb571f965]
[head:09133] [10] /usr/lib/libmpi.so.0(MPI_Send+0x12f) [0xb66abf0f]
[head:09133] [11] /opt/plato/release_1.0/bin/engine(_ZN2GP15MPIProcessGroup4sendERKNS_9MemoryMapEii+0xd9) [0x81cec03]
[head:09133] [12] /opt/plato/release_1.0/bin/engine(_ZN2GP15MPIProcessGroup4sendEN5boost10shared_ptrINS_6EntityEEEii+0x2d0) [0x81d0358]
[head:09133] [13] /opt/plato/release_1.0/bin/engine(_ZN2GP20ParallelDataAccessor4loadEN5boost10shared_ptrINS_6EntityEEE+0x23b) [0x853c939]
[head:09133] [14] /opt/plato/release_1.0/bin/engine(_ZN2GP12Transactions6createEPKN11xercesc_2_77DOMNodeE+0x57f) [0x8426553]
[head:09133] [15] /opt/plato/release_1.0/bin/engine(_ZN2GP7FactoryIN5boost10shared_ptrINS_7XmlBaseEEESsPFS4_PKN11xercesc_2_77DOMNodeEENS_19DefaultFactoryErrorEE12createObjectES8_+0x76) [0x81ca06a]
[head:09133] [16] /opt/plato/release_1.0/bin/engine(_ZN2GP16XmlFactoryParser7descentEPN11xercesc_2_77DOMNodeEb+0x5b2) [0x81cd700]
[head:09133] [17] /opt/plato/release_1.0/bin/engine(_ZN2GP9XmlParser8traverseEb+0x278) [0x81c1eca]
[head:09133] [18] /opt/plato/release_1.0/bin/engine(_ZN2GP16XmlFactoryParser8traverseEb+0x19) [0x81c9eeb]
[head:09133] [19] /opt/plato/release_1.0/bin/engine(main+0x1d23) [0x81617f7]
[head:09133] [20] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xc8) [0xb6348ea8]
[head:09133] [21] /opt/plato/release_1.0/bin/engine(__gxx_personality_v0+0x15d) [0x815a731]
[head:09133] *** End of error message ***
mpirun noticed that job rank 0 with PID 9133 on node head exited on signal 11 (Segmentation fault).
2 additional processes aborted (not shown)
"Half of what I say is meaningless; but I say it so that the other
half may reach you"
Kahlil Gibran