Re: [OMPI users] Datatype construction, serious limitation (was: Signal: Segmentation fault (11) Problem)

George Bosilca Wed, 18 Apr 2007 18:14:54 -0400

I am the developer and maintainer of the data-type engine in Open MPI, and I'm stunned (!). It never occurred to me that someone would ever use a data-type description that needs more than 32K entries on the internal stack.

Let me explain a little bit. The stack is used to efficiently parse the data-type description. The 32K limit is not a limit on the number of predefined MPI types in the data-type, but a limit on the number of different data descriptions (a description is something like a vector of a predefined type). As an example, an MPI_Type_struct with count 10 will use 11 entries. So in order to overflow this data description, one has to use an MPI_Type_struct with a count bigger than 32K (which might be the case with the BOOST library you're using in your code).
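To make the arithmetic concrete, here is a minimal sketch in C. It assumes, extrapolating from the count-10 example above, that a struct type with `count` members needs `count + 1` entries, and that entry positions have to fit in a signed 16-bit index. Both helper names are hypothetical and are not part of Open MPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: entries needed for a struct type with `count` members.
 * Assumption: one entry per member plus one extra, generalizing the
 * "count 10 uses 11 entries" example. */
static int64_t struct_entries(int64_t count)
{
    assert(count >= 0);
    return count + 1;
}

/* Hypothetical helper: true if the last entry position (entries - 1)
 * is still representable as a signed 16-bit index. */
static bool fits_16bit_index(int64_t entries)
{
    assert(entries >= 1);
    return entries - 1 <= (int64_t)INT16_MAX;
}
```

Under these assumptions, `fits_16bit_index(struct_entries(10))` holds, while a struct with 40000 members no longer fits.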

In conclusion, if your data-type description contains more than 32K entries, the current implementation will definitely not work for you. How many entries are in your data-type description? There is an easy way to figure out if this is the problem with your code. Attaching gdb to your process and setting a breakpoint in the ompi_generic_simple_pack function is the first step. Once there, running "call ompi_ddt_dump(pData)" in gdb will print a high-level description of the data as represented internally in Open MPI. If you can provide the output of this call, I can tell you in a few seconds whether this is the real issue or not.

However, this raises another question about the performance you expect from your code. A data description with more than 32K items cannot be efficiently optimized by any automatic data-type engine. Moreover, it cannot be easily parsed. I suggest that if it's possible to identify access patterns that are repetitive, one should use them in order to improve the data-type description.
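As a rough illustration of what "identifying a repetitive access pattern" could mean, the sketch below checks whether a list of blocks has equal lengths and evenly spaced displacements. The function name and its arguments are hypothetical (this is not an MPI or Open MPI call); it only shows the kind of test one might run over the arrays before handing them to a struct constructor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of MPI or Open MPI.
 * Returns true if all n blocks have the same length and their
 * displacements are evenly spaced; on success *stride receives
 * the spacing between consecutive blocks. */
static bool is_regular_layout(const ptrdiff_t *disp, const int *blocklen,
                              size_t n, ptrdiff_t *stride)
{
    assert(disp != NULL && blocklen != NULL && stride != NULL);
    if (n < 2) {
        return false;  /* too few blocks to form a pattern */
    }
    ptrdiff_t step = disp[1] - disp[0];
    for (size_t i = 1; i < n; i++) {
        if (blocklen[i] != blocklen[0] || disp[i] - disp[i - 1] != step) {
            return false;
        }
    }
    *stride = step;
    return true;
}
```

When the check succeeds and all blocks share one base type, the layout is a candidate for a single strided description (for example via MPI_Type_vector or MPI_Type_create_hvector) instead of one struct entry per block.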


  Thanks,
    george.

On Apr 18, 2007, at 4:16 PM, Michael Gauckler wrote:

Dear Open-MPI Developers,
Investigations into the segmentation fault (see previous postings "Signal: Segmentation fault (11) Problem") lead us to suspect that Open-MPI allows only a limited number of elements in the description of user-defined MPI_Datatypes.
Our application segmentation-faults when a large user-defined data structure is passed to MPI_Send.
The segmentation fault happens in the function ompi_generic_simple_pack in datatype_pack.c when trying to access pElem (Bad address). The structure pElem is set in line 276, where it is retrieved as
276: pElem = &(description[pos_desc]);
pos_desc is of type uint32_t with the value 0xffff929f (4294939295), which itself is set on line 271 from a variable of type int16_t and value -1. This leads to the indexing of the description structure at position -1, producing the segmentation fault. The origin of pos_desc can be found in the same function at line 271:
271: pos_desc = pStack->index;
The structure to which pStack is pointing is of type dt_stack, defined in ompi/datatype/convertor.h starting at line 65, where index is an int16_t and is commented with "index in the element description":
typedef struct dt_stack {
    int16_t   index;    /**< index in the element description */
    int16_t   type;     /**< the type used for the last pack/unpack (original or DT_BYTE) */
    size_t    count;    /**< number of times we still have to do it */
    ptrdiff_t disp;     /**< actual displacement depending on the count field */
} dt_stack_t;
We therefore conclude that MPI_Datatypes constructed with Open-MPI (in release 1.2.1a of April 10th 2007) are limited to a maximum of 32,768 separate entries.
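The conversion chain described above can be reproduced in isolation. The following is a minimal sketch, not the Open MPI code: the two helper functions are hypothetical stand-ins for writing a position into a 16-bit `index` field and for the assignment `pos_desc = pStack->index;`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for storing an entry position in a signed
 * 16-bit field such as dt_stack.index. Values above INT16_MAX do not
 * fit; the result is implementation-defined in C, and common compilers
 * on two's-complement platforms wrap it to a negative number. */
static int16_t store_index(uint32_t position)
{
    return (int16_t)position;
}

/* Hypothetical stand-in for `pos_desc = pStack->index;` with pos_desc
 * declared as uint32_t. This conversion is well-defined: a negative
 * value is reduced modulo 2^32 and becomes a very large index. */
static uint32_t load_pos_desc(int16_t index)
{
    return (uint32_t)index;
}
```

With these helpers, a 16-bit index of -1 loads as 0xffffffff. The value 0xffff929f quoted above is what this conversion yields for a 16-bit index of -28001, so the exact stored value may be worth re-checking in the debugger.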
Although changing the type of index to int32_t resolves the segmentation fault, I would be happy if the author / maintainer of the code could have a look at it and decide whether this is a viable fix. Having spent a lot of time hunting the issue down in the Open-MPI code, I would be glad to see it fixed in upcoming releases.
Thanx and regards,
Michael Gauckler



