Michael,

Based on the image you sent, your data-type looks gigantic. There are 750K predefined type descriptions in your data-type, for a size of 12MB and an extent of 68MB. The data-type engine managed to optimize your description down to 540K predefined type descriptions, which is still way beyond the int16_t maximum of 32767.

I'm working on a solution for this problem. I'll post back on the mailing list once this is done.

Thanks,
  george.

On Apr 19, 2007, at 4:34 AM, Michael Gauckler (mailing lists) wrote:

Hi George,

Thank you for the prompt reply. Indeed, we are constructing a data-type description with more than 32K entries. I attached a screenshot of the pData structure (displayed with the TotalView debugger); I hope this helps you. Unfortunately, I was not able to use gdb to execute the call you mentioned.

Let me explain the relation of our code to the BOOST libraries. The code I'm debugging at the moment does not use any BOOST library to interface with MPI, but it uses the same ideas for automatically creating the data-types as the BOOST Parallel/Message Passing/MPI library [1]. This is because that library is based on our ideas and on the goal of factoring out our message passing code into an open-source library (see [2]).

Even though such an automatically created data-type description might not lead to optimal performance, I think large descriptors should be supported for several reasons:

- even when using MPI, not all parts of the code are performance critical
- other MPI implementations support it
- passing large/complicated data structures with the BOOST Parallel/Message Passing/MPI library (which supports LAM, MPICH and Open-MPI out of the box, see [3]) will probably lead to the same effect
- the fix has minor to no impact on the rest of the code base; at the very least, appropriate error handling would be expected in the case of a too-large data-type descriptor

I hope that we are now sure that we have identified the problem as well as the solution, and that you are willing to fix the issue in upcoming releases of Open-MPI.

If there is anything else I can help with, please let me know.

Regards,
Michael Gauckler

[1] http://lists.boost.org/Archives/boost/2007/01/115347.php
[2] http://lists.boost.org/boost-announce/2006/09/0099.php
[3] http://www.generic-programming.org/~dgregor/boost.mpi/boost/parallel/mpi/

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of George Bosilca
Sent: Thursday, April 19, 2007 12:15 AM
To: Open MPI Users
Subject: Re: [OMPI users] Datatype construction, serious limitation (was: Signal: Segmentation fault (11) Problem)

I am the developer and the maintainer of the data-type engine in Open MPI. And I'm stunned (!). It never occurred to me that someone would ever use a data-type description that needs more than 32K entries on the internal stack.

Let me explain a little bit. The stack is used to efficiently parse the data-type description. The 32K limit is not a limit on the number of predefined MPI types in the data-type, but a limit on the number of different data descriptions (a description is like a vector of a predefined type). As an example, an MPI_Type_struct with count 10 will use 11 entries. So in order to overflow this data description one has to use an MPI_Type_struct with a count bigger than 32K (which might be the case with the BOOST library you're using in your code).

In conclusion, if your data-type description contains more than 32K entries, the current implementation will definitely not work for you. How many entries are in your data-type description?

There is an easy way to figure out if this is the problem with your code. Attaching gdb to your process and setting a breakpoint in the ompi_generic_simple_pack function is the first step. Once there, running "call ompi_ddt_dump(pData)" in gdb will print a high-level description of the data as represented internally in Open MPI.
If you can provide the output of this call, I can tell you in a few seconds whether this is the real issue or not.

However, this raises another question about the performance you expect from your code. A data description with more than 32K items cannot be efficiently optimized by any automatic data-type engine. Moreover, it cannot be easily parsed. I suggest that, if it's possible to identify access patterns that are repetitive, one should use them in order to improve the data-type description.

Thanks,
  george.

On Apr 18, 2007, at 4:16 PM, Michael Gauckler wrote:

Dear Open-MPI Developers,

Investigations into the segmentation fault (see previous postings "Signal: Segmentation fault (11) Problem") lead us to suspect that Open-MPI allows only a limited number of elements in the description of user-defined MPI_Datatypes. Our application segmentation-faults when a large user-defined data structure is passed to MPI_Send.

The segmentation fault happens in the function ompi_generic_simple_pack in datatype_pack.c when trying to access pElem (Bad address). The structure pElem is set in line 276, where it is retrieved as

    276: pElem = &(description[pos_desc]);

pos_desc is of type uint32_t with the value 0xffff929f (4294939295), which itself is set on line 271 from a variable of type int16_t and value -1. This leads to the indexing of the description structure at position -1, producing the segmentation fault.

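The conversion described here can be reproduced in isolation. The sketch below is not Open MPI code: store_index, load_pos_desc and show_wraparound are hypothetical helpers introduced only for illustration, and it assumes a common two's-complement platform where narrowing an out-of-range value to int16_t wraps modulo 65536 (this narrowing is implementation-defined in C). Under that assumption, the reported pos_desc value 0xffff929f is what a real index of 37535 turns into after passing through an int16_t (stored as -28001), whereas an int16_t holding -1 would widen to 0xffffffff.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: squeeze a real description index into a 16-bit
 * signed field. Out-of-range values wrap on typical platforms
 * (implementation-defined behaviour in C). */
static int16_t store_index(uint32_t real_index)
{
    return (int16_t)real_index;
}

/* Hypothetical helper mirroring "pos_desc = pStack->index": a negative
 * int16_t is converted modulo 2^32 and becomes a huge unsigned value. */
static uint32_t load_pos_desc(int16_t stored_index)
{
    return (uint32_t)stored_index;
}

/* Print the round trip for one index and return the resulting position. */
uint32_t show_wraparound(uint32_t real_index)
{
    int16_t stored = store_index(real_index);
    uint32_t pos_desc = load_pos_desc(stored);

    printf("real index %u -> int16_t %d -> pos_desc 0x%08x\n",
           (unsigned)real_index, (int)stored, (unsigned)pos_desc);

    /* Indices up to INT16_MAX must survive the round trip unchanged. */
    assert(real_index > (uint32_t)INT16_MAX || pos_desc == real_index);
    return pos_desc;
}
```

Calling show_wraparound(100) leaves the index intact, while show_wraparound(37535) yields a position far outside any realistic description array, which matches the kind of bad address seen in the debugger.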
The origin of pos_desc can be found in the same function at line 271:

    271: pos_desc = pStack->index;

The structure to which pStack points is of type dt_stack, defined in ompi/datatype/convertor.h starting at line 65, where index is an int16_t and commented with "index in the element description":

    typedef struct dt_stack {
        int16_t   index;  /**< index in the element description */
        int16_t   type;   /**< the type used for the last pack/unpack (original or DT_BYTE) */
        size_t    count;  /**< number of times we still have to do it */
        ptrdiff_t disp;   /**< actual displacement depending on the count field */
    } dt_stack_t;

We therefore conclude that MPI_Datatypes constructed with Open-MPI (in release 1.2.1a of April 10th, 2007) are limited to a maximum of 32'768 separate entries.

Although changing the type of the index to int32_t solves the problem of the segmentation fault, I would be happy if the author / maintainer of the code could have a look at it and decide whether this is a viable fix. Having spent a lot of time hunting down the issue in the Open-MPI code, I would be glad to see the issue fixed in upcoming releases.

Thanks and regards,
Michael Gauckler

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
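As a rough illustration of the two remedies discussed in this thread (widening the index, and at least reporting an error instead of wrapping), here is a minimal sketch. It is not the actual Open MPI fix: dt_stack_wide_t, dt_stack_set_index and index_fits_int16 are hypothetical names, the field layout simply copies the dt_stack struct quoted in the message, and 540000 in the usage note is only an illustrative figure in the range mentioned in the reply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the quoted dt_stack with a 32-bit index,
 * i.e. the change proposed in the message. */
typedef struct dt_stack_wide {
    int32_t   index;  /* index in the element description */
    int16_t   type;   /* type used for the last pack/unpack */
    size_t    count;  /* number of times we still have to do it */
    ptrdiff_t disp;   /* displacement depending on the count field */
} dt_stack_wide_t;

/* Hypothetical guard: store pos in the entry only if it fits the field.
 * Returns 0 on success and -1 if pos is too large for int32_t. */
int dt_stack_set_index(dt_stack_wide_t *entry, size_t pos)
{
    assert(entry != NULL);
    if (pos > (size_t)INT32_MAX) {
        return -1;
    }
    entry->index = (int32_t)pos;
    return 0;
}

/* The same kind of check for the original 16-bit field width, which
 * would let a caller report an error rather than wrap silently. */
int index_fits_int16(size_t pos)
{
    return pos <= (size_t)INT16_MAX;
}
```

With these helpers, index_fits_int16(540000) is false while dt_stack_set_index(&entry, 540000) succeeds, which is the difference between the 16-bit and 32-bit fields that the message points out.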