Hello Tom,

Actually I have already tried MPI_Type_create_hindexed, but the same
problem persisted for the same matrix dimensions.
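For reference, the hindexed variant I tried was along these lines (just
a sketch, reusing the variables from my original code quoted below; the
displacements become byte offsets of type MPI_Aint, normally 64-bit, so
the intermediate products stay out of 32-bit range even for this matrix
size):

    /* extent of double_complex_type: two contiguous doubles */
    MPI_Aint elem_extent = 2 * sizeof(double);
    MPI_Aint *send_displ_b =
        (MPI_Aint *) malloc(Nx * Ny/gsize * sizeof(MPI_Aint));

    for (i = Ny/gsize - 1; i >= 0; i--) {
        for (j = 0; j < Nx; j++) {
            /* same layout as before, but expressed in bytes; the casts
               to MPI_Aint avoid 32-bit overflow in the products */
            send_displ_b[(Ny/gsize - 1 - i) * Nx + j] =
                ((MPI_Aint) i * Nz + (MPI_Aint) j * Ny * Nz) * elem_extent;
        }
    }

    MPI_Type_create_hindexed(Nx * Ny/gsize, send_blocklen, send_displ_b,
                             double_complex_type, &block_send_complex_type);
    MPI_Type_commit(&block_send_complex_type);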
The displacement values themselves are not the problem. A matrix of
size 800x640x480 creates a type that is a bit less than 4 GB in the
complex-datatype case, and the element displacements passed to
MPI_Type_indexed definitely fit in the 32-bit range. So it is not a
32/64-bit issue, at least not for the displacement values in this case.

Regards,

----
Bogdan Sataric

email: bogdan.sata...@gmail.com
phone: +381 21-485-2441

Teaching & Research Assistant
Chair for Applied Computer Science
Faculty of Technical Sciences, Novi Sad, Serbia


On Thu, Mar 5, 2015 at 8:13 PM, Tom Rosmond <rosm...@reachone.com> wrote:
> Actually, you are not the first to encounter the problem with
> 'MPI_Type_indexed' for very large datatypes. I also run a 1.6
> release, and solved the problem by switching to
> 'MPI_Type_create_hindexed' for the datatype. The critical difference
> is that the displacements for 'MPI_Type_indexed' are of type integer,
> i.e. 32-bit values, while for 'MPI_Type_create_hindexed' the
> displacements are in bytes, of kind 'MPI_ADDRESS_KIND' (MPI_Aint in
> C), i.e. normally 64-bit, and therefore of effectively unlimited
> range. Otherwise the two types can be used identically.
>
> T. Rosmond
>
>
> On Thu, 2015-03-05 at 12:31 -0500, George Bosilca wrote:
> > Bogdan,
> >
> > As far as I can tell your code is correct, and the problem is
> > coming from Open MPI. More specifically, I used alloca in the
> > optimization stage of MPI_Type_commit, and as your arrays of
> > lengths were too large, alloca failed and led to a segfault. I
> > fixed this in the trunk (3c489ea), and the fix will get into our
> > next release.
> >
> > Unfortunately there is no fix for the 1.6 series that I can think
> > of. Apparently you really are the first to run into this kind of
> > problem, so I guess you are the first to create gigantic datatypes.
> >
> > Thanks for the bug report,
> > George.
> >
> >
> > On Thu, Mar 5, 2015 at 9:09 AM, Bogdan Sataric
> > <bogdan.sata...@gmail.com> wrote:
> >
> > I've been having problems with my 3D matrix transpose program. I'm
> > using MPI_Type_indexed in order to align the specific blocks that
> > I want to send and receive across one or more nodes of a cluster.
> > Up until a few days ago I was able to run my program without any
> > errors. However, several test cases on the cluster in the last few
> > days exposed a segmentation fault when I try to form the indexed
> > type for some specific matrix configurations.
> >
> > The code that forms the indexed type is as follows:
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <mpi.h>
> >
> > int main(int argc, char **argv) {
> >     int Nx = 800;
> >     int Ny = 640;
> >     int Nz = 480;
> >     int gsize;
> >     int i, j;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_size(MPI_COMM_WORLD, &gsize);
> >
> >     printf("GSIZE: %d\n", gsize);
> >
> >     MPI_Datatype double_complex_type;
> >     MPI_Datatype block_send_complex_type;
> >
> >     int *send_displ    = (int *) malloc(Nx * Ny/gsize * sizeof(int));
> >     int *send_blocklen = (int *) malloc(Nx * Ny/gsize * sizeof(int));
> >
> >     /* one complex value = 2 contiguous doubles */
> >     MPI_Type_contiguous(2, MPI_DOUBLE, &double_complex_type);
> >     MPI_Type_commit(&double_complex_type);
> >
> >     /* one block per Nz "rod"; displacements are in units of
> >        double_complex_type */
> >     for (i = Ny/gsize - 1; i >= 0; i--) {
> >         for (j = 0; j < Nx; j++) {
> >             send_displ[(Ny/gsize - 1 - i) * Nx + j] = i * Nz + j * Ny * Nz;
> >             send_blocklen[(Ny/gsize - 1 - i) * Nx + j] = Nz;
> >         }
> >     }
> >
> >     MPI_Type_indexed(Nx * Ny/gsize, send_blocklen, send_displ,
> >                      double_complex_type, &block_send_complex_type);
> >     MPI_Type_commit(&block_send_complex_type);   /* segfaults here */
> >
> >     free(send_displ);
> >     free(send_blocklen);
> >
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > The values of Nx, Ny and Nz are 800, 640 and 480, respectively. The
> > value of gsize for this test was 1 (simulating the MPI program on a
> > single node). The node has 32 GB of RAM and no other memory was
> > allocated (only this code was run).
> >
> > In the code I basically create double_complex_type to represent
> > complex-number values (2 contiguous MPI_DOUBLEs). The whole matrix
> > holds 800 * 640 * 480 of these values and I am trying to capture
> > all of them in the indexed type. Each indexed-type block is a whole
> > Nz "rod", and the ordering of these rods in the displacements array
> > is given by the formula i * Nz + j * Ny * Nz. The displacements
> > start from the top row and left column of the 3D matrix; I then
> > gradually sweep to the right side of that top row, go one row down,
> > sweep to the right side again, and so on until the bottom row.
> >
> > The strange thing is that this formula and algorithm WORK if I use
> > the plain MPI_DOUBLE type instead of the derived complex type (1
> > instead of 2 in MPI_Type_contiguous). The formula also WORKS if I
> > set the Nz dimension to 1 instead of 480. However, if I change Nz
> > to even 2, I get a segmentation fault in the MPI_Type_commit call.
> >
> > I checked all of the displacements and they seem fine. There is no
> > overlap between displacements, and none goes below 0 or beyond the
> > extent of the formed indexed type. Also, the size of the datatype
> > is below 4 GB, which I believe is the limit for MPI datatypes,
> > since MPI_Type_size reports the size through an int *. And I
> > believe the amount of memory is not an issue either: even if I set
> > Nz to 2 I get the same segmentation fault, and the node has 32 GB
> > of RAM just for this test.
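> >
> > (As a small illustration that the formula covers the matrix exactly
> > once: with Nx = 2, Ny = 2, Nz = 3 and gsize = 1 it yields the
> > displacements 3, 9, 0, 6 with a block length of 3 everywhere, i.e.
> > the four "rods" cover all 12 elements exactly once, just in a
> > non-monotonic order.)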
> >
> > What bothers me is that most other indexed-type configurations
> > (with plain MPI_DOUBLE elements), or the complex type with a
> > smaller matrix (say 400 * 640 * 480), WORK without a segmentation
> > fault. And if I commit the indexed type with plain MPI_DOUBLE
> > elements, even larger matrices work (say 960 x 800 x 640) that have
> > exactly the same type size as the 800 x 640 x 480 complex indexed
> > type (just under 4 GB)! So basically the type size is not the issue
> > here; somehow the number of blocks, the size of particular blocks,
> > or the size of the block elements creates the problem. I am not
> > sure whether the problem is in the Open MPI implementation or
> > whether something in my code is wrong...
> >
> > I would greatly appreciate any help, as I have been stuck on this
> > problem for days now and nothing in the MPI documentation or in the
> > examples I found on the internet gives me a clue where the error
> > might be.
> >
> > Finally, I should mention that the code was compiled with Open MPI
> > version 1.6.5.
> >
> > Thank you,
> >
> > Bogdan Sataric
> >
> > ----
> > email: bogdan.sata...@gmail.com
> > phone: +381 21-485-2441
> >
> > Teaching & Research Assistant
> > Chair for Applied Computer Science
> > Faculty of Technical Sciences, Novi Sad, Serbia