Hello George,

So is it safe for me to assume that my code is good and that you will remove this bug from the next Open MPI version?
Also, I would like to know which future Open MPI version will incorporate this fix (so I can try my code with the fixed version)?

Thank you,

----
Bogdan Sataric

email: bogdan.sata...@gmail.com
phone: +381 21-485-2441

Teaching & Research Assistant
Chair for Applied Computer Science
Faculty of Technical Sciences, Novi Sad, Serbia


On Thu, Mar 5, 2015 at 6:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Bogdan,
>
> As far as I can tell your code is correct, and the problem is coming from
> Open MPI. More specifically, I used alloca in the optimization stage of
> MPI_Type_commit, and as your arrays of lengths were too large, alloca
> failed and led to a segfault. I fixed this in the trunk (3c489ea), and the
> fix will get into our next release.
>
> Unfortunately, there is no fix for the 1.6 series that I can think of.
> Apparently you are really the first to run into this kind of problem, so I
> guess you are the first to create such gigantic datatypes.
>
> Thanks for the bug report,
>   George.
>
>
> On Thu, Mar 5, 2015 at 9:09 AM, Bogdan Sataric <bogdan.sata...@gmail.com>
> wrote:
>
>> I've been having problems with my 3D matrix transpose program. I'm using
>> MPI_Type_indexed in order to align the specific blocks that I want to send
>> and receive across one or more nodes of a cluster. Up to a few days ago I
>> was able to run my program without any errors. However, several test cases
>> on the cluster in the last few days exposed a segmentation fault when I
>> try to form the indexed type for some specific matrix configurations.
>>
>> The code that forms the indexed type is as follows:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv) {
>>
>>     int Nx = 800;
>>     int Ny = 640;
>>     int Nz = 480;
>>     int gsize;
>>     int i, j;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &gsize);
>>
>>     printf("GSIZE: %d\n", gsize);
>>
>>     MPI_Datatype double_complex_type;
>>     MPI_Datatype block_send_complex_type;
>>
>>     /* one entry per Nz "rod": Nx * Ny/gsize blocks in total */
>>     int *send_displ    = (int *) malloc(Nx * Ny/gsize * sizeof(int));
>>     int *send_blocklen = (int *) malloc(Nx * Ny/gsize * sizeof(int));
>>
>>     /* a complex value = 2 contiguous MPI_DOUBLEs */
>>     MPI_Type_contiguous(2, MPI_DOUBLE, &double_complex_type);
>>     MPI_Type_commit(&double_complex_type);
>>
>>     for (i = Ny/gsize - 1; i >= 0; i--) {
>>         for (j = 0; j < Nx; j++) {
>>             send_displ[(Ny/gsize - 1 - i) * Nx + j] = i * Nz + j * Ny * Nz;
>>             send_blocklen[(Ny/gsize - 1 - i) * Nx + j] = Nz;
>>         }
>>     }
>>
>>     MPI_Type_indexed(Nx * Ny/gsize, send_blocklen, send_displ,
>>                      double_complex_type, &block_send_complex_type);
>>     MPI_Type_commit(&block_send_complex_type);
>>
>>     free(send_displ);
>>     free(send_blocklen);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> The values of Nx, Ny and Nz are 800, 640 and 480 respectively. The value
>> of gsize for this test was 1 (simulating the MPI program on 1 node). The
>> node has 32GB of RAM and no other memory has been allocated (only this
>> code has been run).
>>
>> In the code I basically create double_complex_type to represent complex
>> number values (2 contiguous MPI_DOUBLEs). The whole matrix has 800 * 640 *
>> 480 of these values and I'm trying to capture them in the indexed type.
>> Each indexed-type block is a whole Nz "rod", while the ordering of these
>> "rods" in the displacements array is given by the formula i * Nz + j * Ny
>> * Nz. Basically, the displacements start from the top row and left column
>> of the 3D matrix. Then I gradually sweep to the right side of that top
>> row, then go one row below, sweep to the right side again, and so on until
>> the bottom row.
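The rod ordering described above is easiest to see on a toy grid. The following standalone sketch (the tiny Nx/Ny/Nz values are chosen here purely for illustration and are not taken from the report) prints each block's displacement in the order the two loops generate them:

#include <stdio.h>

/* Reduced illustration of the displacement layout described above.
 * The tiny dimensions are for readability only; the report uses
 * Nx = 800, Ny = 640, Nz = 480 and gsize = 1. */
int main(void)
{
    int Nx = 3, Ny = 2, Nz = 4, gsize = 1;
    int i, j;

    for (i = Ny/gsize - 1; i >= 0; i--) {        /* start from the top row */
        for (j = 0; j < Nx; j++) {               /* sweep left to right */
            int block = (Ny/gsize - 1 - i) * Nx + j;
            int displ = i * Nz + j * Ny * Nz;    /* start of one Nz "rod" */
            printf("block %2d: displ = %3d, blocklen = %d\n", block, displ, Nz);
        }
    }
    return 0;
}

Each printed displacement is expressed in units of the base element (double_complex_type), which is how MPI_Type_indexed interprets the displacement array.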
>>
>> The strange thing is that this formula and algorithm *WORK* if I use the
>> plain MPI_DOUBLE type instead of the derived complex type (1 instead of 2
>> in MPI_Type_contiguous). The formula also *WORKS* if I set the Nz
>> dimension to 1 instead of 480. However, if I change Nz to even 2 I get a
>> segmentation fault in the MPI_Type_commit call.
>>
>> I checked all of the displacements and they seem fine. There is no
>> overlapping of displacements, and none go below 0 or beyond the extent of
>> the formed indexed type. Also, the size of the datatype is below 4GB,
>> which I believe is the limit for MPI datatypes (since the MPI_Type_size
>> function reports the size through an int *). I also believe the amount of
>> memory is not an issue: even if I set Nz to 2 I get the same segmentation
>> fault, and the node has 32GB of RAM just for this test.
>>
>> What bothers me is that most other indexed type configurations (with
>> plain MPI_DOUBLE elements), or the complex type with a smaller matrix
>> (say 400 * 640 * 480), *WORK* without a segmentation fault. Also, if I
>> commit the indexed type with MPI_DOUBLE elements, even larger matrices
>> work (say 960 x 800 x 640), which have exactly the same type size as the
>> 800 x 640 x 480 complex indexed type (just under 4GB)! So the type size
>> is not the issue here; somehow either the number of blocks, the size of
>> particular blocks, or the size of the block elements creates the problem.
>> I'm not sure whether the problem is in the Open MPI implementation or
>> something in my code is wrong...
>>
>> I would greatly appreciate any help, as I've been stuck on this problem
>> for days now and nothing in the MPI documentation or the examples I found
>> on the internet is giving me a clue where the error might be.
>>
>> Finally, I would like to mention that the code has been compiled with
>> Open MPI version 1.6.5.
>>
>> Thank you,
>>
>> Bogdan Sataric
>> ----
>>
>> Bogdan Sataric
>>
>> email: bogdan.sata...@gmail.com
>> phone: +381 21-485-2441
>>
>> Teaching & Research Assistant
>> Chair for Applied Computer Science
>> Faculty of Technical Sciences, Novi Sad, Serbia
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/03/26430.php
>>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/03/26431.php
>
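George attributes the crash to an alloca of a large temporary array inside MPI_Type_commit's optimization stage. Purely as a generic illustration of that failure mode (a hypothetical sketch, not the actual Open MPI code path), a huge stack allocation can silently overflow the stack and segfault on first use, while a heap allocation of the same size is only bounded by available memory and its failure can be checked:

#include <alloca.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only -- NOT the actual Open MPI internals.
 * A very large alloca() lands on the stack (typically limited to a few MB),
 * so the first write into it can segfault; the equivalent malloc() draws
 * from the heap and its failure can be detected. */

static void stack_temp(size_t count)
{
    int *tmp = alloca(count * sizeof(int));   /* may silently overflow the stack */
    memset(tmp, 0, count * sizeof(int));      /* a crash would happen here */
}

static int heap_temp(size_t count)
{
    int *tmp = malloc(count * sizeof(int));   /* heap allocation, checkable */
    if (tmp == NULL)
        return -1;
    memset(tmp, 0, count * sizeof(int));
    free(tmp);
    return 0;
}

int main(int argc, char **argv)
{
    (void)argv;
    size_t n = 64UL * 1024 * 1024;     /* 256 MB of temporaries: far beyond a typical 8 MB stack */
    if (argc > 1)
        stack_temp(n);                 /* run with any argument to provoke the stack variant */
    return heap_temp(n) == 0 ? 0 : 1;  /* default path: the heap variant succeeds */
}

This is the general reason an alloca-based temporary can work for small datatype descriptions but fail once the number of blocks grows large, as in the 800 x 640 x 480 case above.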