On Thu, Mar 5, 2015 at 6:22 PM, Bogdan Sataric <bogdan.sata...@gmail.com> wrote:
> Hello George,
>
> So is it safe for me to assume that my code is good and that you will
> remove this bug from the next Open MPI version?

Yes, I think it is safe to assume your code is correct (or at least that it
follows the specification you describe in your email).

> Also I would like to know which future Open MPI version will incorporate
> this fix (so I can try my code in the fixed version)?

I pushed the code into the trunk, and created a request to get it into
1.8.5. So you can try any nightly build starting from tonight, and then any
stable release after 1.8.4.

  George.

> Thank you,
>
> ----
>
> Bogdan Sataric
>
> email: bogdan.sata...@gmail.com
> phone: +381 21-485-2441
>
> Teaching & Research Assistant
> Chair for Applied Computer Science
> Faculty of Technical Sciences, Novi Sad, Serbia
>
> On Thu, Mar 5, 2015 at 6:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> Bogdan,
>>
>> As far as I can tell your code is correct, and the problem is coming from
>> Open MPI. More specifically, I used alloca in the optimization stage of
>> MPI_Type_commit, and as your arrays of lengths were too large, the alloca
>> failed and led to a segfault. I fixed this in the trunk (3c489ea), and it
>> will get into our next release.
>>
>> Unfortunately there is no fix for the 1.6 series that I can think of.
>> Apparently you are really the first to run into this kind of problem, so
>> I guess you are the first to create such gigantic datatypes.
>>
>> Thanks for the bug report,
>>   George.
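For illustration, a minimal sketch of the failure mode George describes;
this is hypothetical code, not Open MPI's actual implementation. alloca
carves its buffer out of the call stack, which is typically capped around
8 MB, so an over-large request overflows the stack and segfaults with no
error return to test, while malloc draws from the heap and reports failure
by returning NULL. The function names and sizes here are made up for the
sketch:

    #include <alloca.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void fill_with_alloca(size_t count) {
        /* alloca takes the space from the call stack; a request beyond
         * the stack limit overflows it, and there is nothing to check --
         * the first touch of the memory segfaults. */
        int *tmp = alloca(count * sizeof(int));
        memset(tmp, 0, count * sizeof(int));
    }

    static int fill_with_malloc(size_t count) {
        /* malloc takes the space from the heap and reports failure. */
        int *tmp = malloc(count * sizeof(int));
        if (tmp == NULL)
            return -1;
        memset(tmp, 0, count * sizeof(int));
        free(tmp);
        return 0;
    }

    int main(void) {
        /* 4M ints = 16 MB: more than a typical 8 MB stack limit. */
        size_t count = (size_t)4 * 1024 * 1024;
        if (fill_with_malloc(count) == 0)
            printf("malloc path: ok\n");
        fill_with_alloca(count);   /* likely segfaults here */
        printf("alloca path: survived\n");
        return 0;
    }

This is why the crash depends on the number of blocks in the datatype
rather than on its total size: the temporary arrays scale with the block
count, and past some threshold the stack allocation fails.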
>> On Thu, Mar 5, 2015 at 9:09 AM, Bogdan Sataric <bogdan.sata...@gmail.com> wrote:
>>
>>> I've been having problems with my 3D matrix transpose program. I'm using
>>> MPI_Type_indexed in order to align the specific blocks that I want to
>>> send and receive across one or multiple nodes of a cluster. Up to a few
>>> days ago I was able to run my program without any errors. However,
>>> several test cases on the cluster in the last few days exposed a
>>> segmentation fault when I try to form the indexed type on some specific
>>> matrix configurations.
>>>
>>> The code that forms the indexed type is as follows:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv) {
>>>
>>>     int Nx = 800;
>>>     int Ny = 640;
>>>     int Nz = 480;
>>>     int gsize;
>>>     int i, j;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &gsize);
>>>
>>>     printf("GSIZE: %d\n", gsize);
>>>
>>>     MPI_Datatype double_complex_type;
>>>     MPI_Datatype block_send_complex_type;
>>>
>>>     int *send_displ = (int *) malloc(Nx * Ny/gsize * sizeof(int));
>>>     int *send_blocklen = (int *) malloc(Nx * Ny/gsize * sizeof(int));
>>>
>>>     MPI_Type_contiguous(2, MPI_DOUBLE, &double_complex_type);
>>>     MPI_Type_commit(&double_complex_type);
>>>
>>>     for (i = Ny/gsize - 1; i >= 0; i--) {
>>>         for (j = 0; j < Nx; j++) {
>>>             send_displ[(Ny/gsize - 1 - i) * Nx + j] = i * Nz + j * Ny * Nz;
>>>             send_blocklen[(Ny/gsize - 1 - i) * Nx + j] = Nz;
>>>         }
>>>     }
>>>
>>>     MPI_Type_indexed(Nx * Ny/gsize, send_blocklen, send_displ,
>>>                      double_complex_type, &block_send_complex_type);
>>>     MPI_Type_commit(&block_send_complex_type);
>>>
>>>     free(send_displ);
>>>     free(send_blocklen);
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>> The values of Nx, Ny and Nz are 800, 640 and 480 respectively. The value
>>> of gsize for this test was 1 (simulating the MPI program on 1 node). The
>>> node has 32GB of RAM and no other memory has been allocated (only this
>>> code has been run).
>>>
>>> In this code I'm creating double_complex_type to represent complex-number
>>> values (2 contiguous MPI_DOUBLEs). The whole matrix has 800 * 640 * 480
>>> of these values, and I'm trying to capture them in the indexed type. Each
>>> indexed-type block is a whole Nz "rod", while the ordering of these
>>> "rods" in the displacements array is given by the formula
>>> i * Nz + j * Ny * Nz. The displacements start from the top row and left
>>> column of the 3D matrix. Then I gradually sweep to the right side of that
>>> top row, then go one row below, sweep to the right side, and so on until
>>> the bottom row.
>>>
>>> The strange thing is that this formula and algorithm *WORK* if I put the
>>> MPI_DOUBLE type instead of the derived complex type (1 instead of 2 in
>>> MPI_Type_contiguous). The formula also *WORKS* if I put 1 for the Nz
>>> dimension instead of 480. However, if I change Nz to even 2, I get a
>>> segmentation fault in the MPI_Type_commit call.
>>>
>>> I checked all of the displacements and they seem fine. There is no
>>> overlapping of displacements, and none goes below 0 or beyond the extent
>>> of the formed indexed type. Also, the size of the datatype is below 4GB,
>>> which I believe is the limit for MPI datatypes (since MPI_Type_size
>>> returns the size through an int *). And I believe the amount of memory is
>>> not an issue: even if I set Nz to 2 I get the same segmentation fault,
>>> and the node has 32GB of RAM just for this test.
>>>
>>> What bothers me is that most other indexed-type configurations (with
>>> plain MPI_DOUBLE elements), or the complex type with a smaller matrix
>>> (say 400 * 640 * 480), *WORK* without a segmentation fault. Also, if I
>>> commit the indexed type with MPI_DOUBLE elements, even larger matrices
>>> work (say 960 x 800 x 640), which have exactly the same type size as the
>>> 800 x 640 x 480 complex indexed type (just under 4GB)! So the type size
>>> is not the issue here; somehow either the number of blocks, the size of
>>> particular blocks, or the size of the block elements creates the problem.
>>> I'm not sure whether the problem is in the Open MPI implementation or
>>> something in my code is wrong...
>>>
>>> I would greatly appreciate any help, as I've been stuck on this problem
>>> for days now, and nothing in the MPI documentation or the examples I
>>> found on the internet has given me a clue where the error might be.
>>>
>>> Finally, the code was compiled with Open MPI version 1.6.5.
>>>
>>> Thank you,
>>>
>>> Bogdan Sataric
>>>
>>> ----
>>>
>>> Bogdan Sataric
>>>
>>> email: bogdan.sata...@gmail.com
>>> phone: +381 21-485-2441
>>>
>>> Teaching & Research Assistant
>>> Chair for Applied Computer Science
>>> Faculty of Technical Sciences, Novi Sad, Serbia
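A closing note on the 4GB estimate in the report: 800 * 640 * 480 =
245,760,000 complex elements at 16 bytes each is 3,932,160,000 bytes, just
under 2^32, which supports the observation that the type size itself is not
the problem. A minimal sketch, using only standard MPI calls, of how the
size and extent of such a committed type can be checked; the element type
mirrors the report, the rest is illustrative:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Same 16-byte complex element as in the report. */
        MPI_Datatype double_complex_type;
        MPI_Type_contiguous(2, MPI_DOUBLE, &double_complex_type);
        MPI_Type_commit(&double_complex_type);

        int size;
        MPI_Aint lb, extent;
        MPI_Type_size(double_complex_type, &size);              /* 16 */
        MPI_Type_get_extent(double_complex_type, &lb, &extent); /* 0, 16 */
        printf("element: size=%d bytes, extent=%ld\n", size, (long)extent);

        /* 800 * 640 * 480 elements * 16 bytes = 3,932,160,000 < 2^32,
         * so the full-matrix type is just under the 4GB that fits in
         * the int that MPI_Type_size reports through. */
        long long total = 800LL * 640 * 480 * size;
        printf("full matrix: %lld bytes\n", total);

        MPI_Type_free(&double_complex_type);
        MPI_Finalize();
        return 0;
    }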