Hello George,

So is it safe for me to assume that my code is correct and that this bug
will be removed from the next Open MPI version?

I would also like to know which future Open MPI version will incorporate
the fix, so that I can try my code against that fixed version.

Thank you,

----

Bogdan Sataric

email: bogdan.sata...@gmail.com
phone: +381 21-485-2441

Teaching & Research Assistant
Chair for Applied Computer Science
Faculty of Technical Sciences, Novi Sad, Serbia

On Thu, Mar 5, 2015 at 6:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Bogdan,
>
> As far as I can tell your code is correct, and the problem is coming from
> Open MPI. More specifically, I used alloca in the optimization stage of
> MPI_Type_commit, and as your arrays were so large, alloca failed and led
> to a segfault. I fixed this in the trunk (3c489ea), and the fix will get
> into our next release.
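>
> For illustration only (this is not the actual Open MPI code), here is a
> minimal sketch of why a very large alloca can segfault where a heap
> allocation merely fails: alloca takes memory from the call stack, which is
> typically limited to a few megabytes, while malloc draws from the heap and
> reports failure by returning NULL.
>
> #include <alloca.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> /* Large automatic allocation: can silently overflow the stack and
>  * crash with a segfault the first time the memory is touched. */
> static void stack_alloc_demo(size_t n_entries)
> {
>     int *tmp = alloca(n_entries * sizeof(int));
>     memset(tmp, 0, n_entries * sizeof(int));
> }
>
> /* Heap allocation of the same size: failure is reported, not fatal. */
> static int heap_alloc_demo(size_t n_entries)
> {
>     int *tmp = malloc(n_entries * sizeof(int));
>     if (tmp == NULL)
>         return -1;
>     memset(tmp, 0, n_entries * sizeof(int));
>     free(tmp);
>     return 0;
> }
>
> int main(void)
> {
>     printf("heap demo: %d\n", heap_alloc_demo(100u * 1024 * 1024));
>     stack_alloc_demo(100u * 1024 * 1024);   /* likely segfaults here */
>     return 0;
> }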
>
> Unfortunately, there is no fix for the 1.6 series that I can think of.
> Apparently you are really the first to run into this kind of problem, so I
> guess you are the first to create such gigantic datatypes.
>
> Thanks for the bug report,
>   George.
>
>
> On Thu, Mar 5, 2015 at 9:09 AM, Bogdan Sataric <bogdan.sata...@gmail.com>
> wrote:
>
>> I've been having problems with my 3D matrix transpose program. I'm using
>> MPI_Type_indexed in order to align the specific blocks that I want to send
>> and receive across one or multiple nodes of a cluster. Up until a few days
>> ago I was able to run my program without any errors. However, several test
>> cases on the cluster in the last few days have exposed a segmentation
>> fault when I try to form the indexed type for some specific matrix
>> configurations.
>>
>> The code that forms the indexed type is as follows:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv) {
>>
>>     int Nx = 800;
>>     int Ny = 640;
>>     int Nz = 480;
>>     int gsize;
>>     int i, j;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &gsize);
>>
>>     printf("GSIZE: %d\n", gsize);
>>
>>     MPI_Datatype double_complex_type;
>>     MPI_Datatype block_send_complex_type;
>>
>>     int * send_displ = (int *) malloc(Nx * Ny/gsize * sizeof(int));
>>     int * send_blocklen = (int *) malloc(Nx * Ny/gsize * sizeof(int));
>>
>>     MPI_Type_contiguous(2, MPI_DOUBLE, &double_complex_type);
>>     MPI_Type_commit(&double_complex_type);
>>
>>     for (i = Ny/gsize - 1; i >= 0 ; i--) {
>>         for (j = 0; j < Nx; j++) {
>>             send_displ[(Ny/gsize - 1 - i) * Nx + j] =
>>                 i * Nz + j * Ny * Nz;
>>             send_blocklen[(Ny/gsize - 1 - i) * Nx + j] = Nz;
>>         }
>>     }
>>
>>     MPI_Type_indexed(Nx * Ny/gsize, send_blocklen, send_displ,
>>                      double_complex_type, &block_send_complex_type);
>>     MPI_Type_commit(&block_send_complex_type);
>>
>>     free(send_displ);
>>     free(send_blocklen);
>>
>>     MPI_Finalize();
>> }
>>
>> The values of Nx, Ny and Nz are 800, 640 and 480, respectively. The value
>> of gsize for this test was 1 (the MPI program simulated on a single node).
>> The node has 32 GB of RAM and no other memory was allocated (only this
>> code was run).
>>
>> In the code I create double_complex_type to represent complex-number
>> values (2 contiguous MPI_DOUBLEs). The whole matrix has 800 * 640 * 480 of
>> these values, and I am trying to capture them in the indexed type. Each
>> indexed block is one whole Nz "rod", while the ordering of these "rods" in
>> the displacement array is given by the formula i * Nz + j * Ny * Nz. The
>> displacements start at the top row and left column of the 3D matrix; I
>> then sweep to the right end of that top row, move one row down, sweep to
>> the right again, and so on until the bottom row.
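>>
>> As a sanity check of the formula (toy sizes, illustration only): with
>> Nx = 2, Ny = 2, Nz = 3 and gsize = 1, the loops above fill send_displ
>> with {3, 9, 0, 6} and every entry of send_blocklen with 3, which the
>> following standalone snippet prints:
>>
>> #include <stdio.h>
>>
>> int main(void) {
>>     const int Nx = 2, Ny = 2, Nz = 3, gsize = 1;
>>     int i, j;
>>
>>     /* Same index arithmetic as in the program above. */
>>     for (i = Ny/gsize - 1; i >= 0; i--)
>>         for (j = 0; j < Nx; j++)
>>             printf("entry %d: displ %d, blocklen %d\n",
>>                    (Ny/gsize - 1 - i) * Nx + j,
>>                    i * Nz + j * Ny * Nz, Nz);
>>     return 0;
>> }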
>>
>> The strange thing is that this formula and algorithm *WORK* if I use the
>> plain MPI_DOUBLE type instead of the derived complex type (1 instead of 2
>> in MPI_Type_contiguous). The formula also *WORKS* if I set the Nz
>> dimension to 1 instead of 480. However, if I change Nz to even 2, I get a
>> segmentation fault in the MPI_Type_commit call.
>>
>> I checked all of the displacements and they seem fine: there is no
>> overlap between displacements, and none go below 0 or beyond the extent
>> of the formed indexed type. The size of the datatype is also below 4 GB,
>> which I believe is the limit for MPI datatypes, since MPI_Type_size
>> returns the size through an int pointer. I also believe the amount of
>> memory is not an issue: even if I set Nz to 2 I get the same segmentation
>> fault, and the node has 32 GB of RAM available just for this test.
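>>
>> For reference, the size and extent could be checked with something like
>> the following sketch (assuming an MPI-3 library for MPI_Type_size_x,
>> which the 1.6 series does not provide; plain MPI_Type_size can only
>> report the size through an int):
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> /* Report a committed type's size and extent without int overflow. */
>> static void report_type_size(MPI_Datatype type)
>> {
>>     int int_size;
>>     MPI_Count true_size;
>>     MPI_Aint lb, extent;
>>
>>     MPI_Type_size(type, &int_size);       /* limited to int */
>>     MPI_Type_size_x(type, &true_size);    /* full 64-bit count */
>>     MPI_Type_get_extent(type, &lb, &extent);
>>     printf("size = %d, size_x = %lld, extent = %lld bytes\n",
>>            int_size, (long long) true_size, (long long) extent);
>> }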
>>
>> What bothers me is that most other indexed type configurations (with
>> plain MPI_DOUBLE elements), or the complex type with a smaller matrix
>> (say 400 * 640 * 480), *WORK* without a segmentation fault. Also, if I
>> commit the indexed type with MPI_DOUBLE elements, even larger matrices
>> work (say 960 x 800 x 640) that have exactly the same type size as the
>> 800 x 640 x 480 complex indexed type (just under 4 GB)! So the type size
>> is not the issue here; somehow either the number of blocks, the size of
>> particular blocks, or the size of the block elements creates the problem.
>> I'm not sure whether the problem is in the Open MPI implementation or
>> whether something in my code is wrong...
>>
>> I would greatly appreciate any help, as I've been stuck on this problem
>> for days now, and nothing in the MPI documentation or in the examples I
>> found on the internet gives me a clue where the error might be.
>>
>> Finally, I would like to mention that the code was compiled with Open MPI
>> version 1.6.5.
>>
>> Thank you,
>>
>> Bogdan Sataric
>> ----
>>
>> Bogdan Sataric
>>
>> email: bogdan.sata...@gmail.com
>> phone: +381 21-485-2441
>>
>> Teaching & Research Assistant
>> Chair for Applied Computer Science
>> Faculty of Technical Sciences, Novi Sad, Serbia
>>
