On Thu, Mar 5, 2015 at 6:22 PM, Bogdan Sataric <bogdan.sata...@gmail.com>
wrote:

> Hello George,
>
> So is it safe for me to assume that my code is good and that you will
> remove this bug in the next Open MPI version?
>

Yes, I think it is safe to assume that your code is correct (or at least
that it follows the specifications you describe in your email).

> Also, I would like to know which future Open MPI version will incorporate
> this fix (so I can try my code with the fixed version)?
>

I pushed the fix to the trunk and created a request to get it into 1.8.5.
So you can try any nightly build starting from tonight, and then any stable
release after 1.8.4.

  George.


>
> Thank you,
>
> ----
>
> Bogdan Sataric
>
> email: bogdan.sata...@gmail.com
> phone: +381 21-485-2441
>
> Teaching & Research Assistant
> Chair for Applied Computer Science
> Faculty of Technical Sciences, Novi Sad, Serbia
>
> On Thu, Mar 5, 2015 at 6:31 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> Bogdan,
>>
>> As far as I can tell your code is correct, and the problem is coming from
>> Open MPI. More specifically, I used alloca in the optimization stage of
>> MPI_Type_commit, and because your arrays of block lengths were so large,
>> alloca failed and led to a segfault. I fixed this in the trunk (3c489ea),
>> and the fix will get into our next release.
>>
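For anyone hitting the same crash on an older release: alloca cannot report
an allocation failure, so an oversized request simply overruns the stack.
The usual workaround is to fall back to the heap above some size threshold.
Below is a minimal sketch of that pattern; it is only an illustration, not
the actual Open MPI change in 3c489ea, and the threshold and function names
are invented.

#include <alloca.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: keep small scratch buffers on the stack, but fall
 * back to malloc when the request is large enough to blow the stack. */
#define SCRATCH_STACK_LIMIT (64 * 1024)   /* hypothetical threshold */

static int optimize_with_scratch(size_t count)
{
    size_t bytes = count * sizeof(int);
    int on_heap = bytes > SCRATCH_STACK_LIMIT;
    int *scratch;

    if (on_heap) {
        scratch = (int *) malloc(bytes);
        if (scratch == NULL)
            return -1;           /* malloc can fail gracefully; alloca cannot */
    } else {
        scratch = (int *) alloca(bytes);
    }

    memset(scratch, 0, bytes);   /* ... the real optimization work goes here ... */

    if (on_heap)
        free(scratch);
    return 0;
}

int main(void)
{
    /* Exercise both paths: a small request (stack) and a large one (heap). */
    return optimize_with_scratch(10) || optimize_with_scratch(100 * 1000 * 1000);
}
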
>> Unfortunately, there is no fix for the 1.6 series that I can think of.
>> Apparently, you really are the first to run into this kind of problem, so
>> I guess you are the first to create such gigantic datatypes.
>>
>> Thanks for the bug report,
>>   George.
>>
>>
>> On Thu, Mar 5, 2015 at 9:09 AM, Bogdan Sataric <bogdan.sata...@gmail.com>
>> wrote:
>>
>>> I've been having problems with my 3D matrix transpose program. I'm using
>>> MPI_Type_indexed in order to align the specific blocks that I want to send
>>> and receive across one or multiple nodes of a cluster. Up to a few days ago
>>> I was able to run my program without any errors. However, several test
>>> cases on the cluster in the last few days exposed a segmentation fault when
>>> I try to form the indexed type on some specific matrix configurations.
>>>
>>> The code that forms indexed type is as follows:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv) {
>>>
>>>     int Nx = 800;
>>>     int Ny = 640;
>>>     int Nz = 480;
>>>     int gsize;
>>>     int i, j;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &gsize);
>>>
>>>     printf("GSIZE: %d\n", gsize);
>>>
>>>     MPI_Datatype double_complex_type;
>>>     MPI_Datatype block_send_complex_type;
>>>
>>>     int * send_displ = (int *) malloc(Nx * Ny/gsize * sizeof(int));
>>>     int * send_blocklen = (int *) malloc(Nx * Ny/gsize * sizeof(int));
>>>
>>>     MPI_Type_contiguous(2, MPI_DOUBLE, &double_complex_type);
>>>     MPI_Type_commit(&double_complex_type);
>>>
>>>     for (i = Ny/gsize - 1; i >= 0 ; i--) {
>>>         for (j = 0; j < Nx; j++) {
>>>             send_displ[(Ny/gsize - 1 - i) * Nx + j] = i * Nz + j * Ny * Nz;
>>>             send_blocklen[(Ny/gsize - 1 - i) * Nx + j] = Nz;
>>>         }
>>>     }
>>>
>>>     MPI_Type_indexed(Nx * Ny/gsize, send_blocklen, send_displ,
>>>                      double_complex_type, &block_send_complex_type);
>>>     MPI_Type_commit(&block_send_complex_type);
>>>
>>>     free(send_displ);
>>>     free(send_blocklen);
>>>
>>>     MPI_Finalize();
>>> }
>>>
>>> The values of Nx, Ny, and Nz are 800, 640, and 480, respectively. The
>>> value of gsize for this test was 1 (simulating the MPI program on one
>>> node). The node has 32 GB of RAM and no other memory has been allocated
>>> (only this code has been run).
>>>
>>> In the code I'm creating double_complex_type to represent complex numbers
>>> (2 contiguous MPI_DOUBLE values). The whole matrix has 800 * 640 * 480 of
>>> these values, and I'm trying to capture all of them in the indexed type.
>>> Each block of the indexed type is a whole Nz "rod", while the ordering of
>>> these "rods" in the displacements array is given by the formula
>>> i * Nz + j * Ny * Nz. Basically, the displacements start from the top row
>>> and left column of the 3D matrix; then I gradually sweep to the right side
>>> of that top row, go one row down, sweep to the right side again, and so on
>>> until the bottom row.
>>>
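For what it's worth, the displacement formula can be checked by hand on a
toy case. Here is a small stand-alone snippet (my own, with made-up
dimensions Nx = 2, Ny = 2, Nz = 3 and gsize = 1) that runs the same loop
and prints what it produces:

#include <stdio.h>

int main(void)
{
    int Nx = 2, Ny = 2, Nz = 3, gsize = 1;
    int i, j;

    /* Same loop as in the program above, but only printing the values. */
    for (i = Ny/gsize - 1; i >= 0; i--) {
        for (j = 0; j < Nx; j++) {
            int idx = (Ny/gsize - 1 - i) * Nx + j;
            printf("block %d: displ = %d, blocklen = %d\n",
                   idx, i * Nz + j * Ny * Nz, Nz);
        }
    }
    return 0;
}

Blocks 0-3 get displacements 3, 9, 0 and 6 (in units of the complex type),
each with block length 3, so the 2 * 2 * 3 = 12 elements are covered
exactly once with no overlap, matching the description above.
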
>>> The strange thing is that this formula and algorithm *WORK* if I use the
>>> MPI_DOUBLE type instead of the derived complex type (1 instead of 2 in
>>> MPI_Type_contiguous). The formula also *WORKS* if I set the Nz dimension
>>> to 1 instead of 480. However, if I change Nz to even 2, I get a
>>> segmentation fault in the MPI_Type_commit call.
>>>
>>> I checked all of the displacements and they seem fine. There is no
>>> overlap between displacements, and none goes below 0 or beyond the extent
>>> of the formed indexed type. The size of the datatype is also below 4 GB,
>>> which I believe is the limit for MPI datatypes (since MPI_Type_size
>>> returns the size as an int). I also believe the amount of memory is not
>>> an issue: even if I set Nz to 2 I get the same segmentation fault, and
>>> the node has 32 GB of RAM just for this test.
>>>
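As a side note, with an MPI-3 library (which Open MPI 1.6.5 predates) the
exact size of such a large type can be queried without the int limitation
of MPI_Type_size. A fragment along these lines, added after the
MPI_Type_indexed call in the program above, would do it:

    MPI_Count type_size;

    /* MPI-3 only: reports the size in an MPI_Count, which is wider than int. */
    MPI_Type_size_x(block_send_complex_type, &type_size);
    printf("indexed type size = %lld bytes\n", (long long) type_size);
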
>>> What bothers me is that most other indexed type configurations, whether
>>> with plain MPI_DOUBLE elements or with the complex type on a smaller
>>> matrix (say 400 * 640 * 480), *WORK* without a segmentation fault. Also,
>>> if I commit the indexed type with MPI_DOUBLE elements, even larger
>>> matrices work (say 960 x 800 x 640), which has exactly the same type size
>>> as the 800 x 640 x 480 complex indexed type (just under 4 GB)! So the
>>> type size is not the issue here; somehow either the number of blocks, the
>>> size of particular blocks, or the size of the block elements creates the
>>> problem. I'm not sure whether there is a problem in the Open MPI
>>> implementation or something in my code is wrong...
>>>
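For the record, the arithmetic does check out: 800 * 640 * 480 complex
elements * 16 bytes = 3,932,160,000 bytes, and 960 * 800 * 640 double
elements * 8 bytes = 3,932,160,000 bytes as well, so the two types are
indeed the same size, just under 4 GB.
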
>>> I would greatly appreciate any help, as I've been stuck on this problem
>>> for days now and nothing in the MPI documentation or the examples I found
>>> on the internet gives me a clue where the error might be.
>>>
>>> Finally, I would like to mention that the code was compiled with Open MPI
>>> version 1.6.5.
>>>
>>> Thank you,
>>>
>>> Bogdan Sataric
>>> ----
>>>
>>> Bogdan Sataric
>>>
>>> email: bogdan.sata...@gmail.com
>>> phone: +381 21-485-2441
>>>
>>> Teaching & Research Assistant
>>> Chair for Applied Computer Science
>>> Faculty of Technical Sciences, Novi Sad, Serbia
>>>