[OMPI users] Segmentation fault with MPI_Type_indexed

2015-03-05 Thread Bogdan Sataric
I've been having problems with my 3D matrix transpose program. I'm using
MPI_Type_indexed in order to align specific blocks that I want to send and
receive across one or multiple nodes of a cluster. Up to a few days ago I was
able to run my program without any errors. However, several test cases on the
cluster in the last few days exposed a segmentation fault when I try to form
the indexed type for some specific matrix configurations.

The code that forms the indexed type is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {

    int Nx = 800;
    int Ny = 640;
    int Nz = 480;
    int gsize;
    int i, j;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    printf("GSIZE: %d\n", gsize);

    MPI_Datatype double_complex_type;
    MPI_Datatype block_send_complex_type;

    int * send_displ = (int *) malloc(Nx * Ny/gsize * sizeof(int));
    int * send_blocklen = (int *) malloc(Nx * Ny/gsize * sizeof(int));

    /* complex number = 2 contiguous doubles */
    MPI_Type_contiguous(2, MPI_DOUBLE, &double_complex_type);
    MPI_Type_commit(&double_complex_type);

    /* each block is a whole Nz "rod"; rods are ordered top row first,
       sweeping left to right within a row */
    for (i = Ny/gsize - 1; i >= 0; i--) {
        for (j = 0; j < Nx; j++) {
            send_displ[(Ny/gsize - 1 - i) * Nx + j] = i * Nz + j * Ny * Nz;
            send_blocklen[(Ny/gsize - 1 - i) * Nx + j] = Nz;
        }
    }

    MPI_Type_indexed(Nx * Ny/gsize, send_blocklen, send_displ,
                     double_complex_type, &block_send_complex_type);
    MPI_Type_commit(&block_send_complex_type);   /* segfaults here */

    free(send_displ);
    free(send_blocklen);

    MPI_Finalize();
}

The values of Nx, Ny, and Nz are 800, 640, and 480, respectively. The value of
gsize for this test was 1 (simulating the MPI program on a single node). The
node has 32 GB of RAM and no other memory had been allocated (only this code
was running).

In the code I create double_complex_type to represent complex-number values
(2 contiguous MPI_DOUBLE). The whole matrix has 800 * 640 * 480 of these
values, and I'm trying to capture all of them in the indexed type. Each block
of the indexed type is a whole Nz "rod", and the ordering of these "rods" in
the displacement array is given by the formula i * Nz + j * Ny * Nz. The
displacements start at the top row and left column of the 3D matrix; I then
sweep to the right end of that top row, move one row down, sweep to the right
again, and so on until the bottom row.
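
To make the ordering concrete, here is what the loop produces for a
scaled-down case (Nx = 2, Ny = 2, Nz = 3, gsize = 1; not the real dimensions,
just an illustration of the formula):

    i = 1, j = 0:  send_displ[0] = 1*3 + 0*2*3 = 3   send_blocklen[0] = 3
    i = 1, j = 1:  send_displ[1] = 1*3 + 1*2*3 = 9   send_blocklen[1] = 3
    i = 0, j = 0:  send_displ[2] = 0*3 + 0*2*3 = 0   send_blocklen[2] = 3
    i = 0, j = 1:  send_displ[3] = 0*3 + 1*2*3 = 6   send_blocklen[3] = 3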

The strange thing is that this formula and algorithm *WORK* if I use the
MPI_DOUBLE type instead of the derived complex type (1 instead of 2 in
MPI_Type_contiguous). The formula also *WORKS* if I set the Nz dimension to 1
instead of 480. However, if I change Nz to even 2, I get a segmentation fault
in the MPI_Type_commit call.

I checked all of the displacements and they seem fine. There is no overlap
between displacements, and none of them goes below 0 or beyond the extent of
the formed indexed type. Also, the size of the datatype is below 4 GB, which
I believe is the limit for MPI datatypes (since MPI_Type_size returns an int).
I also believe the amount of memory is not an issue: even if I set Nz to 2 I
get the same segmentation fault, and the node has 32 GB of RAM just for this
test.
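
For reference, the size works out as follows (simple arithmetic, counting
2 doubles = 16 bytes per complex element):

    800 * 640 * 480 elements * 16 bytes = 3,932,160,000 bytes  (just under 4 GB)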

What bothers me is that most other indexed-type configurations (with plain
MPI_DOUBLE elements), or the complex type with a smaller matrix (say
400 * 640 * 480), *WORK* without a segmentation fault. Also, if I commit the
indexed type with MPI_DOUBLE elements, even larger matrices work (say
960 x 800 x 640), which has exactly the same type size as the 800 x 640 x 480
complex indexed type (just under 4 GB)! So the type size is not the issue
here; somehow either the number of blocks, the size of particular blocks, or
the size of the block elements creates the problem. I'm not sure whether the
problem is in the Open MPI implementation or something in my code is wrong...

I would greatly appreciate any help, as I've been stuck on this problem for
days now and nothing in the MPI documentation or the examples I found on the
internet gives me a clue where the error might be.

Finally, I would like to mention that the code was compiled with Open MPI
version 1.6.5.

Thank you,

Bogdan Sataric


Bogdan Sataric

email: bogdan.sata...@gmail.com
phone: +381 21-485-2441

Teaching & Research Assistant
Chair for Applied Computer Science
Faculty of Technical Sciences, Novi Sad, Serbia


Re: [OMPI users] Segmentation fault with MPI_Type_indexed

2015-03-05 Thread George Bosilca
Bogdan,

As far as I can tell your code is correct, and the problem is coming from
Open MPI. More specifically, I used alloca in the optimization stage in
MPI_Type_commit, and as your arrays of lengths were too large, alloca failed
and led to a segfault. I fixed it in the trunk (3c489ea), and the fix will get
into our next release.

Unfortunately, there is no fix for the 1.6 series that I can think of.
Apparently you really are the first to run into this kind of problem, so I
guess you are the first to create such gigantic datatypes.
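
For a rough sense of scale (a back-of-the-envelope estimate; the exact
temporaries depend on the code path), the description arrays in this test
have:

    count = Nx * Ny / gsize = 800 * 640 / 1 = 512,000 entries

so anything placed on the stack proportionally to that count quickly reaches
several megabytes, which is in the range of the usual 8 MB default stack
limit.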

Thanks for the bug report,
  George.


Re: [OMPI users] Segmentation fault with MPI_Type_indexed

2015-03-05 Thread Tom Rosmond
Actually, you are not the first to encounter this problem with
MPI_Type_indexed for very large datatypes.  I also run a 1.6 release, and I
solved the problem by switching to MPI_Type_create_hindexed for the datatype.
The critical difference is that the displacements for MPI_Type_indexed are of
type integer, i.e. 32-bit values, while for MPI_Type_create_hindexed the
displacements are byte displacements of type MPI_ADDRESS_KIND (MPI_Aint in C),
i.e. normally 64 bit, and therefore of effectively unlimited size.  Otherwise
the two types can be used identically.
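
A minimal sketch of that switch for the code in this thread (untested,
reusing send_blocklen and double_complex_type from the original program; note
that the byte displacements should be computed in MPI_Aint arithmetic so the
multiplication does not overflow a 32-bit int):

    MPI_Aint *send_displ_b = (MPI_Aint *) malloc(Nx * Ny/gsize * sizeof(MPI_Aint));
    MPI_Aint elem_bytes = 2 * sizeof(double);   /* extent of double_complex_type */

    for (i = Ny/gsize - 1; i >= 0; i--) {
        for (j = 0; j < Nx; j++) {
            /* same element offset as before, converted to a byte offset */
            send_displ_b[(Ny/gsize - 1 - i) * Nx + j] =
                ((MPI_Aint)i * Nz + (MPI_Aint)j * Ny * Nz) * elem_bytes;
        }
    }

    MPI_Type_create_hindexed(Nx * Ny/gsize, send_blocklen, send_displ_b,
                             double_complex_type, &block_send_complex_type);
    MPI_Type_commit(&block_send_complex_type);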

T. Rosmond


Re: [OMPI users] Segmentation fault with MPI_Type_indexed

2015-03-05 Thread Bogdan Sataric
Hello George,

So is it safe for me to assume that my code is good and that you will remove
this bug from the next Open MPI version?

Also, I would like to know which future Open MPI version will incorporate
this fix (so I can try my code with the fixed version)?

Thank you,



Bogdan Sataric

email: bogdan.sata...@gmail.com
phone: +381 21-485-2441

Teaching & Research Assistant
Chair for Applied Computer Science
Faculty of Technical Sciences, Novi Sad, Serbia


Re: [OMPI users] Segmentation fault with MPI_Type_indexed

2015-03-05 Thread Bogdan Sataric
Hello Tom,

Actually, I have tried using MPI_Type_create_hindexed, but the same problem
persisted for the same matrix dimensions.

The displacement-array values are not the problem. A matrix of size
800 x 640 x 480 creates a type that is a bit less than 4 GB in the
complex-datatype case, so it definitely fits in the 32-bit range. So it is not
a 32/64-bit issue, at least not for the displacement values in this case.

Regards,



Bogdan Sataric

email: bogdan.sata...@gmail.com
phone: +381 21-485-2441

Teaching & Research Assistant
Chair for Applied Computer Science
Faculty of Technical Sciences, Novi Sad, Serbia


Re: [OMPI users] Segmentation fault with MPI_Type_indexed

2015-03-05 Thread George Bosilca
On Thu, Mar 5, 2015 at 6:22 PM, Bogdan Sataric 
wrote:

> Hello George,
>
> So is it safe for me to assume that my code is good and that you will
> remove this bug from the next Open MPI version?
>

Yes, I think it is safe to assume your code is correct (or at least that it
follows the specification you describe in your email).

> Also, I would like to know which future Open MPI version will incorporate
> this fix (so I can try my code with the fixed version)?
>

I pushed the code into the trunk and created a request to get it into 1.8.5.
So you can try any nightly build starting from tonight, and then any stable
release after 1.8.4.

  George.

