The provided code sample is not correct, so the real issue has nothing to do 
with the amount of data to be handled by the MPI implementation. Scale the 
amount you allocate down to 2^27 and the issue will still persist…

Your MPI_Allgatherv operation receives recvCount[i] elements of type MPI_INT 
from each peer and places them in memory starting at displacement displ[i] 
(in elements) from rbuf. Thus, for this application to work as you expect, the 
receive buffer must be large enough to hold the data sent by __all__ peers. 
That is not the case in this application. In other words, rbuf must be at 
least bufsize * nproc * sizeof(int) bytes for your application to be correct.

Your application works for 2 processes with bufsize = 2^28 because 
2^28 * 2 = 2^29 stays under the limit of the allocated memory, 2^30 - 1. For 
larger amounts, 2^29 * 2 = 2^30 exceeds the allocated 2^30 - 1, and the same 
happens when you increase the number of processes.
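
A minimal, untested sketch of an allocation that satisfies this requirement, 
reusing bufsize, nproc, recvCount and displ as they appear in the attached 
test code:

  /* receive buffer sized for the contributions of ALL peers */
  long total = (long) bufsize * nproc;               /* total elements landing in rbuf */
  int *sbuf  = malloc((long) bufsize * sizeof(int)); /* each rank sends bufsize ints   */
  int *rbuf  = malloc(total * sizeof(int));          /* bufsize * nproc * sizeof(int)  */

  /* recvCount[i] = bufsize and displ[i] = bufsize*i, as in the original loop */
  MPI_Allgatherv(sbuf, recvCount[0], MPI_INT,
                 rbuf, recvCount, displ, MPI_INT, MPI_COMM_WORLD);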

  George.

On Aug 6, 2013, at 03:37, Jeff Hammond <jeff.scie...@gmail.com> wrote:

> As your code prints OK without verifying the correctness of the
> result, you are only verifying the lack of segfault in OpenMPI, which
> is necessary but not sufficient for correct execution.
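> 
> A verification loop along these lines (an untested sketch; it assumes the
> bufsize and displ values from your code) would catch wrong data as well as
> crashes:
> 
>   long errors = 0;
>   for (int j = 0; j < nproc; ++j)
>       for (long i = 0; i < bufsize; ++i)
>           if (rbuf[displ[j] + i] != j + i)   /* rank j stored j+i in sbuf[i] */
>               ++errors;
>   printf("errors: %ld\n", errors);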
> 
> It is not uncommon for MPI implementations to have issues near
> count=2^31.  I can't speak to the extent to which OpenMPI is
> rigorously correct in this respect.  I've yet to find an
> implementation which is end-to-end count-safe, which includes support
> for zettabyte buffers via MPI datatypes for collectives,
> point-to-point, RMA and IO.
> 
> The easy solution for your case is to chop MPI_Allgatherv into
> multiple calls.  In the case where the array of send counts is near
> uniform, you can do N MPI_Allgather calls and 1 MPI_Allgatherv, which
> might help performance in some cases.
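> 
> A rough, untested sketch of the chunking idea, using MPI_Allgatherv for
> every round (rather than the Allgather/Allgatherv mix) to keep the
> displacements explicit; CHUNK is an arbitrary per-call limit well below
> 2^31, and cnt/displ are your full-size count and displacement arrays,
> assumed uniform here:
> 
>   const int CHUNK = 1 << 24;              /* 16M ints per call            */
>   int done = 0;
>   while (done < cnt[0]) {                 /* uniform counts assumed       */
>       int c = cnt[0] - done;
>       if (c > CHUNK) c = CHUNK;
>       int cc[nproc], dd[nproc];           /* this round's counts/displs   */
>       for (int j = 0; j < nproc; ++j) {
>           cc[j] = c;
>           dd[j] = displ[j] + done;        /* offset into rank j's slot    */
>       }
>       MPI_Allgatherv(sbuf + done, c, MPI_INT,
>                      rbuf, cc, dd, MPI_INT, MPI_COMM_WORLD);
>       done += c;
>   }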
> 
> Since most MPI implementations use Send/Recv under the hood for
> collectives, you can aid in the debugging of this issue by testing
> MPI_Send/Recv for count->2^31.
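> 
> Something along these lines (untested; rank and a buffer of roughly 8 GB
> are assumed) exercises the large-count path directly:
> 
>   long count = (1L << 31) - 4;             /* just under INT_MAX          */
>   int *buf = malloc(count * sizeof(int));  /* roughly 8 GB                */
>   if (rank == 0)
>       MPI_Send(buf, (int) count, MPI_INT, 1, 0, MPI_COMM_WORLD);
>   else if (rank == 1)
>       MPI_Recv(buf, (int) count, MPI_INT, 0, 0, MPI_COMM_WORLD,
>                MPI_STATUS_IGNORE);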
> 
> Best,
> 
> Jeff
> 
> On Mon, Aug 5, 2013 at 6:48 PM, ryan He <ryan.qing...@gmail.com> wrote:
>> Dear All,
>> 
>> I wrote a simple test code that uses the MPI_Allgatherv function. The problem
>> appears when the send buffer size becomes relatively large.
>> 
>> When Bufsize = 2^28 - 1, run on 4 processors: OK
>> When Bufsize = 2^28, run on 4 processors: Error
>> [btl_tcp_frag.c:209:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv
>> error (0xffffffff85f526f8, 2147483592) Bad address(1)
>> 
>> When Bufsize = 2^29 - 1, run on 2 processors: OK
>> When Bufsize = 2^29, run on 2 processors: Error
>> [btl_tcp_frag.c:209:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv
>> error (0xffffffff964605d0, 2147483632) Bad address(1)
>> 
>> Bufsize is not that close to the int limit, but the readv in
>> mca_btl_tcp_frag_recv has a size close to 2147483647. Does anyone have an
>> idea why this error occurs? Any suggestion on how to solve or avoid it?
>> 
>> The simple test code is attached below:
>> 
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>> #include <time.h>
>> #include "mpi.h"
>> 
>> int main(int argc, char **argv)
>> {
>>     int myid, nproc;
>>     long i;
>>     long size;
>>     long bufsize;
>>     int *rbuf;
>>     int *sbuf;
>>     char hostname[MPI_MAX_PROCESSOR_NAME];
>>     int len;
>> 
>>     size = (long) 2*1024*1024*1024-1;
>> 
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &myid);
>>     MPI_Comm_size(MPI_COMM_WORLD, &nproc);
>>     MPI_Get_processor_name(hostname, &len);
>>     printf("I am process %d with pid: %d at %s\n", myid, getpid(), hostname);
>>     sleep(2);
>> 
>>     if (myid == 0)
>>         printf("size : %ld\n", size);
>> 
>>     /* Send and receive buffers, "size" elements each.  Note that
>>        sizeof(MPI_INT) is the size of the datatype handle, not
>>        necessarily sizeof(int). */
>>     sbuf = (int *) calloc(size, sizeof(MPI_INT));
>>     if (sbuf == NULL) {
>>         printf("fail to allocate memory of sbuf\n");
>>         exit(1);
>>     }
>>     rbuf = (int *) calloc(size, sizeof(MPI_INT));
>>     if (rbuf == NULL) {
>>         printf("fail to allocate memory of rbuf\n");
>>         exit(1);
>>     }
>> 
>>     int *recvCount = calloc(nproc, sizeof(int));
>>     int *displ = calloc(nproc, sizeof(int));
>> 
>>     bufsize = 268435456; /* which is 2^28 */
>> 
>>     /* every rank contributes bufsize elements, placed back to back in rbuf */
>>     for (i = 0; i < nproc; ++i) {
>>         recvCount[i] = bufsize;
>>         displ[i] = bufsize * i;
>>     }
>> 
>>     for (i = 0; i < bufsize; ++i)
>>         sbuf[i] = myid + i;
>> 
>>     printf("buffer size: %ld recvCount[0]:%d last displ index:%d\n",
>>            bufsize, recvCount[0], displ[nproc-1]);
>>     fflush(stdout);
>> 
>>     MPI_Allgatherv(sbuf, recvCount[0], MPI_INT,
>>                    rbuf, recvCount, displ, MPI_INT, MPI_COMM_WORLD);
>> 
>>     printf("OK\n");
>>     fflush(stdout);
>> 
>>     MPI_Finalize();
>>     return 0;
>> }
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> -- 
> Jeff Hammond
> jeff.scie...@gmail.com
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

