Re: [OMPI users] Parallel MPI broadcasts (parameterized)
OK, I started implementing the above Allgather() idea without success (segmentation fault), so I will post the problematic lines here:

    comm.Allgather(&(endata.size), 1, MPI::UNSIGNED_LONG_LONG, &(endata_rcv.size), 1, MPI::UNSIGNED_LONG_LONG);
    endata_rcv.data = new unsigned char[endata_rcv.size*lineSize];
    comm.Allgather(&(endata.data), endata.size*lineSize, MPI::UNSIGNED_CHAR, &(endata_rcv.data), endata_rcv.size*lineSize, MPI::UNSIGNED_CHAR);
    delete [] endata.data;

The idea (as it was also for the broadcasts) is first to transmit the data size as an unsigned long long integer, so that the receivers can reserve the required memory for the actual data to be transmitted after that. To my understanding, the problem is that each broadcast's data, say D(s,G), as I explained in the previous email, is not only different but in general also has a different size. I say that because if I replace the 3rd line with

    comm.Allgather(&(endata.data), 1, MPI::UNSIGNED_CHAR, &(endata_rcv.data), 1, MPI::UNSIGNED_CHAR);

it seems to work without a segmentation fault, but this is pointless for me since I don't want only 1 char to be transmitted. So, looking at the previous image I posted, imagine that the red, green and blue squares differ in size. Can Allgather() even work then? If not, do you suggest anything else, or am I stuck using MPI_Bcast() as shown in Option 1?

On Mon, Nov 6, 2017 at 8:58 AM, George Bosilca wrote:

> On Sun, Nov 5, 2017 at 10:23 PM, Konstantinos Konstantinidis <kostas1...@gmail.com> wrote:
>
>> Hi George,
>>
>> First, let me note that the cost of q^(k-1)*(q-1) communicators was fine for the values of the parameters q, k I am working with. Also, the whole point of speeding up the shuffling phase is trying to reduce this number even more (compared to already known implementations), which is a major concern of my project. But thanks for pointing that out. Btw, do you know what the maximum such number in MPI is?
>
> Last time I ran into such troubles these limits were: 2k for MVAPICH, 16k for MPICH and 2^30-1 for OMPI (all positive signed 32-bit integers). It might have changed meanwhile.
>
>> Now to the main part of the question, let me clarify that I have 1 process per machine. I don't know if this is important here, but my way of thinking is that we have a big text file and each process will have to work on some chunks of it (like chapters of a book). But each process resides on a machine with some RAM which is able to handle a specific amount of work, so if you generate many processes per machine you must have fewer book chapters per process than before. Thus, I wanted to avoid thinking at the process level rather than the machine level, with the RAM limitations in mind.
>>
>> Now to the actual shuffling, here is what I am currently doing (Option 1).
>>
>> Let's denote the data that slave s has to send to the slaves in group G as D(s,G).
>>
>>   for each slave s in 1,2,...,K{
>>     for each group G that s participates in{
>>       if (my rank is s){
>>         MPI_Bcast(send data D(s,G))
>>       }else if(my rank is in group G){
>>         MPI_Bcast(get data D(s,G))
>>       }else{
>>         Do nothing
>>       }
>>     }
>>     MPI::COMM_WORLD.Barrier();
>>   }
>>
>> What I suggested before to speed things up (Option 2) is:
>>
>>   for each set {G(1),G(2),...,G(q-1)} of q-1 disjoint groups{
>>
>>     for each slave s in G(1){
>>       if (my rank is s){
>>         MPI_Bcast(send data D(s,G(1)))
>>       }else if(my rank is in group G(1)){
>>         MPI_Bcast(get data D(s,G(1)))
>>       }else{
>>         Do nothing
>>       }
>>     }
>>
>>     for each slave s in G(2){
>>       if (my rank is s){
>>         MPI_Bcast(send data D(s,G(2)))
>>       }else if(my rank is in G(2)){
>>         MPI_Bcast(get data D(s,G(2)))
>>       }else{
>>         Do nothing
>>       }
>>     }
>>
>>     ...
>>
>>     for each slave s in G(q-1){
>>       if (my rank is s){
>>         MPI_Bcast(send data D(s,G(q-1)))
>>       }else if(my rank is in G(q-1)){
>>         MPI_Bcast(get data D(s,G(q-1)))
>>       }else{
>>         Do nothing
>>       }
>>     }
>>
>>     MPI::COMM_WORLD.Barrier();
>>   }
>>
>> My hope was that I could implement Option 2 (in some way without copying and pasting the same code q-1 times every time I change q) and that this could bring a speedup of q-1 compared to Option 1 by having these groups communicate in parallel. Right now I am trying to find a way to identify these sets of groups based on my implementation, which involves some abstract algebra, but for now let's assume that I can find them in an efficient manner.
>>
>> Let me emphasize that each broadcast sends different actual data. There are no two broadcasts that send the same D(s,G).
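[Editorial note] The Option 1 pattern quoted above can be captured once as a helper instead of being written out per group. The following is a minimal sketch, assuming each group G already has its own communicator (for example one created with MPI_Comm_create); groupComm, rootRankInGroup and payload are illustrative names, not identifiers from the thread's code:

    #include <mpi.h>

    // Sketch of one Option 1 step: slave s broadcasts D(s,G) inside group G.
    // Assumes groupComm contains exactly the ranks of G (and is MPI_COMM_NULL
    // on everyone else), and rootRankInGroup is s's rank within groupComm.
    void bcast_in_group(MPI_Comm groupComm, int rootRankInGroup,
                        unsigned char** payload, unsigned long long* payloadSize)
    {
        if (groupComm == MPI_COMM_NULL)          // not a member of G: do nothing
            return;

        // Step 1: broadcast the size so receivers can allocate memory first.
        MPI_Bcast(payloadSize, 1, MPI_UNSIGNED_LONG_LONG, rootRankInGroup, groupComm);

        int myRank;
        MPI_Comm_rank(groupComm, &myRank);
        if (myRank != rootRankInGroup)           // the root already owns its buffer
            *payload = new unsigned char[*payloadSize];

        // Step 2: broadcast the actual data D(s,G) (the count must fit in an int).
        MPI_Bcast(*payload, static_cast<int>(*payloadSize), MPI_UNSIGNED_CHAR,
                  rootRankInGroup, groupComm);
    }

Option 2 would then amount to calling this helper for the q-1 disjoint groups whose communicators can progress concurrently.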
Re: [OMPI users] Parallel MPI broadcasts (parameterized)
If each process sends a different amount of data, then the operation should be an allgatherv. This also requires that you know the amount each process will send, so you will need an allgather. Schematically the code should look like the following:

    long bytes_send_count = endata.size * sizeof(long);          // compute the amount of data sent by this process
    long* recv_counts = (long*)malloc(comm_size * sizeof(long)); // allocate buffer to receive the amounts from all peers
    int* displs = (int*)malloc(comm_size * sizeof(int));         // allocate buffer to compute the displacements for each peer
    MPI_Allgather( &bytes_send_count, 1, MPI_LONG, recv_counts, 1, MPI_LONG, comm); // exchange the amount of sent data
    long total = 0;                                              // we need a total amount of data to be received
    for( int i = 0; i < comm_size; i++) {
        displs[i] = total;        // update the displacements
        total += recv_counts[i];  // and the total count
    }
    char* recv_buf = (char*)malloc(total * sizeof(char));        // prepare buffer for the allgatherv
    MPI_Allgatherv( &(endata.data), endata.size*sizeof(char), MPI_UNSIGNED_CHAR, recv_buf, recv_counts, displs, MPI_UNSIGNED_CHAR, comm);

George.

On Tue, Nov 7, 2017 at 4:23 AM, Konstantinos Konstantinidis <kostas1...@gmail.com> wrote:

> OK, I started implementing the above Allgather() idea without success (segmentation fault), so I will post the problematic lines here:
>
>     comm.Allgather(&(endata.size), 1, MPI::UNSIGNED_LONG_LONG, &(endata_rcv.size), 1, MPI::UNSIGNED_LONG_LONG);
>     endata_rcv.data = new unsigned char[endata_rcv.size*lineSize];
>     comm.Allgather(&(endata.data), endata.size*lineSize, MPI::UNSIGNED_CHAR, &(endata_rcv.data), endata_rcv.size*lineSize, MPI::UNSIGNED_CHAR);
>     delete [] endata.data;
>
> The idea (as it was also for the broadcasts) is first to transmit the data size as an unsigned long long integer, so that the receivers can reserve the required memory for the actual data to be transmitted after that. To my understanding, the problem is that each broadcast's data, say D(s,G), is not only different but in general also has a different size. I say that because if I replace the 3rd line with
>
>     comm.Allgather(&(endata.data), 1, MPI::UNSIGNED_CHAR, &(endata_rcv.data), 1, MPI::UNSIGNED_CHAR);
>
> it seems to work without a segmentation fault, but this is pointless for me since I don't want only 1 char to be transmitted. So, looking at the previous image I posted, imagine that the red, green and blue squares differ in size. Can Allgather() even work then? If not, do you suggest anything else, or am I stuck using MPI_Bcast() as shown in Option 1?
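[Editorial note] The schematic above conveys the idea; as a compilable variant (a sketch, not the code from the thread), keep in mind that MPI_Allgatherv takes its receive counts and displacements as int arrays, and that the payload pointer itself (endata.data, not &endata.data) must be passed as the send buffer. The EnData struct below is assumed to mirror the thread's endata:

    #include <mpi.h>
    #include <vector>

    // Assumed shape of the per-process payload, mirroring "endata" in the thread.
    struct EnData {
        unsigned long long size;   // number of lines this process sends
        unsigned char*     data;   // size*lineSize bytes of payload
    };

    // Returns the concatenated payloads of all peers; recv_counts/displs are
    // filled with per-peer byte counts and byte offsets into the result.
    unsigned char* gather_all(MPI_Comm comm, const EnData& endata,
                              unsigned long long lineSize,
                              std::vector<int>& recv_counts,
                              std::vector<int>& displs)
    {
        int comm_size;
        MPI_Comm_size(comm, &comm_size);
        recv_counts.resize(comm_size);
        displs.resize(comm_size);

        // Step 1: everybody learns how many bytes everybody else will send.
        int send_bytes = static_cast<int>(endata.size * lineSize);
        MPI_Allgather(&send_bytes, 1, MPI_INT, recv_counts.data(), 1, MPI_INT, comm);

        // Step 2: prefix-sum the counts into displacements and a total.
        int total = 0;
        for (int i = 0; i < comm_size; ++i) {
            displs[i] = total;
            total += recv_counts[i];
        }

        // Step 3: gather the payloads. Note endata.data (the buffer itself),
        // not &endata.data (the address of the pointer).
        unsigned char* recv_buf = new unsigned char[total];
        MPI_Allgatherv(endata.data, send_bytes, MPI_UNSIGNED_CHAR,
                       recv_buf, recv_counts.data(), displs.data(),
                       MPI_UNSIGNED_CHAR, comm);
        return recv_buf;
    }

A worker would call gather_all once per group communicator; the per-peer counts and displacements are exactly what is needed to slice the result afterwards.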
[OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls
Hello,

In debugging a test of an application, I recently came across odd behavior for simultaneous MPI_Abort calls. Namely, while the MPI_Abort was acknowledged by the process output, the mpirun process failed to exit. I was able to duplicate this behavior on multiple machines with OpenMPI versions 1.10.2, 1.10.5, and 1.10.6 with the following simple program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("I am process number %d\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 3);
        return 0;
    }

Is this a bug or a feature? Does this behavior exist in OpenMPI versions 2.0 and 3.0?

Best,
Nik

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Parallel MPI broadcasts (parameterized)
OK, I will try to explain a few more things about the shuffling, and I have attached only specific excerpts of the code to avoid confusion. I have added many comments.

First, let me note that this project is an implementation of the Terasort benchmark with a master node which assigns jobs to the slaves and communicates with them after each phase to get measurements.

The file shuffle_before.cc shows how I am doing the shuffling up to now, and shuffle_after.cc the progress I have made so far switching to Allgatherv(). I have also included the code that measures time and data size, since it's crucial for me to check whether I get a rate speedup.

Some questions I have are:

1. At shuffle_after.cc:61, why do we reserve comm.Get_size() entries for recv_counts and not comm.Get_size()-1? For example, if I am rank k, what is the point of recv_counts[k-1]? I guess that rank k also receives data from himself, but we can ignore it, right?

2. My next concern is about the structure of the buffer recv_buf[]. The documentation says that the data is stored there ordered. So I assume that it's stored as segments of char* ordered by rank, and the way to distinguish them is to chop the whole data based on recv_counts[]. So let G = {g1, g2, ..., gN} be a group that exchanges data, and take slave g2: then segment recv_buf[0 until recv_counts[0]-1] is what g2 received from g1, recv_buf[recv_counts[0] until recv_counts[1]-1] is what g2 received from himself (ignore it), and so on... Is this idea correct?

So I have written a sketch of the code in shuffle_after.cc, where I also try to explain how the master will compute the rate, but for now I just want to make it work. I know that this discussion is getting long, but if you have some free time can you take a look at it?

Thanks,
Kostas

On Tue, Nov 7, 2017 at 9:34 AM, George Bosilca wrote:

> If each process sends a different amount of data, then the operation should be an allgatherv. This also requires that you know the amount each process will send, so you will need an allgather. Schematically the code should look like the following:
>
> [...]
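[Editorial note] On question 2 above: with the displacement array from the allgather sketch earlier in the thread, the segment contributed by rank i begins at recv_buf + displs[i] and spans recv_counts[i] bytes. A small illustrative sketch (names are not from shuffle_after.cc):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Split the Allgatherv result into per-peer segments. Peer i's data occupies
    // recv_buf[displs[i]] .. recv_buf[displs[i] + recv_counts[i] - 1]; the
    // boundaries come from displs, not from chopping by recv_counts alone.
    std::vector<std::pair<const unsigned char*, int> >
    split_by_peer(const unsigned char* recv_buf,
                  const std::vector<int>& recv_counts,
                  const std::vector<int>& displs,
                  int my_rank)
    {
        std::vector<std::pair<const unsigned char*, int> > segments;
        for (std::size_t i = 0; i < recv_counts.size(); ++i) {
            if (static_cast<int>(i) == my_rank)
                continue;                 // skip this rank's own contribution
            segments.push_back(std::make_pair(recv_buf + displs[i], recv_counts[i]));
        }
        return segments;
    }

This is only bookkeeping on top of the displacements; it does not move any data.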
Re: [OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls
Hi,

On Tue, Nov 07, 2017 at 02:05:20PM -0700, Nikolas Antolin wrote:
> Hello,
>
> In debugging a test of an application, I recently came across odd behavior for simultaneous MPI_Abort calls. Namely, while the MPI_Abort was acknowledged by the process output, the mpirun process failed to exit. I was able to duplicate this behavior on multiple machines with OpenMPI versions 1.10.2, 1.10.5, and 1.10.6 with the following simple program:
>
> [...]
>
> Is this a bug or a feature? Does this behavior exist in OpenMPI versions 2.0 and 3.0?

I compiled your test case on CentOS-7 with openmpi 1.10.7/2.1.2 and 3.0.0 and the program seems to run fine.

[tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do module purge && module add openmpi/$i && mpicc aa.c -o aa-$i && ldd aa-$i; mpirun -n 2 ./aa-$i ; done

    linux-vdso.so.1 => (0x7ffe115bd000)
    libmpi.so.12 => /c7/shared/openmpi/1.10.7/lib/libmpi.so.12 (0x7f40d7b4a000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x7f40d78f7000)
    libc.so.6 => /lib64/libc.so.6 (0x7f40d7534000)
    libopen-rte.so.12 => /c7/shared/openmpi/1.10.7/lib/libopen-rte.so.12 (0x7f40d72b8000)
    libopen-pal.so.13 => /c7/shared/openmpi/1.10.7/lib/libopen-pal.so.13 (0x7f40d6fd9000)
    libnuma.so.1 => /lib64/libnuma.so.1 (0x7f40d6dcd000)
    libdl.so.2 => /lib64/libdl.so.2 (0x7f40d6bc9000)
    librt.so.1 => /lib64/librt.so.1 (0x7f40d69c)
    libm.so.6 => /lib64/libm.so.6 (0x7f40d66be000)
    libutil.so.1 => /lib64/libutil.so.1 (0x7f40d64bb000)
    /lib64/ld-linux-x86-64.so.2 (0x55f6d96c4000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f40d62a4000)

I am process number 1
I am process number 0
--
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--
[borma.bis.pasteur.fr:08511] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:08511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

    linux-vdso.so.1 => (0x7fffaabcd000)
    libmpi.so.20 => /c7/shared/openmpi/2.1.2/lib/libmpi.so.20 (0x7f5bcee39000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x7f5bcebe6000)
    libc.so.6 => /lib64/libc.so.6 (0x7f5bce823000)
    libopen-rte.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-rte.so.20 (0x7f5bce5a)
    libopen-pal.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-pal.so.20 (0x7f5bce2a7000)
    libdl.so.2 => /lib64/libdl.so.2 (0x7f5bce0a3000)
    libnuma.so.1 => /lib64/libnuma.so.1 (0x7f5bcde97000)
    libudev.so.1 => /lib64/libudev.so.1 (0x7f5bcde81000)
    librt.so.1 => /lib64/librt.so.1 (0x7f5bcdc79000)
    libm.so.6 => /lib64/libm.so.6 (0x7f5bcd977000)
    libutil.so.1 => /lib64/libutil.so.1 (0x7f5bcd773000)
    /lib64/ld-linux-x86-64.so.2 (0x55718df01000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f5bcd55d000)
    libcap.so.2 => /lib64/libcap.so.2 (0x7f5bcd357000)
    libdw.so.1 => /lib64/libdw.so.1 (0x7f5bcd11)
    libattr.so.1 => /lib64/libattr.so.1 (0x7f5bccf0b000)
    libelf.so.1 => /lib64/libelf.so.1 (0x7f5bcccf2000)
    libz.so.1 => /lib64/libz.so.1 (0x7f5bccadc000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x7f5bcc8b6000)
    libbz2.so.1 => /lib64/libbz2.so.1 (0x7f5bcc6a5000)

I am process number 1
I am process number 0
--
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--
[borma.bis.pasteur.fr:08534] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:08534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Re: [OMPI users] Parallel MPI broadcasts (parameterized)
On Tue, Nov 7, 2017 at 6:09 PM, Konstantinos Konstantinidis <kostas1...@gmail.com> wrote:

> OK, I will try to explain a few more things about the shuffling, and I have attached only specific excerpts of the code to avoid confusion. I have added many comments.
>
> First, let me note that this project is an implementation of the Terasort benchmark with a master node which assigns jobs to the slaves and communicates with them after each phase to get measurements.
>
> The file shuffle_before.cc shows how I am doing the shuffling up to now, and shuffle_after.cc the progress I have made so far switching to Allgatherv(). I have also included the code that measures time and data size, since it's crucial for me to check whether I get a rate speedup.
>
> Some questions I have are:
>
> 1. At shuffle_after.cc:61, why do we reserve comm.Get_size() entries for recv_counts and not comm.Get_size()-1? For example, if I am rank k, what is the point of recv_counts[k-1]? I guess that rank k also receives data from himself, but we can ignore it, right?

No, you can't simply ignore it ;) allgather copies the same amount of data to all processes in the communicator ... including itself. If you want to argue about this, reach out to the MPI standardization body ;)

> 2. My next concern is about the structure of the buffer recv_buf[]. The documentation says that the data is stored there ordered. So I assume that it's stored as segments of char* ordered by rank, and the way to distinguish them is to chop the whole data based on recv_counts[]. So let G = {g1, g2, ..., gN} be a group that exchanges data, and take slave g2: then segment recv_buf[0 until recv_counts[0]-1] is what g2 received from g1, recv_buf[recv_counts[0] until recv_counts[1]-1] is what g2 received from himself (ignore it), and so on... Is this idea correct?

I don't know what documentation says "ordered"; there is no such wording in the MPI standard. By carefully playing with the receive datatype I can do anything I want, including interleaving data from the different peers. But this is not what you are trying to do here. The placement in memory you describe is true if you use the displacement array as crafted in my example. The entry i in the displacement array specifies the displacement (relative to recvbuf) at which to place the incoming data from process i, so where you receive data has nothing to do with the amount you receive but with what you have in the displacement array.

> So I have written a sketch of the code in shuffle_after.cc, where I also try to explain how the master will compute the rate, but for now I just want to make it work.

This code looks OK to me. I would however:

1. Remove the barriers on the workerComm. If the order of the communicators in the multicastGroupMap is identical on all processes (including communicators they do not belong to), then the barriers are superfluous. However, if you are trying to protect your processes from starting the allgather collective too early, then you can replace the barrier on workerComm with one on mcComm.

2. The check "ns.find(rank) != ns.end()" should be equivalent to "mcComm == MPI_COMM_NULL" if I understand your code correctly.

3. This is an optimization. Move all time exchanges outside the main loop.
Instead of sending them one-by-one, keep them in an array and send the entire array once per CodedWorker::execShuffle, possibly via an MPI_Allgatherv toward the master process in MPI_COMM_WORLD (in this case you can convert the "long long" into a double to facilitate the collective).

George.

> I know that this discussion is getting long, but if you have some free time can you take a look at it?
>
> Thanks,
> Kostas
>
> [...]
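[Editorial note] A sketch of suggestion 3: George mentions an MPI_Allgatherv toward the master; the sketch below uses MPI_Gatherv instead, on the assumption that only the master needs the samples. Apart from the MPI calls, the function and variable names are illustrative rather than taken from shuffle_after.cc:

    #include <mpi.h>
    #include <vector>

    // Instead of sending one timing value per shuffle iteration, accumulate
    // them locally and ship the whole array to the master once at the end.
    void report_times_once(MPI_Comm world,
                           const std::vector<long long>& shuffle_times_usec,
                           int master_rank)
    {
        // Convert to double so a single MPI_DOUBLE collective can carry them.
        std::vector<double> as_double(shuffle_times_usec.begin(),
                                      shuffle_times_usec.end());

        int rank, size;
        MPI_Comm_rank(world, &rank);
        MPI_Comm_size(world, &size);

        int my_count = static_cast<int>(as_double.size());
        std::vector<int> counts(size), displs(size);

        // The master learns how many samples each worker will send.
        MPI_Gather(&my_count, 1, MPI_INT,
                   counts.data(), 1, MPI_INT, master_rank, world);

        std::vector<double> all_times;
        if (rank == master_rank) {
            int total = 0;
            for (int i = 0; i < size; ++i) { displs[i] = total; total += counts[i]; }
            all_times.resize(total);
        }

        // One variable-size gather replaces many small per-iteration messages.
        MPI_Gatherv(as_double.data(), my_count, MPI_DOUBLE,
                    all_times.data(), counts.data(), displs.data(), MPI_DOUBLE,
                    master_rank, world);
        // On the master, all_times now holds every worker's samples, ordered by rank.
    }

Each worker would push one duration per shuffle iteration into shuffle_times_usec and call report_times_once exactly once at the end of CodedWorker::execShuffle.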