[OMPI users] How to reduce Isend & Irecv bandwidth?

2013-05-01 Thread Thomas Watson
Hi,

I have a program where each MPI rank hosts a set of data blocks. After
doing computation over *some of* its local data blocks, each MPI rank needs
to exchange data with other ranks. Note that the computation may involve
only a subset of the data blocks on an MPI rank. The data exchange is done
at each MPI rank through Isend and Irecv, followed by Waitall to complete
the requests. Each pair of Isend and Irecv exchanges a corresponding pair
of data blocks at different ranks. Right now, we do an Isend/Irecv for
EVERY block!
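
Roughly, the current per-block exchange looks like the sketch below
(NBLOCKS, BLOCK_LEN, and peer are placeholder names, not our real code):

#include <mpi.h>

#define NBLOCKS   64        /* placeholder block count */
#define BLOCK_LEN 1024      /* placeholder block size  */

/* Exchange every block with one peer: one Isend + one Irecv per block,
 * matched by using the block index as the tag, then a single Waitall. */
void exchange_all_blocks(int peer,
                         double send_blocks[NBLOCKS][BLOCK_LEN],
                         double recv_blocks[NBLOCKS][BLOCK_LEN],
                         MPI_Comm comm)
{
    MPI_Request reqs[2 * NBLOCKS];

    for (int m = 0; m < NBLOCKS; m++) {
        MPI_Isend(send_blocks[m], BLOCK_LEN, MPI_DOUBLE, peer, m, comm,
                  &reqs[2 * m]);
        MPI_Irecv(recv_blocks[m], BLOCK_LEN, MPI_DOUBLE, peer, m, comm,
                  &reqs[2 * m + 1]);
    }
    MPI_Waitall(2 * NBLOCKS, reqs, MPI_STATUSES_IGNORE);
}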

The idea is that because the computation at a rank may involve only a
subset of the blocks, we could mark those blocks as dirty during the
computation. Then, to reduce data exchange bandwidth, we would exchange
only those *dirty* pairs across ranks.

The problem is: if a rank does not compute on a block 'm' and therefore
does not call Isend for 'm', then the receiving rank must somehow know this
and either a) not call Irecv for 'm' either, or b) let the Irecv for 'm'
fail gracefully.

My questions are:
1. How will Irecv behave (or rather, how will MPI_Waitall behave) if the
corresponding Isend is never posted?

2. If we still post an Isend for 'm' even though there is no data to send
for it, can I set a "flag" in the Isend so that MPI_Waitall on the
receiving side completes (effectively "cancels") the corresponding Irecv
immediately? For example, could I set the count in the Isend to 0, so that
when MPI_Waitall on the receiving side sees a message with an empty
payload, it reclaims the corresponding Irecv? In my code, the
correspondence between a pair of Isend and Irecv is established by a
matching TAG.

Thanks!

Jacky


Re: [OMPI users] How to reduce Isend & Irecv bandwidth?

2013-05-01 Thread Gus Correa

Maybe start the data exchange by sending a (presumably short)
list/array/index-function of the dirty/not-dirty block status
(say, 0 = not dirty, 1 = dirty), and then put if conditionals before the
Isend/Irecv so that only dirty blocks are exchanged?
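
Something along these lines, as a rough sketch only (NBLOCKS, BLOCK_LEN,
dirty, and peer are made-up names, not from your code):

#include <mpi.h>

#define NBLOCKS   64              /* placeholder values              */
#define BLOCK_LEN 1024
#define FLAG_TAG  (NBLOCKS + 1)   /* keep flags off the block tags   */

void exchange_with_peer(int peer, int dirty[NBLOCKS],
                        double send_blocks[NBLOCKS][BLOCK_LEN],
                        double recv_blocks[NBLOCKS][BLOCK_LEN],
                        MPI_Comm comm)
{
    int peer_dirty[NBLOCKS];
    MPI_Request flag_reqs[2], reqs[2 * NBLOCKS];
    int nreq = 0;

    /* Phase 1: swap the (short) dirty-flag arrays with the peer. */
    MPI_Isend(dirty, NBLOCKS, MPI_INT, peer, FLAG_TAG, comm, &flag_reqs[0]);
    MPI_Irecv(peer_dirty, NBLOCKS, MPI_INT, peer, FLAG_TAG, comm,
              &flag_reqs[1]);
    MPI_Waitall(2, flag_reqs, MPI_STATUSES_IGNORE);

    /* Phase 2: post Isend/Irecv only for blocks marked dirty;
     * the block index is the matching tag. */
    for (int m = 0; m < NBLOCKS; m++) {
        if (dirty[m])
            MPI_Isend(send_blocks[m], BLOCK_LEN, MPI_DOUBLE, peer, m, comm,
                      &reqs[nreq++]);
        if (peer_dirty[m])
            MPI_Irecv(recv_blocks[m], BLOCK_LEN, MPI_DOUBLE, peer, m, comm,
                      &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}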

I hope this helps,
Gus Correa




Re: [OMPI users] How to reduce Isend & Irecv bandwidth?

2013-05-01 Thread Thomas Watson
Hi Gus,

Thanks for your suggestion!

The problem with this two-phase data exchange is as follows. Each rank can
have data blocks that will be exchanged with potentially all other ranks. So
if a rank needs to tell all the other ranks which blocks they will receive,
phase one would require an all-to-all collective communication
(e.g., MPI_Allgatherv). Because such collectives are blocking
in the current stable Open MPI (MPI-2), this would have a negative impact on
the scalability of the application, especially when we have a large number of
MPI ranks. That negative impact would not be compensated by the bandwidth
saved :-)

What I really need is something like this: the Isend uses a count of 0 if a
block is not dirty. On the receiving side, MPI_Waitall completes the
corresponding Irecv request immediately and sets its handle to
MPI_REQUEST_NULL, just as it would for a normal Irecv. Could someone
confirm this behavior? I could run an experiment on this too...

Regards,

Jacky






Re: [OMPI users] How to reduce Isend & Irecv bandwidth?

2013-05-01 Thread Gus Correa

Hi Thomas/Jacky

Maybe use MPI_Probe (and perhaps also MPI_Cancel)
to check the incoming message size,
and receive only those messages with size > 0?
Anyway, I'm just code-guessing.
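
As a very rough sketch of that idea, and assuming the sender still posts a
send (possibly with count 0) for every block so the probe cannot hang
(names are made up):

#include <mpi.h>

#define BLOCK_LEN 1024   /* placeholder block size */

/* Probe the incoming message for block 'm', check its size, and only do
 * the big receive when there is real payload. The zero-size message
 * still has to be received so it does not stay queued. */
int recv_block_if_dirty(int peer, int m, double *recv_buf, MPI_Comm comm)
{
    MPI_Status status;
    int incoming;

    MPI_Probe(peer, m, comm, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &incoming);

    if (incoming > 0) {
        MPI_Recv(recv_buf, BLOCK_LEN, MPI_DOUBLE, peer, m, comm,
                 MPI_STATUS_IGNORE);
        return 1;                      /* block was updated         */
    }

    MPI_Recv(recv_buf, 0, MPI_DOUBLE, peer, m, comm, MPI_STATUS_IGNORE);
    return 0;                          /* block was clean on the peer */
}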

I hope it helps,
Gus Correa


Re: [OMPI users] How to reduce Isend & Irecv bandwidth?

2013-05-01 Thread Aurélien Bouteiller
Hi Jacky, 

1. If you do not post a matching send, the wait(all) on the recv will stall
forever.
2. You can match a recv(count, tag, src) with a send(0, tag, dst). The recv
will complete, and the status can be inspected to verify how many bytes were
actually received. It is illegal to send more than what count can hold at
the receiver, but it is perfectly fine to send less.
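
A minimal sketch of point 2 (illustrative names only): the receiver always
posts the full-size Irecv, the sender passes count 0 for clean blocks, and
MPI_Get_count on the completed receive status reports how much data
actually arrived.

#include <mpi.h>
#include <stdio.h>

#define BLOCK_LEN 1024   /* placeholder block size */

/* Exchange one block 'm' with 'peer'. If our copy is clean we still post
 * the Isend, but with count 0, so the peer's full-size Irecv completes
 * normally with an empty payload. */
void exchange_block(int peer, int m, int dirty,
                    double *send_buf, double *recv_buf, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    int received;

    MPI_Isend(send_buf, dirty ? BLOCK_LEN : 0, MPI_DOUBLE, peer, m, comm,
              &reqs[0]);
    /* The incoming message may be shorter than the count posted here;
     * only sending more than BLOCK_LEN would be an error. */
    MPI_Irecv(recv_buf, BLOCK_LEN, MPI_DOUBLE, peer, m, comm, &reqs[1]);

    MPI_Waitall(2, reqs, stats);

    /* Inspect the receive status to see whether any payload arrived. */
    MPI_Get_count(&stats[1], MPI_DOUBLE, &received);
    if (received == 0)
        printf("block %d from rank %d was clean; nothing to merge\n",
               m, peer);
}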

Hope it helps, 
Aurelien


Re: [OMPI users] How to reduce Isend & Irecv bandwidth?

2013-05-01 Thread Thomas Watson
Hi Aurelien,

Excellent! Point 2) is exactly what I need - no data is actually sent and
Irecv completes normally.

Thanks!

Jacky

