Re: [OMPI users] Incorrect file size from MPI_File_write (_all) when using MPI derived types for both view filetype and write datatype

2017-11-06 Thread Edgar Gabriel
I'll have a look at it. I can confirm that I can replicate the problem, 
and I do not see an obvious mistake in your code for the single-process 
scenario. Will keep you posted.


Thanks

Edgar


On 11/6/2017 7:52 AM, Christopher Brady wrote:

I have been working with a Fortran code that has to write a large array to disk excluding 
an outer strip of guard cells using MPI-IO. This uses two MPI types, one representing an 
array the size of the main array without its guard cells that is passed to 
MPI_File_set_view as the filetype, and another that represents the subsection of the main 
array not including the guard cells that is used as the datatype in MPI_File_write (same 
result with MPI_File_write_all). Both subarrays are created using 
MPI_Type_create_subarray. When the file size (per core) reaches a value of 512MB the 
final output size diverges from the expected one and is always smaller than expected. It 
does not reach a hard bound, but is always smaller than expected. I have replicated this 
behaviour on machines using Open-MPI 2.1.2 and 3.0.0, and am attaching a simple test code 
(in both C and "use mpi" Fortran) that replicates the behaviour on a single 
core (the test codes only work on a single core, but I have demonstrated the same problem 
on multiple cores with our main code). While I've replicated this behaviour on several 
machines (every machine that I've tried it on), I'm also attaching the ompi_info output 
and config.log files for a machine that demonstrates the problem.
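
A minimal single-process sketch of the pattern described above; the array sizes, the 
element type, and the output file name are illustrative rather than taken from the 
attached test code:

    /* Sketch: one subarray type for the file view (array without guard cells)
     * and one for the memory layout (interior of the padded local array).
     * Sizes, names and the use of MPI_ORDER_C are assumptions for illustration. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int ng = 2;                        /* guard-cell width (assumed)   */
        int nx = 1024, ny = 1024;                /* interior size (assumed)      */
        int memsizes[2]  = { nx + 2*ng, ny + 2*ng };
        int filesizes[2] = { nx, ny };
        int subsizes[2]  = { nx, ny };
        int memstart[2]  = { ng, ng };
        int filestart[2] = { 0, 0 };

        MPI_Datatype filetype, memtype;
        MPI_Type_create_subarray(2, filesizes, subsizes, filestart,
                                 MPI_ORDER_C, MPI_DOUBLE, &filetype);
        MPI_Type_create_subarray(2, memsizes, subsizes, memstart,
                                 MPI_ORDER_C, MPI_DOUBLE, &memtype);
        MPI_Type_commit(&filetype);
        MPI_Type_commit(&memtype);

        double *array = calloc((size_t)memsizes[0] * memsizes[1], sizeof(double));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* The filetype describes the file layout, the memtype the buffer layout. */
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
        MPI_File_write_all(fh, array, 1, memtype, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(array);
        MPI_Type_free(&filetype);
        MPI_Type_free(&memtype);
        MPI_Finalize();
        return 0;
    }

The expected file size is simply nx*ny*sizeof(double) bytes; the report above concerns 
the case where that figure exceeds 512 MB per core.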

If anyone can tell me if I've made a mistake, if this is a known bug that I've 
missed in the archives (very sorry if so), or even if this is a previously 
unknown bug I'd be very grateful.

Many Thanks
Chris Brady
Senior Research Software Engineer
University of Warwick



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Incorrect file size from MPI_File_write (_all) when using MPI derived types for both view filetype and write datatype

2017-11-06 Thread Gilles Gouaillardet
Chris,

Can you try to

mpirun --mca io romio314 ...

And see if it helps ?

Cheers,

Gilles
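
A concrete form of the suggested invocation, assuming the attached C test code was 
built as ./write_test (the name is illustrative); the same component selection can 
also be made through the environment:

    mpirun --mca io romio314 -n 1 ./write_test

    # or, equivalently, via the corresponding MCA environment variable
    export OMPI_MCA_io=romio314
    mpirun -n 1 ./write_test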

Christopher Brady  wrote:
>I have been working with a Fortran code that has to write a large array to 
>disk excluding an outer strip of guard cells using MPI-IO. This uses two MPI 
>types, one representing an array the size of the main array without its guard 
>cells that is passed to MPI_File_set_view as the filetype, and another that 
>represents the subsection of the main array not including the guard cells that 
>is used as the datatype in MPI_File_write (same result with 
>MPI_File_write_all). Both subarrays are created using 
>MPI_Type_create_subarray. When the file size (per core) reaches a value of 
>512MB the final output size diverges from the expected one and is always 
>smaller than expected. It does not reach a hard bound, but is always smaller 
>than expected. I have replicated this behaviour on machines using Open-MPI 
>2.1.2 and 3.0.0, and am attaching a simple test code (in both C and "use mpi" 
>Fortran) that replicates the behaviour on a single core (the test codes only 
>work on a single core, but I have demonstrated the same problem on multiple 
>cores with our main code). While I've replicated this behaviour on several 
>machines (every machine that I've tried it on), I'm also attaching the 
>ompi_info output and config.log files for a machine that demonstrates the 
>problem.
>
>If anyone can tell me if I've made a mistake, if this is a known bug that I've 
>missed in the archives (very sorry if so), or even if this is a previously 
>unknown bug I'd be very grateful.
>
>Many Thanks
>Chris Brady
>Senior Research Software Engineer
>University of Warwick
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Incorrect file size from MPI_File_write (_all) when using MPI derived types for both view filetype and write datatype

2017-11-06 Thread Edgar Gabriel
Yes, the result is correct in that case; we must be overrunning a 
counter or something similar in the ompio code.


Edgar


On 11/6/2017 8:40 AM, Gilles Gouaillardet wrote:

Chris,

Can you try to

mpirun --mca io romio314 ...

And see if it helps ?

Cheers,

Gilles

Christopher Brady  wrote:

I have been working with a Fortran code that has to write a large array to disk excluding 
an outer strip of guard cells using MPI-IO. This uses two MPI types, one representing an 
array the size of the main array without its guard cells that is passed to 
MPI_File_set_view as the filetype, and another that represents the subsection of the main 
array not including the guard cells that is used as the datatype in MPI_File_write (same 
result with MPI_File_write_all). Both subarrays are created using 
MPI_Type_create_subarray. When the file size (per core) reaches a value of 512MB the 
final output size diverges from the expected one and is always smaller than expected. It 
does not reach a hard bound, but is always smaller than expected. I have replicated this 
behaviour on machines using Open-MPI 2.1.2 and 3.0.0, and am attaching a simple test code 
(in both C and "use mpi" Fortran) that replicates the behaviour on a single 
core (the test codes only work on a single core, but I have demonstrated the 
same problem on multiple cores with our main code). While I've replicated this 
behaviour on several machines (every machine that I've tried it on), I'm also 
attaching the ompi_info output and config.log files for a machine that 
demonstrates the problem.

If anyone can tell me if I've made a mistake, if this is a known bug that I've 
missed in the archives (very sorry if so), or even if this is a previously 
unknown bug I'd be very grateful.

Many Thanks
Chris Brady
Senior Research Software Engineer
University of Warwick


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 228    Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
--


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Incorrect file size from MPI_File_write (_all) when using MPI derived types for both view filetype and write datatype

2017-11-06 Thread Christopher Brady
Yes, that now gives the expected behaviour from both Fortran and C, both for 
the test codes and for my real code.

Chris

> On 6 Nov 2017, at 14:40, Gilles Gouaillardet  
> wrote:
> 
> Chris,
> 
> Can you try to
> 
> mpirun --mca io romio314 ...
> 
> And see if it helps ?
> 
> Cheers,
> 
> Gilles
> 
> Christopher Brady  wrote:
>> I have been working with a Fortran code that has to write a large array to 
>> disk excluding an outer strip of guard cells using MPI-IO. This uses two MPI 
>> types, one representing an array the size of the main array without its 
>> guard cells that is passed to MPI_File_set_view as the filetype, and another 
>> that represents the subsection of the main array not including the guard 
>> cells that is used as the datatype in MPI_File_write (same result with 
>> MPI_File_write_all). Both subarrays are created using 
>> MPI_Type_create_subarray. When the file size (per core) reaches a value of 
>> 512MB the final output size diverges from the expected one and is always 
>> smaller than expected. It does not reach a hard bound, but is always smaller 
>> than expected. I have replicated this behaviour on machines using Open-MPI 
>> 2.1.2 and 3.0.0, and am attaching a simple test code (in both C and "use 
>> mpi" Fortran) that replicates the behaviour on a single core (the test codes 
>> only work on a single core, but I have demonstrated the same problem on 
>> multiple cores with our main code). While I've replicated this behaviour on 
>> several machines (every machine that I've tried it on), I'm also attaching 
>> the ompi_info output and config.log files for a machine that demonstrates 
>> the problem.
>> 
>> If anyone can tell me if I've made a mistake, if this is a known bug that 
>> I've missed in the archives (very sorry if so), or even if this is a 
>> previously unknown bug I'd be very grateful.
>> 
>> Many Thanks
>> Chris Brady
>> Senior Research Software Engineer
>> University of Warwick
>> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Parallel MPI broadcasts (parameterized)

2017-11-06 Thread George Bosilca
On Sun, Nov 5, 2017 at 10:23 PM, Konstantinos Konstantinidis <
kostas1...@gmail.com> wrote:

> Hi George,
>
> First, let me note that the cost of q^(k-1)*(q-1) communicators was fine
> for the values of the parameters q, k I am working with. Also, the whole point
> of speeding up the shuffling phase is to reduce this number even
> further (compared to already known implementations), which is a major concern
> of my project. But thanks for pointing that out. By the way, do you know what
> the maximum such number is in MPI?
>

Last time I ran into such troubles the limits were: 2k for MVAPICH, 16k
for MPICH, and 2^30-1 for OMPI (all positive values of a signed 31-bit integer).
They might have changed in the meantime.


> Now to the main part of the question, let me clarify that I have 1 process
> per machine. I don't know if this is important here, but my way of thinking
> is that we have a big text file and each process will have to work on some
> chunks of it (like chapters of a book). But each process resides on a
> machine whose RAM can only handle a specific amount of work, so
> if you generate many processes per machine you must assign fewer book
> chapters per process than before. Thus, given the RAM limitations, I wanted
> to think at the machine level rather than the process level.
>
> Now to the actual shuffling, here is what I am currently doing (Option 1):
>
> Let's denote the data that slave s has to send to the slaves in group G as
> D(s,G).
>
> for each slave s in 1,2,...,K {
>
>     for each group G that s participates in {
>
>         if (my rank is s) {
>             MPI_Bcast(send data D(s,G))
>         } else if (my rank is in group G) {
>             MPI_Bcast(get data D(s,G))
>         } else {
>             Do nothing
>         }
>
>     }
>
>     MPI::COMM_WORLD.Barrier();
>
> }
>
> What I suggested before to speed things up (Option 2) is:
>
> for each set {G(1),G(2),...,G(q-1)} of q-1 disjoint groups {
>
>     for each slave s in G(1) {
>         if (my rank is s) {
>             MPI_Bcast(send data D(s,G(1)))
>         } else if (my rank is in group G(1)) {
>             MPI_Bcast(get data D(s,G(1)))
>         } else {
>             Do nothing
>         }
>     }
>
>     for each slave s in G(2) {
>         if (my rank is s) {
>             MPI_Bcast(send data D(s,G(2)))
>         } else if (my rank is in G(2)) {
>             MPI_Bcast(get data D(s,G(2)))
>         } else {
>             Do nothing
>         }
>     }
>
>     ...
>
>     for each slave s in G(q-1) {
>         if (my rank is s) {
>             MPI_Bcast(send data D(s,G(q-1)))
>         } else if (my rank is in G(q-1)) {
>             MPI_Bcast(get data D(s,G(q-1)))
>         } else {
>             Do nothing
>         }
>     }
>
>     MPI::COMM_WORLD.Barrier();
>
> }
>
> My hope was that I could implement Option 2 (in some way without copying
> and pasting the same code q-1 times every time I change q) and that this
> could bring a speedup of q-1 compared to Option 1 by having these groups
> communicate in parallel. Right now I am trying to find a way to identify
> these sets of groups based on my implementation, which involves some
> abstract algebra but for now let's assume that I can find them in an
> efficient manner.
>
> Let me emphasize that each broadcast sends different actual data. There
> are no two broadcasts that send the same D(s,G).
>
> Finally, let's go to MPI_Allgather(): I am really confused since I have
> never used this call but I have this image in my mind:
>
>
>
If every member of a group does a bcast to all other members of the same
group, then this operation is better realized by an allgather. The picture
you attached clearly exposes the data movement pattern, where each colored box
gets distributed to all members of the same communicator. You can also
see this operation as a loop of bcasts in which the iterator goes over all
members of the communicator, using each one in turn as the root.


>
> I am not sure what you meant but now I am thinking of this (let commG be
> the intra-communicator of group G):
>
> for each possible group G {
>
>     if (my rank is in G) {
>         commG.MPI_Allgather(send data D(rank,G))
>     } else {
>         Do nothing
>     }
>
>     MPI::COMM_WORLD.Barrier();
>
> }
>

This is indeed what I was thinking about, with the condition that you make
sure the list of communicators in G is ordered in the same way on all
processes.
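
A minimal sketch of that pattern, assuming the group communicators have already been 
created (for example with MPI_Comm_split) and are held in a list whose order is 
identical on every process; the helper name, the byte-typed payload, and the omission 
of error checking are all illustrative:

    /* Sketch: exchange each member's block D(rank,G) within one group
     * communicator using an allgather(v).  Communicator creation and the
     * meaning of the payload are assumed; the helper name is hypothetical. */
    #include <mpi.h>
    #include <stdlib.h>

    static void shuffle_group(MPI_Comm commG, const char *sendbuf, int sendcount)
    {
        int gsize;
        MPI_Comm_size(commG, &gsize);

        /* First gather every member's block size, then the blocks themselves. */
        int *counts = malloc(gsize * sizeof(int));
        int *displs = malloc(gsize * sizeof(int));
        MPI_Allgather(&sendcount, 1, MPI_INT, counts, 1, MPI_INT, commG);

        int total = 0;
        for (int i = 0; i < gsize; i++) { displs[i] = total; total += counts[i]; }

        char *recvbuf = malloc(total);
        MPI_Allgatherv(sendbuf, sendcount, MPI_CHAR,
                       recvbuf, counts, displs, MPI_CHAR, commG);

        /* ... consume recvbuf ... */
        free(recvbuf); free(displs); free(counts);
    }

Each process would then walk the agreed-upon list and call the helper only for the 
communicators it belongs to (e.g. those that are not MPI_COMM_NULL after 
MPI_Comm_split); since an allgather involves only the members of its own communicator, 
disjoint groups can progress independently, which is the parallelism Option 2 is after.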

That being said, this communication pattern 1) introduces a global barrier into
your code for every group; 2) as all processes will potentially be involved in
many collective communications, you will be hammering the network in a
significant way (so you will have to take network congestion into
account); and 3) all processes need to have all the receive buffers
allocated up front. Thus, even by implementing a nice communication
scheme you might encounter some performance issues.

Another way to do this is 

Re: [OMPI users] Can't connect using MPI Ports

2017-11-06 Thread Florian Lindner
On 05.11.2017 at 20:57, r...@open-mpi.org wrote:
> 
>> On Nov 5, 2017, at 6:48 AM, Florian Lindner wrote:
>>
>> On 04.11.2017 at 00:05, r...@open-mpi.org wrote:
>>> Yeah, there isn’t any way that is going to work in the 2.x series. I’m not 
>>> sure it was ever fixed, but you might try
>>> the latest 3.0, the 3.1rc, and even master.
>>>
>>> The only methods that are known to work are:
>>>
>>> * connecting processes within the same mpirun - e.g., using comm_spawn
>>
>> That is not an option for our application.
>>
>>> * connecting processes across different mpiruns, with the ompi-server 
>>> daemon as the rendezvous point
>>>
>>> The old command line method (i.e., what you are trying to use) hasn’t been 
>>> much on the radar. I don’t know if someone
>>> else has picked it up or not...
>>
>> What do you mean by "the old command line method"?
>>
>> Isn't the ompi-server just another means of exchanging port names, i.e. the 
>> same I do using files?
> 
> No, it isn’t - there is a handshake that ompi-server facilitates.
> 
>>
>> In my understanding, using Publish_name and Lookup_name or exchanging the 
>> information using files (or command line or
>> stdin) shouldn't have any
>> impact on the connection (Connect / Accept) itself.
> 
> Depends on the implementation underneath connect/accept.
> 
> The initial MPI standard authors had fixed in their minds that the 
> connect/accept handshake would take place over a TCP
> socket, and so no intermediate rendezvous broker was involved. That isn’t how 
> we’ve chosen to implement it this time
> around, and so you do need the intermediary. If/when some developer wants to 
> add another method, they are welcome to do
> so - but the general opinion was that the broker requirement was fine.

Ok. Just to make sure I understood correctly:

The MPI Ports functionality (chapter 10.4 of MPI 3.1), mainly consisting of 
MPI_Open_port, MPI_Comm_accept and
MPI_Comm_connect, is not usable without running an ompi-server as a third 
process?

Thanks again,
Florian
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Can't connect using MPI Ports

2017-11-06 Thread r...@open-mpi.org

> On Nov 6, 2017, at 7:46 AM, Florian Lindner  wrote:
> 
> On 05.11.2017 at 20:57, r...@open-mpi.org wrote:
>> 
>>> On Nov 5, 2017, at 6:48 AM, Florian Lindner wrote:
>>> 
>>> On 04.11.2017 at 00:05, r...@open-mpi.org wrote:
>>>> Yeah, there isn’t any way that is going to work in the 2.x series. I’m not 
>>>> sure it was ever fixed, but you might try
>>>> the latest 3.0, the 3.1rc, and even master.
>>>> 
>>>> The only methods that are known to work are:
>>>> 
>>>> * connecting processes within the same mpirun - e.g., using comm_spawn
>>> 
>>> That is not an option for our application.
>>> 
>>>> * connecting processes across different mpiruns, with the ompi-server 
>>>> daemon as the rendezvous point
>>>> 
>>>> The old command line method (i.e., what you are trying to use) hasn’t been 
>>>> much on the radar. I don’t know if someone
>>>> else has picked it up or not...
>>> 
>>> What do you mean by "the old command line method"?
>>> 
>>> Isn't the ompi-server just another means of exchanging port names, i.e. the 
>>> same I do using files?
>> 
>> No, it isn’t - there is a handshake that ompi-server facilitates.
>> 
>>> 
>>> In my understanding, using Publish_name and Lookup_name or exchanging the 
>>> information using files (or command line or
>>> stdin) shouldn't have any
>>> impact on the connection (Connect / Accept) itself.
>> 
>> Depends on the implementation underneath connect/accept.
>> 
>> The initial MPI standard authors had fixed in their minds that the 
>> connect/accept handshake would take place over a TCP
>> socket, and so no intermediate rendezvous broker was involved. That isn’t 
>> how we’ve chosen to implement it this time
>> around, and so you do need the intermediary. If/when some developer wants to 
>> add another method, they are welcome to do
>> so - but the general opinion was that the broker requirement was fine.
> 
> Ok. Just to make sure I understood correctly:
> 
> The MPI Ports functionality (chapter 10.4 of MPI 3.1), mainly consisting of 
> MPI_Open_port, MPI_Comm_accept and
> MPI_Comm_connect, is not usable without running an ompi-server as a third 
> process?

Yes, that’s correct. The reason for moving in that direction is that the 
resource managers, as they continue to integrate PMIx into them, are going to 
be providing that third party. This will make connect/accept much easier to 
use, and a great deal more scalable.

See https://github.com/pmix/RFCs/blob/master/RFC0003.md for an explanation.
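
For reference, a minimal sketch of the two sides of the chapter 10.4 interface using 
the name-publishing calls; the service name and the absence of error handling are 
illustrative. With Open MPI, both jobs also have to be pointed at the rendezvous 
daemon, e.g. by starting ompi-server --report-uri <file> and launching each side with 
mpirun --ompi-server file:<file>.

    /* Sketch of the accept/connect sides (MPI 3.1, chapter 10.4).  The service
     * name is illustrative; error handling is omitted for brevity. */
    #include <mpi.h>

    /* Server side: open a port, publish it, and wait for one client. */
    void server_side(void)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm client;
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("my_service", MPI_INFO_NULL, port);  /* name is illustrative */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
        /* ... communicate over 'client' ... */
        MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
        MPI_Close_port(port);
        MPI_Comm_disconnect(&client);
    }

    /* Client side: look the published port up and connect. */
    void client_side(void)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm server;
        MPI_Lookup_name("my_service", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
        /* ... communicate over 'server' ... */
        MPI_Comm_disconnect(&server);
    }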

> 
> Thanks again,
> Florian
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Incorrect file size from MPI_File_write (_all) when using MPI derived types for both view filetype and write datatype

2017-11-06 Thread Edgar Gabriel
Ok, I found the problem: a book-keeping error when the read/write 
operation spans multiple internal cycles. Will commit a fix shortly to 
master, and will try to get a PR into the 3.0.x and the upcoming 3.1.x 
series; not sure, however, which precise release the fix will make it into.


Thanks for the bug report!

Edgar


On 11/6/2017 8:25 AM, Edgar Gabriel wrote:

I'll have a look at it. I can confirm that I can replicate the problem,
and I do not see an obvious mistake in your code for the single-process
scenario. Will keep you posted.

Thanks

Edgar


On 11/6/2017 7:52 AM, Christopher Brady wrote:

I have been working with a Fortran code that has to write a large array to disk excluding 
an outer strip of guard cells using MPI-IO. This uses two MPI types, one representing an 
array the size of the main array without its guard cells that is passed to 
MPI_File_set_view as the filetype, and another that represents the subsection of the main 
array not including the guard cells that is used as the datatype in MPI_File_write (same 
result with MPI_File_write_all). Both subarrays are created using 
MPI_Type_create_subarray. When the file size (per core) reaches a value of 512MB the 
final output size diverges from the expected one and is always smaller than expected. It 
does not reach a hard bound, but is always smaller than expected. I have replicated this 
behaviour on machines using Open-MPI 2.1.2 and 3.0.0, and am attaching a simple test code 
(in both C and "use mpi" Fortran) that replicates the behaviour on a single 
core (the test codes only work on a single core, but I have demonstrated the same problem 
on multiple cores with our main code). While I've replicated this behaviour on several 
machines (every machine that I've tried it on), I'm also attaching the ompi_info output 
and config.log files for a machine that demonstrates the problem.

If anyone can tell me if I've made a mistake, if this is a known bug that I've 
missed in the archives (very sorry if so), or even if this is a previously 
unknown bug I'd be very grateful.

Many Thanks
Chris Brady
Senior Research Software Engineer
University of Warwick


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users