Re: [OMPI users] MPI I/O, Romio vs Ompio on GPFS

2022-06-14 Thread Edgar Gabriel via users
Hi,

There are a few things that you could test to see whether they make a difference.


  1.  Try modifying the number of aggregators used in collective I/O (assuming 
that the code uses collective I/O). You could try, e.g., setting it to the number 
of nodes used (the algorithm that determines the number of aggregators 
automatically is sometimes overly aggressive). E.g.:



mpirun --mca io_ompio_num_aggregators 16 -np 256 ./executable_name



(assuming here that you run 256 processes distributed over 16 nodes). Based on 
our tests from a while back, GPFS was not super sensitive to this, but you never 
know; it's worth a try.



  2.  If your data is large and mostly contiguous, you could try disabling 
data sieving for write operations, e.g.:



mpirun --mca fbtl_posix_write_datasieving 0 -np 256 ./…

Let me know if these make a difference. There are quite a few info objects 
that the GPFS fs component understands and that could potentially be used to 
tune performance, but I do not have experience with them; they are based on 
code contributed by HLRS a couple of years ago. You can still have a look at 
them and see whether some of them would make sense (source location: 
ompi/ompi/mca/fs/gpfs/fs_gpfs_file_set_info.c).
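
Another way to experiment with such knobs is to pass MPI-IO info hints at 
file-open time instead of (or in addition to) MCA parameters. A minimal sketch, 
assuming the standard "cb_nodes" hint (number of collective-buffering 
aggregators) and a hypothetical file name; whether ompio and the GPFS fs 
component honor any given key would need to be checked against 
fs_gpfs_file_set_info.c:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* "cb_nodes" is the standard MPI-IO hint for the number of
     * collective-buffering aggregators; support by a given io/fs
     * component has to be verified. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "16");

    /* Hypothetical output file name, for illustration only. */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes (e.g. MPI_File_write_at_all) go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}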

Thanks
Edgar


From: users  On Behalf Of Eric Chamberland 
via users
Sent: Saturday, June 11, 2022 9:28 PM
To: Open MPI Users 
Cc: Eric Chamberland ; Ramses van Zon 
; Vivien Clauzon ; 
dave.mar...@giref.ulaval.ca; Thomas Briffard 
Subject: Re: [OMPI users] MPI I/O, Romio vs Ompio on GPFS


Hi,

I just about found what I wanted with "--mca io_base_verbose 100".

Now I am looking at performance on GPFS, and I must say Open MPI 4.1.2 performs 
very poorly when it comes time to write.

I am launching a 512-process job that reads + computes ghost components of a 
mesh, and then later writes a 79 GB file.

Here are the timings (all in seconds):



IO module ; reading + ghost computing ; writing
ompio     ; 24.9 ; 2040+ (job got killed before completion)
romio321  ; 20.8 ; 15.6



I have run the job many times with the ompio module (the default) and with 
romio, and the timings are always similar to those given.

I also activated maximum debug output with "--mca mca_base_verbose 
stdout,level:9 --mca mpi_show_mca_params all --mca io_base_verbose 100" and 
got a few lines, but nothing relevant for debugging:

Sat Jun 11 20:08:28 2022:chrono::ecritMaillageMPI::debut VmSize: 
6530408 VmRSS: 5599604 VmPeak: 7706396 VmData: 5734408 VmHWM: 5699324 
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: 
deleting file: resultat01_-2.mail
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: 
Checking all available modules
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: 
component available: ompio, priority: 30
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: 
component available: romio321, priority: 10
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] io:base:delete: 
Selected io component ompio
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] 
io:base:file_select: new file: resultat01_-2.mail
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] 
io:base:file_select: Checking all available modules
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] 
io:base:file_select: component available: ompio, priority: 30
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] 
io:base:file_select: component available: romio321, priority: 10
Sat Jun 11 20:08:28 2022:[nia0073.scinet.local:236683] 
io:base:file_select: Selected io module ompio

What else can I do to dig into this?

Are there GPFS-specific parameters that ompio is aware of?
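
For reference, one way to see the hints that actually end up attached to an 
open file at runtime is MPI_File_get_info; a minimal sketch (the file name is 
just a placeholder):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    int rank, nkeys, i, flag;
    char key[MPI_MAX_INFO_KEY + 1], value[MPI_MAX_INFO_VAL + 1];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Placeholder file name, for illustration only. */
    MPI_File_open(MPI_COMM_WORLD, "dump_hints.tmp",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Retrieve and print the hints the io component actually applied. */
    MPI_File_get_info(fh, &info);
    MPI_Info_get_nkeys(info, &nkeys);
    if (rank == 0) {
        for (i = 0; i < nkeys; i++) {
            MPI_Info_get_nthkey(info, i, key);
            MPI_Info_get(info, key, MPI_MAX_INFO_VAL, value, &flag);
            if (flag)
                printf("%s = %s\n", key, value);
        }
    }

    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}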

Thanks,

Eric

--

Eric Chamberland, ing., M. Ing

Professionnel de recherche

GIREF/Université Laval

(418) 656-2131 poste 41 22 42
On 2022-06-10 16:23, Eric Chamberland via users wrote:
Hi,

I want to try romio with Open MPI 4.1.2 because I am observing a big performance 
difference with Intel MPI on GPFS.

I want to see, at *runtime*, all parameters (default values, names) used by MPI 
(at least for the "io" framework).

I would like to have all the same output as "ompi_info --all" gives me...

I have tried this:

mpiexec --mca io romio321  --mca mca_verbose 1  --mca mpi_show_mca_params 1 
--mca io_base_verbose 1 ...

But I cannot see anything about io coming out...

With "ompi_info" I do...

Is it possible?

Thanks,

Eric


--

Eric Chamberland, ing., M. Ing

Professionnel de recherche

GIREF/Université Laval

(418) 656-2131 poste 41 22 42


Re: [OMPI users] CephFS and striping_factor

2022-11-29 Thread Edgar Gabriel via users

I can also offer to help if there are any questions regarding the ompio code, 
but I do not have the bandwidth/resources to do that myself, and more 
importantly, I do not have a platform to test the new component.
Edgar

From: users  On Behalf Of Jeff Squyres 
(jsquyres) via users
Sent: Tuesday, November 29, 2022 9:16 AM
To: users@lists.open-mpi.org
Cc: Jeff Squyres (jsquyres) 
Subject: Re: [OMPI users] CephFS and striping_factor

More specifically, Gilles created a skeleton "ceph" component in this draft 
pull request: https://github.com/open-mpi/ompi/pull/11122

If anyone has any cycles to work on it and develop it beyond the skeleton that 
is currently there, that would be great!

--
Jeff Squyres
jsquy...@cisco.com

From: users <users-boun...@lists.open-mpi.org> on behalf of Gilles Gouaillardet 
via users <users@lists.open-mpi.org>
Sent: Monday, November 28, 2022 9:48 PM
To: users@lists.open-mpi.org <users@lists.open-mpi.org>
Cc: Gilles Gouaillardet <gil...@rist.or.jp>
Subject: Re: [OMPI users] CephFS and striping_factor

Hi Eric,


Currently, Open MPI does not provide specific support for CephFS.

MPI-IO is implemented either by ROMIO (imported from MPICH; it does not 
support CephFS today) or by the "native" ompio component (which also does not 
support CephFS today).


A proof of concept for CephFS in ompio might not be a huge amount of work for 
someone motivated: it could be as simple as (so to speak, since these things 
are generally not easy) creating a new fs/ceph component (e.g. in 
ompi/mca/fs/ceph) and implementing the "file_open" callback using the ceph API.

I think the fs/lustre component can be used as an inspiration.
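
As a very rough sketch of the libcephfs side of such a "file_open" callback 
(only the ceph calls, no Open MPI component glue; the function signatures 
should be double-checked against libcephfs.h, and passing 0 to mean "use the 
filesystem default" for stripe_unit/object_size is an assumption):

#include <cephfs/libcephfs.h>
#include <fcntl.h>

/* Hypothetical helper: open a file on CephFS with a requested stripe count
 * (which is what the MPI "striping_factor" hint would carry). */
static int open_with_striping(const char *path, int striping_factor,
                              struct ceph_mount_info **cmount_out, int *fd_out)
{
    struct ceph_mount_info *cmount;
    int fd, rc;

    rc = ceph_create(&cmount, NULL);          /* default client id */
    if (rc < 0) return rc;
    rc = ceph_conf_read_file(cmount, NULL);   /* default ceph.conf lookup */
    if (rc < 0) return rc;
    rc = ceph_mount(cmount, "/");             /* mount the filesystem root */
    if (rc < 0) return rc;

    /* stripe_unit/object_size of 0 assumed to mean "filesystem default";
     * stripe_count carries the striping_factor hint. */
    fd = ceph_open_layout(cmount, path, O_CREAT | O_WRONLY, 0644,
                          0 /* stripe_unit */, striping_factor,
                          0 /* object_size */, NULL /* data pool */);
    if (fd < 0) return fd;

    *cmount_out = cmount;
    *fd_out     = fd;
    return 0;
}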


I cannot commit to doing this, but if you are willing to take a crack at it, I 
can create such a component so you can go directly to implementing the callback 
without spending too much time on some Open MPI internals (e.g. component 
creation).



Cheers,


Gilles


On 11/29/2022 6:55 AM, Eric Chamberland via users wrote:
> Hi,
>
> I would like to know if OpenMPI is supporting file creation with
> "striping_factor" for CephFS?
>
> According to CephFS library, I *think* it would be possible to do it
> at file creation with "ceph_open_layout".
>
> https://github.com/ceph/ceph/blob/main/src/include/cephfs/libcephfs.h
>
> Is it a possible future enhancement?
>
> Thanks,
>
> Eric
>


Re: [OMPI users] GPU direct in OMPIO?

2022-12-05 Thread Edgar Gabriel via users
There was work done in ompio in that direction, but the code wasn't actually 
committed into the main repository. It probably exists in a branch somewhere. 
If you are interested, please ping me directly and I can put you in contact 
with the person who wrote the code and clarify the precise status.

Thanks
Edgar

From: users  On Behalf Of Jim Edwards via 
users
Sent: Monday, December 5, 2022 8:16 AM
To: Open MPI Users 
Cc: Jim Edwards 
Subject: [OMPI users] GPU direct in OMPIO?

Greetings,

Does the OMPIO library support GPU-Direct IO? NVIDIA seems to suggest that it 
does, but I can't find details or examples.

--
Jim Edwards
CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO


Re: [OMPI users] What is the best choice of pml and btl for intranode communication

2023-03-06 Thread Edgar Gabriel via users

UCX will disqualify itself unless it finds CUDA, ROCm, or an InfiniBand network 
to use. To allow UCX to run in a regular shared-memory job without GPUs or IB, 
you have to explicitly set the UCX_TLS environment variable to let UCX use shm, 
e.g.:

mpirun -x UCX_TLS=shm,self,ib --mca pml ucx ….

(I think you can also set UCX_TLS=all, but I am not entirely sure.)

Thanks
Edgar


From: users  On Behalf Of George Bosilca via 
users
Sent: Monday, March 6, 2023 8:56 AM
To: Open MPI Users 
Cc: George Bosilca 
Subject: Re: [OMPI users] What is the best choice of pml and btl for intranode 
communication

The ucx PML should work just fine even in a single-node scenario. As Jeff 
indicated, you need to move the MCA param `--mca pml ucx` before your command.

  George.


On Mon, Mar 6, 2023 at 9:48 AM Jeff Squyres (jsquyres) via users 
<users@lists.open-mpi.org> wrote:
If this run was on a single node, then UCX probably disabled itself since it 
wouldn't be using InfiniBand or RoCE to communicate between peers.

Also, I'm not sure your command line was correct:


perf_benchmark $ mpirun -np 32 --map-by core --bind-to core ./perf  --mca pml 
ucx

You probably need to list all of mpirun's CLI options before you list the 
./perf executable. In its left-to-right traversal, once mpirun hits a CLI 
option it does not recognize (e.g., "./perf"), it assumes that it is the user's 
executable name, and does not process the CLI options to the right of that.

Hence, the output you show must have forced the UCX PML another way -- perhaps 
you set an environment variable or something?


From: users <users-boun...@lists.open-mpi.org> on behalf of Chandran, Arun via 
users <users@lists.open-mpi.org>
Sent: Monday, March 6, 2023 3:33 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Chandran, Arun <arun.chand...@amd.com>
Subject: Re: [OMPI users] What is the best choice of pml and btl for intranode 
communication





Hi Gilles,



Thanks very much for the information.



I was looking for the best pml + btl combination for a standalone intra-node 
run with a high task count (>= 192) and no HPC-class networking installed.



Just now I realized that I can't use pml ucx for such cases, as it is unable to 
find IB and fails.



perf_benchmark $ mpirun -np 32 --map-by core --bind-to core ./perf  --mca pml 
ucx

--

No components were able to be opened in the pml framework.



This typically means that either no components of this type were

installed, or none of the installed components can be loaded.

Sometimes this means that shared libraries required by these

components are unable to be found/loaded.



  Host:  lib-ssp-04

  Framework: pml

--

[lib-ssp-04:753542] PML ucx cannot be selected

[lib-ssp-04:753531] PML ucx cannot be selected

[lib-ssp-04:753541] PML ucx cannot be selected

[lib-ssp-04:753539] PML ucx cannot be selected

[lib-ssp-04:753545] PML ucx cannot be selected

[lib-ssp-04:753547] PML ucx cannot be selected

[lib-ssp-04:753572] PML ucx cannot be selected

[lib-ssp-04:753538] PML ucx cannot be selected

[lib-ssp-04:753530] PML ucx cannot be selected

[lib-ssp-04:753537] PML ucx cannot be selected

[lib-ssp-04:753546] PML ucx cannot be selected

[lib-ssp-04:753544] PML ucx cannot be selected

[lib-ssp-04:753570] PML ucx cannot be selected

[lib-ssp-04:753567] PML ucx cannot be selected

[lib-ssp-04:753534] PML ucx cannot be selected

[lib-ssp-04:753592] PML ucx cannot be selected

[lib-ssp-04:753529] PML ucx cannot be selected





That means my only choice is pml/ob1 + btl/vader.



--Arun



From: users <users-boun...@lists.open-mpi.org> On Behalf Of Gilles Gouaillardet 
via users
Sent: Monday, March 6, 2023 12:56 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Subject: Re: [OMPI users] What is the best choice of pml and btl for intranode 
communication






Arun,



First, Open MPI selects a pml for **all** the MPI tasks (for example, pml/ucx 
or pml/ob1).



Then, if pml/ob1 ends up being selected, a btl component (e.g. btl/uct, 
btl/vader) is used for each pair of MPI tasks

(tasks on the same node will use btl/vader, tasks on different nodes will use 
btl/uct)



Note that if UCX is available, pml/ucx takes the highest priority, so no btl is 
involved (in your case, it means intra-node communications will be handled by 
UCX and not btl/vader).

You can force ob1 and try different combinations of btl with

mpirun --mca pml ob1 --mca btl self,<btl1>,<btl2> ...



I expect pml/ucx to be faster than pml/ob1 with btl/uct for inter-node 
communications.



I have not benchmarked