Dear Gilles,
                    Please check my replies inline.

>>> Can you please post the output of
>>> ompi_info --param btl vader --level 3
>>> with both Open MPI 3.1 and 4.1?
openMPI3.1.1
------------------
$ ompi_info --param btl vader --level 3
                 MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.1)
           MCA btl vader: ---------------------------------------------------
           MCA btl vader: parameter "btl_vader_single_copy_mechanism"
                          (current value: "cma", data source: default, level:
                          3 user/all, type: int)
                          Single copy mechanism to use (defaults to best
                          available)
                          Valid values: 1:"cma", 3:"none"
openMPI4.1.0
------------------
$ ompi_info --param btl vader --level 3
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.0)
           MCA btl vader: ---------------------------------------------------
           MCA btl vader: parameter "btl_vader_single_copy_mechanism"
                          (current value: "cma", data source: default, level:
                          3 user/all, type: int)
                          Single copy mechanism to use (defaults to best
                          available)
                          Valid values: 1:"cma", 4:"emulated", 3:"none"
           MCA btl vader: parameter "btl_vader_backing_directory" (current
                          value: "/dev/shm", data source: default, level: 3
                          user/all, type: string)
                          Directory to place backing files for shared memory
                          communication. This directory should be on a local
                          filesystem such as /tmp or /dev/shm (default:
                          (linux) /dev/shm, (others) session directory)
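
Side note: since the 4.1.0 build also lists an "emulated" mechanism, one way to isolate the single-copy path (just a sketch reusing the command line from the runs below) would be to pin the mechanism explicitly on the 4.1.0 build and compare:
$ mpirun --mca btl_vader_single_copy_mechanism cma --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
$ mpirun --mca btl_vader_single_copy_mechanism none --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
If both runs give similarly low mflops, the single-copy mechanism is probably not the culprit.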

>>> What if you run with only 2 MPI ranks?
>>> do you observe similar performance differences between Open MPI 3.1 and 4.1?
When I run with only 2 MPI ranks, the performance regression is not significant.
openMPI3.1.1 gives MFLOPS: 11122
openMPI4.1.0 gives MFLOPS: 11041

With Regards,
S. Biplab Raut

From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Sent: Friday, March 12, 2021 7:07 PM
To: Raut, S Biplab <biplab.r...@amd.com>
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

Can you please post the output of

ompi_info --param btl vader --level 3

with both Open MPI 3.1 and 4.1?



What if you run with only 2 MPI ranks?

do you observe similar performance differences between Open MPI 3.1 and 4.1?



Cheers,

Gilles

On Fri, Mar 12, 2021 at 6:31 PM Raut, S Biplab <biplab.r...@amd.com> wrote:
Dear Gilles,
                    Thank you for the reply.

>>> when running
>>> mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
>>> I got similar flops with Open MPI 3.1.1, 3.1.6, 4.1.0 and 4.1.1rc1 on my system
>>> If you are using a different command line, please let me know and I will give it a try
Although the command line I normally use is different, I also ran with the above command line that you used.
I still find that openMPI4.1.0 performs poorly compared to openMPI3.1.1. Please check the details below; I have also provided my system details in case they matter.
openMPI3.1.1
-------------------
$ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
Problem: ic1000000, setup: 552.20 ms, time: 1.33 ms, ``mflops'': 75143
$ ompi_info --all|grep 'command line'
  Configure command line: '--prefix=/home/server/ompi3/gcc' 
'--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' 
'--enable-static=yes' '--enable-mpi1-compatibility'
                          User-specified command line parameters passed to 
ROMIO's configure script
                          Complete set of command line parameters passed to 
ROMIO's configure script

openMPI4.1.0
-------------------
$ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
Problem: ic1000000, setup: 557.12 ms, time: 1.75 ms, ``mflops'': 57029
$ ompi_info --all|grep 'command line'
  Configure command line: '--prefix=/home/server/ompi4_plain' 
'--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' 
'--enable-static=yes' '--enable-mpi1-compatibility'
                          User-specified command line parameters passed to 
ROMIO's configure script
                          Complete set of command line parameters passed to 
ROMIO's configure script

openMPI4.1.0 + xpmem
--------------------------------
$ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
  Local host: lib-daytonax-03
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
Problem: ic1000000, setup: 559.55 ms, time: 1.77 ms, ``mflops'': 56280
$ ompi_info --all|grep 'command line'
  Configure command line: '--prefix=/home/server/ompi4_xmem' 
'--with-xpmem=/opt/xpmm' '--enable-mpi-fortran' '--enable-mpi-cxx' 
'--enable-shared=yes' '--enable-static=yes' '--enable-mpi1-compatibility'
                          User-specified command line parameters passed to 
ROMIO's configure script
                          Complete set of command line parameters passed to 
ROMIO's configure script

Other System Config
----------------------------
- $ cat /etc/os-release
  NAME="Ubuntu"
  VERSION="20.04 LTS (Focal Fossa)"

- $ gcc -v
  gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

- DRAM: 1TB DDR4-3200 MT/s RDIMM memory

The recommended command line to run is:
mpirun --map-by core --rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench -owisdom -opatient -r1000 -s icf1000000
(Here, -opatient allows the best kernel/algorithm plan to be chosen,
 -r1000 runs the test for 1000 iterations to average out run-to-run variations, and
 -owisdom removes the first-time setup overhead when the mpirun command line is executed again.)

Please let me know if you need any other details to analyze this performance regression.

With Regards,
S. Biplab Raut

From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Sent: Friday, March 12, 2021 12:46 PM
To: Raut, S Biplab <biplab.r...@amd.com>
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

when running

mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000

I got similar flops with Open MPI 3.1.1, 3.1.6, 4.1.0 and 4.1.1rc1 on my system.

If you are using a different command line, please let me know and I will give it a try.

Cheers,

Gilles


On Fri, Mar 12, 2021 at 3:20 PM Raut, S Biplab <biplab.r...@amd.com> wrote:
Reposting here without the logs; it seems there is a message size limit of 150KB, so I could not attach them.
(I request the moderator to approve the original mail, which has the compressed logs attached.)

My main concern in moving from ompi3.1.1 to ompi4.1.0: why does ompi4.1.0 perform poorly compared to ompi3.1.1 for some test sizes?


I ran the FFTW MPI bench binary in verbose mode “10” (as suggested by Gilles) for the three cases below and confirmed that btl/vader is used by default.

The FFTW MPI test for a 1D problem size (1000000) is run on a single node as below:

mpirun --map-by core --rank-by core --bind-to core -np 128 <fftw/mpi/bench program binary> <program binary options for problem size 1000000>
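
(For anyone reproducing this: BTL selection can be confirmed by raising the BTL framework verbosity, for example along these lines; this is a sketch rather than the verbatim command used:
mpirun --mca btl_base_verbose 10 --map-by core --rank-by core --bind-to core -np 128 <fftw/mpi/bench program binary> <program binary options for problem size 1000000>)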



The three test cases are described below; the test run with openMPI3.1.1 performs best.

  1.  Test run on Ubuntu20.04 and stock openMPI3.1.1: gives mflops: 76978
  2.  Test run on Ubuntu20.04 and stock openMPI4.1.1: gives mflops: 56205
  3.  Test run on Ubuntu20.04 and openMPI4.1.1 configured with xpmem: gives mflops: 56411



Please check the mail chain below for more details.



P.S:

The FFTW MPI bench test binary can be compiled from the sources at https://github.com/amd/amd-fftw or https://github.com/FFTW/fftw3.


With Regards,
S. Biplab Raut

From: Raut, S Biplab
Sent: Thursday, March 11, 2021 5:45 PM
To: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Subject: RE: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

Oh okay, got you. Please check the details below.

$ ompi_info --all|grep 'command line'
  Configure command line: '--prefix=/home/amd/ompi4_plain' 
'--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' 
'--enable-static=yes' '--enable-mpi1-compatibility'
                          User-specified command line parameters passed to 
ROMIO's configure script
                          Complete set of command line parameters passed to 
ROMIO's configure script


For your other questions, please check my reply inline.

>>> did you have any chance to profile the benchmark to understand where the extra time is spent?
>>> (point to point? collective? communicator creation? other?)
The application binary uses point-to-point communication (isend and irecv with wait).
Please check the “perf report” hotspots below:
Overhead  Command    Shared Object           Symbol
  58.54%  mpi-bench  libopen-pal.so.40.30.0  [.] mca_btl_vader_component_progress
   4.59%  mpi-bench  libopen-pal.so.40.30.0  [.] mca_btl_vader_send
   4.43%  mpi-bench  libopen-pal.so.40.30.0  [.] mca_btl_vader_poll_handle_frag
   1.50%  mpi-bench  libmpi.so.40.30.0       [.] mca_pml_ob1_irecv
   1.33%  mpi-bench  libmpi.so.40.30.0       [.] mca_pml_ob1_isend
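
For context, this pattern is the standard nonblocking exchange. A minimal sketch of it in C (hypothetical buffer size and peer choice, assuming an even number of ranks; not the actual FFTW transpose code) would look like:

#include <mpi.h>
#include <stdlib.h>

/* Minimal isend/irecv/wait exchange sketch (illustrative only). */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int count = 1 << 20;                       /* hypothetical message size */
    double *sendbuf = calloc(count, sizeof(double));
    double *recvbuf = calloc(count, sizeof(double));
    int peer = rank ^ 1;                       /* pair up neighbouring ranks */

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* the "wait" in the description */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}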

>>> have you tried running well known benchmarks such as IMB or OSU?
>>> it would be interesting to understand where the significant differences are and the minimum number of MPI tasks required to exhibit them.
I have not run them. I wonder whether application developers like me really have to explore these benchmarks in order to design MPI-based applications.

With Regards,
S. Biplab Raut

From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Sent: Thursday, March 11, 2021 5:16 PM
To: Raut, S Biplab <biplab.r...@amd.com>
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

I am not aware of such performance issue.

can you post the output of

ompi_info --all | grep 'command line'




did you have any chance to profile the benchmark to understand where the extra time is spent?
(point to point? collective? communicator creation? other?)

have you tried running well known benchmarks such as IMB or OSU?
it would be interesting to understand where the significant differences are and the minimum number of MPI tasks required to exhibit them.

Cheers,

Gilles
On Thu, Mar 11, 2021 at 8:33 PM Raut, S Biplab <biplab.r...@amd.com> wrote:
Dear Gilles,
                   Running with “mpirun --mca coll ^han” does not change the 
performance much.

mpirun --map-by core --rank-by core --bind-to core -np 128 .libs/mpi-bench -owisdom -opatient -r1000 -s icf1000000
Problem: icf1000000, setup: 2.12 ms, time: 1.75 ms, ``mflops'': 56838
mpirun --mca coll ^han --map-by core --rank-by core --bind-to core -np 128 .libs/mpi-bench -owisdom -opatient -r1000 -s icf1000000
Problem: icf1000000, setup: 2.22 ms, time: 1.75 ms, ``mflops'': 57021

By the way, reiterating my original question: is there any known performance issue with openMPI4.x as compared to openMPI3.1.1?

P.S: Let me repost these numbers on the forum for others to comment.

With Regards,
S. Biplab Raut


> Dear Experts,
>                         Until recently, I was using openMPI3.1.1 to run a single-node 128-rank MPI application on Ubuntu18.04 and Ubuntu19.04.
> But now the OS on these machines has been upgraded to Ubuntu20.04, and I have been observing program hangs with the openMPI3.1.1 version.
> So, I tried the openMPI4.0.5 version: the program ran properly without any issues, but there is a performance regression in my application.
>
> Can I know the stable openMPI version recommended for Ubuntu20.04 that has no known regression compared to v3.1.1?
>
> With Regards,
> S. Biplab Raut
