Dear Gilles,
Please check my replies inline.

>>> Can you please post the output of
>>> ompi_info --param btl vader --level 3
>>> with both Open MPI 3.1 and 4.1?

openMPI3.1.1
------------------
$ ompi_info --param btl vader --level 3
MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.1)
MCA btl vader: ---------------------------------------------------
MCA btl vader: parameter "btl_vader_single_copy_mechanism" (current value: "cma", data source: default, level: 3 user/all, type: int)
               Single copy mechanism to use (defaults to best available)
               Valid values: 1:"cma", 3:"none"

openMPI4.1.0
------------------
$ ompi_info --param btl vader --level 3
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.0)
MCA btl vader: ---------------------------------------------------
MCA btl vader: parameter "btl_vader_single_copy_mechanism" (current value: "cma", data source: default, level: 3 user/all, type: int)
               Single copy mechanism to use (defaults to best available)
               Valid values: 1:"cma", 4:"emulated", 3:"none"
MCA btl vader: parameter "btl_vader_backing_directory" (current value: "/dev/shm", data source: default, level: 3 user/all, type: string)
               Directory to place backing files for shared memory communication. This directory should be on a local filesystem such as /tmp or /dev/shm (default: (linux) /dev/shm, (others) session directory)
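(Note: since both builds report "cma" as the default mechanism, the copy mechanism can also be forced at run time to rule it out as the cause of the gap. A possible comparison, reusing the benchmark invocation from this thread, would be along these lines:

$ mpirun --mca btl_vader_single_copy_mechanism cma --map-by core --bind-to core -np 128 ./mpi-bench ic1000000
$ mpirun --mca btl_vader_single_copy_mechanism none --map-by core --bind-to core -np 128 ./mpi-bench ic1000000

If the 3.1.1 vs 4.1.0 difference persists with the same mechanism forced on both versions, the regression is likely elsewhere in the vader send/progress path.)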
>>> What if you run with only 2 MPI ranks?
>>> do you observe similar performance differences between Open MPI 3.1 and 4.1?

When I run with only 2 MPI ranks, the performance regression is not significant.
openMPI3.1.1 gives MFLOPS: 11122
openMPI4.1.0 gives MFLOPS: 11041

With Regards,
S. Biplab Raut

From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Sent: Friday, March 12, 2021 7:07 PM
To: Raut, S Biplab <biplab.r...@amd.com>
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

Can you please post the output of
ompi_info --param btl vader --level 3
with both Open MPI 3.1 and 4.1?

What if you run with only 2 MPI ranks?
do you observe similar performance differences between Open MPI 3.1 and 4.1?

Cheers,

Gilles

On Fri, Mar 12, 2021 at 6:31 PM Raut, S Biplab <biplab.r...@amd.com> wrote:
Dear Gilles,
Thank you for the reply.

>>> when running
>>> mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
>>> I got similar flops with Open MPI 3.1.1, 3.1.6, 4.1.0 and 4.1.1rc1 on my system.
>>> If you are using a different command line, please let me know and I will give it a try.

Although the command line I normally use is different, I also ran with the above command line that you used. I still find that openMPI4.1.0 performs poorly compared to openMPI3.1.1. Please check the details below. I have also provided my system details in case they matter.

openMPI3.1.1
-------------------
$ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
Problem: ic1000000, setup: 552.20 ms, time: 1.33 ms, ``mflops'': 75143
$ ompi_info --all|grep 'command line'
Configure command line: '--prefix=/home/server/ompi3/gcc' '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' '--enable-static=yes' '--enable-mpi1-compatibility'
User-specified command line parameters passed to ROMIO's configure script
Complete set of command line parameters passed to ROMIO's configure script

openMPI4.1.0
-------------------
$ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
Problem: ic1000000, setup: 557.12 ms, time: 1.75 ms, ``mflops'': 57029
$ ompi_info --all|grep 'command line'
Configure command line: '--prefix=/home/server/ompi4_plain' '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' '--enable-static=yes' '--enable-mpi1-compatibility'
User-specified command line parameters passed to ROMIO's configure script
Complete set of command line parameters passed to ROMIO's configure script

openMPI4.1.0 + xpmem
--------------------------------
$ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: lib-daytonax-03
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------
Problem: ic1000000, setup: 559.55 ms, time: 1.77 ms, ``mflops'': 56280
$ ompi_info --all|grep 'command line'
Configure command line: '--prefix=/home/server/ompi4_xmem' '--with-xpmem=/opt/xpmm' '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' '--enable-static=yes' '--enable-mpi1-compatibility'
User-specified command line parameters passed to ROMIO's configure script
Complete set of command line parameters passed to ROMIO's configure script
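(The "No such file or directory" error suggests xpmem was not usable at run time on this node, so this build fell back to another mechanism anyway. A quick sanity check — assuming a typical xpmem installation; the device path can vary — would be:

$ lsmod | grep xpmem        # is the xpmem kernel module loaded?
$ ls -l /dev/xpmem          # does the xpmem device node exist and is it readable?

If the module is not loaded, the xpmem numbers above are effectively the same configuration as the plain 4.1.0 build.)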
Other System Config
----------------------------
* $ cat /etc/os-release
  NAME="Ubuntu"
  VERSION="20.04 LTS (Focal Fossa)"
* $ gcc -v
  gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
* DRAM: 1TB DDR4-3200 MT/s RDIMM memory

The recommended command line to run would be as below:
mpirun --map-by core --rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench -owisdom -opatient -r1000 -s icf1000000
(Here, -opatient allows the use of the best kernel/algorithm plan, -r1000 runs the test for 1000 iterations to avoid run-to-run variations, and -owisdom removes the first-time setup overhead/time when the mpirun command line is executed again.)

Please let me know if any other details are needed for you to analyze this performance regression.

With Regards,
S. Biplab Raut

From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Sent: Friday, March 12, 2021 12:46 PM
To: Raut, S Biplab <biplab.r...@amd.com>
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

when running
mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca btl vader,self ./mpi-bench ic1000000
I got similar flops with Open MPI 3.1.1, 3.1.6, 4.1.0 and 4.1.1rc1 on my system.
If you are using a different command line, please let me know and I will give it a try.

Cheers,

Gilles

On Fri, Mar 12, 2021 at 3:20 PM Raut, S Biplab <biplab.r...@amd.com> wrote:
Reposting here without the logs – it seems there is a message size limit of 150KB, so I could not attach the logs. (Request the moderator to approve the original mail that has the compressed logs attached.)

My main concern in moving from ompi3.1.1 to ompi4.1.0: why does ompi4.1.0 perform poorly compared to ompi3.1.1 for some test sizes?

I ran the FFTW MPI bench binary in verbose mode "10" (as suggested by Gilles) for the three cases below and confirmed that btl/vader is used by default.
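(For reference, the BTL selection can be confirmed via Open MPI's framework verbosity; the exact messages differ between releases, but something like the following shows which BTL components are selected:

$ mpirun --mca btl_base_verbose 100 --map-by core --bind-to core -np 128 ./mpi-bench ic1000000 2>&1 | grep -i vader
)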
The FFTW MPI test for a 1D problem size (1000000) is run on a single node as below:
mpirun --map-by core --rank-by core --bind-to core -np 128 <fftw/mpi/bench program binary> <program binary options for problem size 1000000>

The three test cases are described below; the run with openMPI3.1.1 performs best.
1. Test run on Ubuntu20.04 and stock openMPI3.1.1: gives mflops: 76978
2. Test run on Ubuntu20.04 and stock openMPI4.1.1: gives mflops: 56205
3. Test run on Ubuntu20.04 and openMPI4.1.1 configured with xpmem: gives mflops: 56411

Please check more details in the below mail chain.

P.S: The FFTW MPI bench test binary can be compiled from the sources at https://github.com/amd/amd-fftw or https://github.com/FFTW/fftw3.

With Regards,
S. Biplab Raut

From: Raut, S Biplab
Sent: Thursday, March 11, 2021 5:45 PM
To: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Subject: RE: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

Oh okay, got you. Please check the details below.

$ ompi_info --all|grep 'command line'
Configure command line: '--prefix=/home/amd/ompi4_plain' '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' '--enable-static=yes' '--enable-mpi1-compatibility'
User-specified command line parameters passed to ROMIO's configure script
Complete set of command line parameters passed to ROMIO's configure script

For your other questions, please check my replies inline.

>>> did you have any chance to profile the benchmark to understand where the extra time is spent?
>>> (point to point? collective? communicator creation? other?)

The application binary is using point-to-point communication – isend and irecv with wait. Please check the "perf report" hotspots below:

Overhead  Command    Shared Object           Symbol
 58.54%   mpi-bench  libopen-pal.so.40.30.0  [.] mca_btl_vader_component_progress
  4.59%   mpi-bench  libopen-pal.so.40.30.0  [.] mca_btl_vader_send
  4.43%   mpi-bench  libopen-pal.so.40.30.0  [.] mca_btl_vader_poll_handle_frag
  1.50%   mpi-bench  libmpi.so.40.30.0       [.] mca_pml_ob1_irecv
  1.33%   mpi-bench  libmpi.so.40.30.0       [.] mca_pml_ob1_isend

>>> have you tried running well known benchmarks such as IMB or OSU?
>>> it would be interesting to understand where the significant differences are and the minimum number of MPI tasks required to exhibit them.

I have not run them. I wonder whether application developers like me really have to explore these benchmarks to design MPI-based applications.

With Regards,
S. Biplab Raut

From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Sent: Thursday, March 11, 2021 5:16 PM
To: Raut, S Biplab <biplab.r...@amd.com>
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

I am not aware of such a performance issue.

can you post the output of
ompi_info --all | grep 'command line'

did you have any chance to profile the benchmark to understand where the extra time is spent?
(point to point? collective? communicator creation? other?)

have you tried running well known benchmarks such as IMB or OSU?
it would be interesting to understand where the significant differences are and the minimum number of MPI tasks required to exhibit them.

Cheers,

Gilles
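(For reference, a minimal OSU comparison between the two installs could look like the following, assuming the OSU micro-benchmarks are built against each Open MPI version; the binary paths are illustrative:

$ mpirun -np 2 --bind-to core ./osu_latency     # point-to-point latency across message sizes
$ mpirun -np 2 --bind-to core ./osu_bw          # point-to-point bandwidth
$ mpirun -np 128 --map-by core ./osu_alltoall   # a collective, to involve all 128 ranks

Running the same binaries under both 3.1.1 and 4.1.0 would show at which message sizes and rank counts the two versions diverge.)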
On Thu, Mar 11, 2021 at 8:33 PM Raut, S Biplab <biplab.r...@amd.com> wrote:
Dear Gilles,
Running with "mpirun --mca coll ^han" does not change the performance much.

mpirun --map-by core --rank-by core --bind-to core -np 128 .libs/mpi-bench -owisdom -opatient -r1000 -s icf1000000
Problem: icf1000000, setup: 2.12 ms, time: 1.75 ms, ``mflops'': 56838

mpirun --mca coll ^han --map-by core --rank-by core --bind-to core -np 128 .libs/mpi-bench -owisdom -opatient -r1000 -s icf1000000
Problem: icf1000000, setup: 2.22 ms, time: 1.75 ms, ``mflops'': 57021

By the way, reiterating my original question: is there any known performance issue with openMPI4.x as compared to openMPI3.1.1?

P.S: Let me repost these numbers on the forum for others to comment.

With Regards,
S. Biplab Raut

> Dear Experts,
>                Until recently, I was using openMPI3.1.1 to run a single-node, 128-rank MPI application on Ubuntu18.04 and Ubuntu19.04.
> But now the OS on these machines has been upgraded to Ubuntu20.04, and I have been observing program hangs with the openMPI3.1.1 version.
> So, I tried the openMPI4.0.5 version – the program ran properly without any issues, but there is a performance regression in my application.
>
> Can I know the stable openMPI version recommended for Ubuntu20.04 that has no known regression compared to v3.1.1?
>
> With Regards,
> S. Biplab Raut