I have seen in the OSU INAM paper: "While we chose MVAPICH2 for implementing our designs, any MPI runtime (e.g.: OpenMPI [12]) can be modified to perform similar data collection and transmission."
But I do not know what is meant by a "modified" openMPI?

Cheers,
Denis

________________________________
From: Joseph Schuchart <schuch...@icl.utk.edu>
Sent: Friday, February 11, 2022 3:02:36 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work
with other MPI implementations? Would be worth investigating...

Joseph

On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>
> Hi Joseph
>
> Looking at MVAPICH, I noticed that this MPI implementation provides an
> InfiniBand network analysis and profiling tool:
>
> OSU-INAM
>
> Is there something equivalent using openMPI?
>
> Best
> Denis
>
> ------------------------------------------------------------------------
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Joseph
> Schuchart via users <users@lists.open-mpi.org>
> *Sent:* Tuesday, February 8, 2022 4:02:53 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Joseph Schuchart
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
>
> Hi Denis,
>
> Sorry if I missed it in your previous messages, but could you also try
> running a different MPI implementation (MVAPICH) to see whether Open MPI
> is at fault or the system is somehow to blame for it?
>
> Thanks
> Joseph
>
> On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
> >
> > Hi
> >
> > Thanks for all this information!
> >
> > But I have to confess that I got somewhat lost in this
> > multi-parameter tuning space.
> >
> > Furthermore, it sometimes mixes user space and kernel space, and I
> > only have the possibility to act on the user space.
> >
> > 1) So for max locked memory I have on the system:
> >
> > - ulimit -l unlimited (the default)
> >
> > and I do not see any warnings/errors related to that when launching MPI.
> >
> > 2) I tried different algorithms for the MPI_Allreduce op., all showing
> > a drop in bandwidth at size=16384.
> >
> > 3) I disabled openib (no RDMA) and used only TCP, and I noticed
> > the same behaviour.
> >
> > 4) I realized that increasing the so-called warm-up parameter in the
> > OSU benchmark (argument -x, 200 by default) reduces the discrepancy.
> > On the contrary, a lower setting (-x 10) can increase this BW
> > discrepancy by up to a factor of 300 at message size 16384 compared to
> > message size 8192, for example.
> >
> > So does that mean there are some caching effects
> > in the internode communication?
> >
> > From my experience, tuning parameters is a time-consuming and
> > cumbersome task.
> >
> > Could it also be that the problem is not really in the openMPI
> > implementation but in the system?
> >
> > Best
> > Denis
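As a concrete sketch, the experiments described in points 2)-4) above map onto command lines roughly like the following (the hostfile name, node count and benchmark path are placeholders, not taken from the thread; the MCA parameters and OSU options are the standard ones for Open MPI 3.x and OSU micro-benchmarks 5.x):

  # Try a specific allreduce algorithm (the valid values are listed
  # further down in this thread; 4 = ring is just an example choice):
  mpirun --hostfile hosts -np 200 --map-by node \
         --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_allreduce_algorithm 4 \
         ./osu_allreduce -f

  # Force plain TCP instead of the InfiniBand (openib) BTL:
  mpirun --hostfile hosts -np 200 --map-by node \
         --mca btl tcp,self \
         ./osu_allreduce -f

  # Vary the OSU warm-up iterations (-x) and measured iterations (-i)
  # to check whether the spike at 16384 bytes depends on warm-up/caching:
  mpirun --hostfile hosts -np 200 --map-by node ./osu_allreduce -x 10  -i 1000 -f
  mpirun --hostfile hosts -np 200 --map-by node ./osu_allreduce -x 200 -i 1000 -f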
> >
> > ------------------------------------------------------------------------
> > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Gus
> > Correa via users <users@lists.open-mpi.org>
> > *Sent:* Monday, February 7, 2022 9:14:19 PM
> > *To:* Open MPI Users
> > *Cc:* Gus Correa
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> > Infiniband network
> >
> > This may have changed since, but these used to be relevant points.
> > Overall, the Open MPI FAQ has lots of good suggestions:
> > https://www.open-mpi.org/faq/
> > some specific to performance tuning:
> > https://www.open-mpi.org/faq/?category=tuning
> > https://www.open-mpi.org/faq/?category=openfabrics
> >
> > 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> > available in compute nodes:
> >
> > mpirun --mca btl self,sm,openib ...
> >
> > https://www.open-mpi.org/faq/?category=tuning#selecting-components
> >
> > However, this may have changed lately:
> > https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> >
> > 2) Maximum locked memory used by IB and its system limit. Start here:
> > https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
> >
> > 3) The eager vs. rendezvous message size threshold. I wonder if it may
> > sit right where you see the latency spike.
> > https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> >
> > 4) Processor and memory locality/affinity and binding (please check
> > the current options and syntax):
> > https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
> >
> > On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
> > <users@lists.open-mpi.org> wrote:
> >
> > Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
> >
> > mpirun --verbose --display-map
> >
> > Have you tried newer OpenMPI versions?
> >
> > Do you get similar behavior for the osu_reduce and osu_gather
> > benchmarks?
> >
> > Typically internal buffer sizes as well as your hardware will affect
> > performance. Can you give specifications similar to what is
> > available at:
> > http://mvapich.cse.ohio-state.edu/performance/collectives/
> > where the operating system, switch, node type and memory are
> > indicated.
> >
> > If you need good performance, you may want to also specify the
> > algorithm used. You can find some of the parameters you can tune
> > using [1]:
> >
> > ompi_info --all
> >
> > A particularly helpful parameter is:
> >
> > MCA coll tuned: parameter "coll_tuned_allreduce_algorithm"
> >     (current value: "ignore", data source: default,
> >     level: 5 tuner/detail, type: int)
> >     Which allreduce algorithm is used. Can be locked down to any of:
> >     0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned
> >     bcast), 3 recursive doubling, 4 ring, 5 segmented ring
> >     Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping",
> >     3:"recursive_doubling", 4:"ring", 5:"segmented_ring",
> >     6:"rabenseifner"
> > MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize"
> >     (current value: "0", data source: default,
> >     level: 5 tuner/detail, type: int)
> >
> > For OpenMPI 4.0, there is a tuning program [2] that might also be
> > helpful.
> >
> > [1] https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
> > [2] https://github.com/open-mpi/ompi-collectives-tuning
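To tie the two suggestions above together (the eager/rendezvous threshold and the allreduce algorithm selection), the relevant MCA parameters can be inspected and overridden at run time; a minimal sketch, assuming the openib BTL is in use (parameter names should be confirmed with ompi_info on the actual installation):

  # Show the tuned-collective parameters, including coll_tuned_allreduce_algorithm:
  ompi_info --param coll tuned --level 9

  # Show the openib BTL limits; the eager/rendezvous switch is controlled
  # by parameters such as btl_openib_eager_limit:
  ompi_info --param btl openib --level 9 | grep -i eager

  # Lock allreduce to one algorithm for a run (3 = recursive doubling here):
  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_allreduce_algorithm 3 \
         ./osu_allreduce -f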
> >
> > On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> > > Hi
> > >
> > > When I repeat, I always get the huge discrepancy at the
> > > message size of 16384.
> > >
> > > Maybe there is a way to run MPI in verbose mode in order
> > > to further investigate this behaviour?
> > >
> > > Best
> > > Denis
> > >
> > > ------------------------------------------------------------------------
> > > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Benson
> > > Muite via users <users@lists.open-mpi.org>
> > > *Sent:* Monday, February 7, 2022 2:27:34 PM
> > > *To:* users@lists.open-mpi.org
> > > *Cc:* Benson Muite
> > > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> > > network
> > >
> > > Hi,
> > > Do you get similar results when you repeat the test? Another job could
> > > have interfered with your run.
> > > Benson
> > >
> > > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> > >> Hi
> > >>
> > >> I am using the OSU microbenchmarks compiled with openMPI 3.1.6 in
> > >> order to check/benchmark the InfiniBand network for our cluster.
> > >>
> > >> For that I use the collective all_reduce benchmark and run over 200
> > >> nodes, using 1 process per node.
> > >>
> > >> And these are the results I obtained 😎
> > >>
> > >> ################################################################
> > >>
> > >> # OSU MPI Allreduce Latency Test v5.7.1
> > >> # Size     Avg Latency(us)  Min Latency(us)  Max Latency(us)  Iterations
> > >> 4                   114.65            83.22           147.98        1000
> > >> 8                   133.85           106.47           164.93        1000
> > >> 16                  116.41            87.57           150.58        1000
> > >> 32                  112.17            93.25           130.23        1000
> > >> 64                  106.85            81.93           134.74        1000
> > >> 128                 117.53            87.50           152.27        1000
> > >> 256                 143.08           115.63           173.97        1000
> > >> 512                 130.34           100.20           167.56        1000
> > >> 1024                155.67           111.29           188.20        1000
> > >> 2048                151.82           116.03           198.19        1000
> > >> 4096                159.11           122.09           199.24        1000
> > >> 8192                176.74           143.54           221.98        1000
> > >> 16384             48862.85         39270.21         54970.96        1000
> > >> 32768              2737.37          2614.60          2802.68        1000
> > >> 65536              2723.15          2585.62          2813.65        1000
> > >>
> > >> ####################################################################
> > >>
> > >> Could someone explain to me what is happening at message size 16384?
> > >> One can notice a huge latency (~300 times larger) compared to message
> > >> size 8192.
> > >> I do not really understand what could create such an increase in the
> > >> latency.
> > >> The reason I use the OSU microbenchmarks is that we sporadically
> > >> experience a drop in the bandwidth for typical collective operations
> > >> such as MPI_Reduce in our cluster, which is difficult to understand.
> > >> I would be grateful if somebody could share their expertise on such a
> > >> problem with me.
> > >>
> > >> Best,
> > >> Denis
> > >>
> > >> ---------
> > >> Denis Bertini
> > >> Abteilung: CIT
> > >> Ort: SB3 2.265a
> > >>
> > >> Tel: +49 6159 71 2240
> > >> Fax: +49 6159 71 2986
> > >> E-Mail: d.bert...@gsi.de
> > >>
> > >> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> > >> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de <http://www.gsi.de>
> > >>
> > >> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
> > >> Managing Directors / Geschäftsführung:
> > >> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> > >> Chairman of the GSI Supervisory Board / Vorsitzender des
> > >> GSI-Aufsichtsrats:
> > >> Ministerialdirigent Dr. Volkmar Dietz
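For reference, a run of the shape shown in the table above (200 nodes, 1 rank per node, full min/avg/max output, 1000 iterations) can be reproduced with something like the following sketch; the hostfile name and the path to the osu_allreduce binary are placeholders, and the warm-up/iteration counts are only the values discussed in this thread:

  # 200 nodes, 1 process per node, full-format output as in the table:
  mpirun --hostfile hosts -np 200 --map-by node \
         ./osu_allreduce -f -i 1000 -x 200

  # As suggested earlier in the thread, --display-map (and --report-bindings)
  # shows how the ranks were placed across the nodes:
  mpirun --hostfile hosts -np 200 --map-by node --display-map --report-bindings \
         ./osu_allreduce -f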