John, as an aside it is always worth running 'lstopo' from the hwloc package to look at the layout of your CPUs, cores and caches. It is getting a bit late now, so I apologise for being too lazy to boot up my Pi to capture the output.
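If you want to try it yourself before I get to mine, a rough sketch of what I would run over SSH (assuming the Ubuntu package names; hwloc-nox is the headless variant, and the output filename topo.xml is just an example):

    sudo apt-get install hwloc      # provides lstopo (hwloc-nox for the no-X build)
    lstopo                          # falls back to a text summary when no display is available
    lstopo topo.xml                 # optionally save the topology; format is inferred from the extension

On the Pi 4 you should see all four Cortex-A72 cores hanging off a single shared L2, which is one more reason not to expect linear scaling within a node.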
On Wed, 22 Jul 2020 at 19:55, George Bosilca via users <users@lists.open-mpi.org> wrote:

> John,
>
> There are many things in play in such an experiment. Plus, expecting
> linear speedup even at the node level is certainly overly optimistic.
>
> 1. A single-core experiment has full memory bandwidth, so you will
> asymptotically reach the max flops. Adding more cores will increase the
> memory pressure, and at some point the memory will not be able to deliver
> and will become the limiting factor (not the computation capabilities of
> the cores).
>
> 2. The HPL communication pattern is composed of 3 types of messages: 1 element
> in the panel (column) in the context of an allreduce (to find the max),
> medium-size messages (a decreasing multiple of NB as you progress in the
> computation) for the swap operation, and finally some large messages of
> NB*NB*sizeof(elem) for the update. All this to say that CMA_SIZE_MBYTES=5
> should be more than enough for you.
>
> Have fun,
>   George.
>
>
> On Wed, Jul 22, 2020 at 2:19 PM John Duffy via users <users@lists.open-mpi.org> wrote:
>
>> Hi Joseph, John
>>
>> Thank you for your replies.
>>
>> I’m using Ubuntu 20.04 aarch64 on an 8 x Raspberry Pi 4 cluster.
>>
>> The symptoms I’m experiencing are that the HPL Linpack performance in
>> Gflops increases on a single core as NB is increased from 32 to 256. The
>> theoretical maximum is 6 Gflops per core. I can achieve 4.8 Gflops, which I
>> think is a reasonable expectation. However, as I add more cores on a single
>> node, 2, 3 and finally 4 cores, the performance scaling is nowhere near
>> linear, and it tails off dramatically as NB is increased. I can achieve 15
>> Gflops on a single node of 4 cores, whereas the theoretical maximum is 24
>> Gflops per node.
>>
>> ompi_info suggests vader is available/working…
>>
>> MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>> MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>> MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>> MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>>
>> I’m wondering whether the Ubuntu kernel setting CMA_SIZE_MBYTES=5 is limiting
>> the Open MPI message number/size. So, I’m currently building a new kernel with
>> CMA_SIZE_MBYTES=16.
>>
>> I have attached 2 plots from my experiments…
>>
>> Plot 1 - shows an increase in Gflops for 1 core as NB increases, up to a
>> maximum value of 4.75 Gflops when NB = 240.
>>
>> Plot 2 - shows an increase in Gflops for 4 cores (all on the same node)
>> as NB increases. The maximum achieved is 15 Gflops. I would
>> hope that rather than drop off dramatically at NB = 168, the performance
>> would trend upwards towards somewhere near 4 x 4.75 = 19 Gflops.
>>
>> This is why I am wondering whether Open MPI messages via vader are being
>> hampered by a limiting CMA size.
>>
>> Let’s see what happens with my new kernel...
>>
>> Best regards
>>
>> John
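P.S. Regarding the CMA question in the quoted thread: before rebuilding the kernel it may be worth confirming how much CMA the running kernel actually reserved and which single-copy mechanism vader has selected. A rough sketch of the checks I would run, assuming Open MPI 4.0.x (where the parameter is btl_vader_single_copy_mechanism), a stock Ubuntu kernel config under /boot, and the default HPL binary name xhpl:

    dmesg | grep -i "cma:"                                     # boot message showing the reserved CMA size
    grep CMA_SIZE /boot/config-$(uname -r)                     # CMA_SIZE_MBYTES the kernel was built with
    ompi_info --param btl vader --level 9 | grep single_copy   # mechanism vader will use (cma, emulated, none, ...)
    mpirun -np 4 --mca btl vader,self \
           --mca btl_vader_single_copy_mechanism cma ./xhpl    # assumes the HPL binary is named xhpl

For what it is worth, George's sizing argument holds up: even at NB = 240 the largest update messages are about NB*NB*8 bytes ≈ 460 KB, comfortably under a 5 MB limit.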