John, as an aside, it is always worth running 'lstopo' from the hwloc
package to look at the layout of your CPUs, cores and caches.
It is getting a bit late now, so I apologise for being too lazy to boot up
my Pi to capture the output.
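
If you ever want the same information programmatically, here is a minimal,
untested sketch using the hwloc C API (assuming hwloc >= 2.0 with its
development headers installed; compile with something like
'gcc topo.c -lhwloc', where topo.c is just a placeholder file name):

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;

    /* Discover the topology of the machine we are running on. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    printf("cores: %d, hardware threads: %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE),
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

    /* Print one representative cache object per level of the hierarchy. */
    for (int depth = 0; depth < hwloc_topology_get_depth(topo); depth++) {
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topo, depth, 0);
        if (obj != NULL && hwloc_obj_type_is_cache(obj->type))
            printf("%s: %llu KB\n", hwloc_obj_type_string(obj->type),
                   (unsigned long long)(obj->attr->cache.size / 1024));
    }

    hwloc_topology_destroy(topo);
    return 0;
}

lstopo gives you the same picture (and more) in one command, so code like
this is only useful if you want to react to the layout from inside a program.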

On Wed, 22 Jul 2020 at 19:55, George Bosilca via users <
users@lists.open-mpi.org> wrote:

> John,
>
> There are many things in play in such an experiment. Plus, expecting
> linear speedup even at the node level is certainly overly optimistic.
>
> 1. A single-core experiment has the full memory bandwidth to itself, so you
> will asymptotically reach the max flops. Adding more cores increases the
> memory pressure, and at some point the memory will not be able to keep up
> and will become the limiting factor (not the computation capabilities of
> the cores).
>
> 2. The HPL communication pattern is composed of 3 types of messages: single
> elements in the panel (column) exchanged in an allreduce (to find the max),
> medium-sized messages (a decreasing multiple of NB as the computation
> progresses) for the swap operation, and finally some large messages of
> NB*NB*sizeof(elem) for the update. All this to say that CMA_SIZE_MBYTES=5
> should be more than enough for you (a rough size check follows below).
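>
> As a rough size check (assuming double-precision elements, so sizeof(elem)
> = 8 bytes, and the largest block size mentioned in this thread, NB = 256),
> the biggest update message is
>
>   256 * 256 * 8 bytes = 524288 bytes, i.e. roughly 0.5 MB,
>
> an order of magnitude below the 5 MB implied by CMA_SIZE_MBYTES=5.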
>
> Have fun,
>   George.
>
>
>
> On Wed, Jul 22, 2020 at 2:19 PM John Duffy via users <
> users@lists.open-mpi.org> wrote:
>
>> Hi Joseph, John
>>
>> Thank you for your replies.
>>
>> I’m using Ubuntu 20.04 aarch64 on an 8 x Raspberry Pi 4 cluster.
>>
>> The symptoms I’m experiencing are that the HPL Linpack performance in
>> Gflops increases on a single core as NB is increased from 32 to 256. The
>> theoretical maximum is 6 Gflops per core. I can achieve 4.8 Gflops, which I
>> think is a reasonable expectation. However, as I add more cores on a single
>> node (2, 3 and finally 4 cores), the scaling is nowhere near linear, and
>> performance tails off dramatically as NB is increased. I can achieve 15
>> Gflops on a single node of 4 cores, whereas the theoretical maximum is 24
>> Gflops per node.
>>
>> ompi_info suggests vader is available/working…
>>
>>                  MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>>                  MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>>                  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>>                  MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)
>>
>> I’m wondering whether the Ubuntu kernel setting CMA_SIZE_MBYTES=5 is
>> limiting Open MPI message number/size. So, I’m currently building a new
>> kernel with CMA_SIZE_MBYTES=16.
>>
>> I have attached 2 plots from my experiments…
>>
>> Plot 1 - shows an increase in Gflops for 1 core as NB increases, up to a
>> maximum value of 4.75 Gflops when NB = 240.
>>
>> Plot 2 - shows an increase in Gflops for 4 x cores (all on the same
>> node) as NB increases. The maximum Gflops achieved is 15 Gflops. I would
>> hope that rather than drop off dramatically at NB = 168, the performance
>> would trend upwards towards somewhere near 4 x 4.75 = 19 Gflops.
>>
>> This is why I’m wondering whether Open MPI messages via vader are being
>> hampered by a limiting CMA size.
>>
>> Let’s see what happens with my new kernel...
>>
>> Best regards
>>
>> John
>>
>>
>>
