Re: Improving spin-lock implementation on ARM.

Alexander Korotkov Thu, 03 Dec 2020 12:01:39 -0800

On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar <krunalbaus...@gmail.com> wrote:
> 1. CAS patch (applied on the baseline)
>    - Kunpeng: 10-45% improvement observed [1]
>    - Graviton2: 30-50% improvement observed [2]


What does lower boundary of improvement mean?  Does it mean minimal
improvement observed?  Obviously not, because there is no improvement
with a low number of clients or read-only benchmark.

>    - M1: Only select results are available cas continue to maintain a 
> marginal gain but not significant. [3]

Read-only benchmark doesn't involve spinlocks (I've just rechecked
this).  So, this difference is purely speculative.

Also, Tom observed the regression [1].  The benchmark is artificial,
but it may correspond to some real workload with heavily-loaded
spinlocks.  And that might have an explanation.  ldrex/strex
themselves don't work as memory barrier (this is why compiler adds
explicit memory barrier afterwards).  And I bet ldrex unpaired with
strex could see an outdated value.  On high-contended spinlocks that
may cause too pessimistic waits.

> 2. Let's ignore CAS for a sec and just think of LSE independently
>    - Kunpeng: regression observed

Yeah, it's sad that there are ARM chips, where making the same action
in a single instruction is slower than the same actions in multiple
instructions.

>    - Graviton2: gain observed

Have to say it's 5.5x improvement, which is dramatically more than
CAS-loop patch can give.

>    - M1: regression observed
>      [while lse probably is default explicitly enabling it with +lse causes 
> regression on the head itself [4].
>       client=2/4: 1816/714 ---- vs  ---- 892/610]

This is plain false.  LSE was enabled in all the versions tested in M1
[2].  So not LSE instructions or +lse option caused the regression,
but lack of other options enabled by default in Apple's clang.

> Let me know what do you think about this analysis and any specific direction 
> that we should consider to help move forward.

In order to move forward with ARM-optimized spinlocks I would
investigate other options (at least [3]).

Regarding CAS spinlock patch I can propose the following steps to
clarify the things.
1. Run Tom's spinlock test on ARM machines other than M1.
2. Try to construct a pgbench script, which produces heavily-loaded
spinlock without custom C-function, and also run it across various ARM
machines.

Links
1. https://www.postgresql.org/message-id/741389.1606530957%40sss.pgh.pa.us
2. https://www.postgresql.org/message-id/1274781.1606760475%40sss.pgh.pa.us
3. 
https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html

------
Regards,
Alexander Korotkov

Re: Improving spin-lock implementation on ARM.

Reply via email to