On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar <krunalbaus...@gmail.com> wrote: > 1. CAS patch (applied on the baseline) > - Kunpeng: 10-45% improvement observed [1] > - Graviton2: 30-50% improvement observed [2]
What does lower boundary of improvement mean? Does it mean minimal improvement observed? Obviously not, because there is no improvement with a low number of clients or read-only benchmark. > - M1: Only select results are available cas continue to maintain a > marginal gain but not significant. [3] Read-only benchmark doesn't involve spinlocks (I've just rechecked this). So, this difference is purely speculative. Also, Tom observed the regression [1]. The benchmark is artificial, but it may correspond to some real workload with heavily-loaded spinlocks. And that might have an explanation. ldrex/strex themselves don't work as memory barrier (this is why compiler adds explicit memory barrier afterwards). And I bet ldrex unpaired with strex could see an outdated value. On high-contended spinlocks that may cause too pessimistic waits. > 2. Let's ignore CAS for a sec and just think of LSE independently > - Kunpeng: regression observed Yeah, it's sad that there are ARM chips, where making the same action in a single instruction is slower than the same actions in multiple instructions. > - Graviton2: gain observed Have to say it's 5.5x improvement, which is dramatically more than CAS-loop patch can give. > - M1: regression observed > [while lse probably is default explicitly enabling it with +lse causes > regression on the head itself [4]. > client=2/4: 1816/714 ---- vs ---- 892/610] This is plain false. LSE was enabled in all the versions tested in M1 [2]. So not LSE instructions or +lse option caused the regression, but lack of other options enabled by default in Apple's clang. > Let me know what do you think about this analysis and any specific direction > that we should consider to help move forward. In order to move forward with ARM-optimized spinlocks I would investigate other options (at least [3]). Regarding CAS spinlock patch I can propose the following steps to clarify the things. 1. Run Tom's spinlock test on ARM machines other than M1. 2. Try to construct a pgbench script, which produces heavily-loaded spinlock without custom C-function, and also run it across various ARM machines. Links 1. https://www.postgresql.org/message-id/741389.1606530957%40sss.pgh.pa.us 2. https://www.postgresql.org/message-id/1274781.1606760475%40sss.pgh.pa.us 3. https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html ------ Regards, Alexander Korotkov