Re: Improving spin-lock implementation on ARM.

2020-12-14 Thread Krunal Bauskar
Wondering if we can take this to completion (any idea what more we could do?). On Thu, 10 Dec 2020 at 14:48, Krunal Bauskar wrote: > > On Tue, 8 Dec 2020 at 14:33, Krunal Bauskar > wrote: > >> >> >> On Thu, 3 Dec 2020 at 21:32, Tom Lane wrote: >> >>> Krunal Bauskar writes: >>> > Any updates o

Re: Improving spin-lock implementation on ARM.

2020-12-10 Thread Krunal Bauskar
On Tue, 8 Dec 2020 at 14:33, Krunal Bauskar wrote: > > > On Thu, 3 Dec 2020 at 21:32, Tom Lane wrote: > >> Krunal Bauskar writes: >> > Any updates or further inputs on this. >> >> As far as LSE goes: my take is that tampering with the >> compiler/platform's default optimization options requires

Re: Improving spin-lock implementation on ARM.

2020-12-08 Thread Krunal Bauskar
On Thu, 3 Dec 2020 at 21:32, Tom Lane wrote: > Krunal Bauskar writes: > > Any updates or further inputs on this. > > As far as LSE goes: my take is that tampering with the > compiler/platform's default optimization options requires *very* > strong evidence, which we have not got and likely won't

Re: Improving spin-lock implementation on ARM.

2020-12-06 Thread Amit Khandekar
On Sat, 5 Dec 2020 at 02:55, Alexander Korotkov wrote: > > On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar wrote: > > Let me know what do you think about this analysis and any specific > > direction that we should consider to help move forward. > > BTW, it would be also nice to benchmark my lwlock

Re: Improving spin-lock implementation on ARM.

2020-12-04 Thread Alexander Korotkov
On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar wrote: > Let me know what do you think about this analysis and any specific direction > that we should consider to help move forward. BTW, it would be also nice to benchmark my lwlock patch on the Kunpeng. I'm very optimistic about this patch, but i

Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Alexander Korotkov
On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar wrote: > 1. CAS patch (applied on the baseline) >- Kunpeng: 10-45% improvement observed [1] >- Graviton2: 30-50% improvement observed [2] What does lower boundary of improvement mean? Does it mean minimal improvement observed? Obviously not,

Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Alexander Korotkov
On Thu, Dec 3, 2020 at 7:02 PM Tom Lane wrote: > From a system structural standpoint, I seriously dislike that lwlock.c > patch: putting machine-specific variant implementations into that file > seems like a disaster for maintainability. So it would need to show a > very significant gain across a

Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Tom Lane
Krunal Bauskar writes: > Any updates or further inputs on this. As far as LSE goes: my take is that tampering with the compiler/platform's default optimization options requires *very* strong evidence, which we have not got and likely won't get. Users who are building for specific hardware can ch

Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Krunal Bauskar
Any updates or further inputs on this. On Wed, 2 Dec 2020 at 09:27, Krunal Bauskar wrote: > > > On Tue, 1 Dec 2020 at 22:19, Tom Lane wrote: > >> Alexander Korotkov writes: >> > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar >> wrote: >> >> I would request you guys to re-think it from this per

Re: Improving spin-lock implementation on ARM.

2020-12-02 Thread Zidenberg, Tsahi
> On 01/12/2020, 19:08, "Alexander Korotkov" wrote: >BTW, what number of clients did you use? I can't find it in your message. Sure. Important params seem to be: Pgbench: Clients: 256 pgbench_jobs : 32 Scale: 1000 fill_factor: 90 Postgresql: shared buffers: 31GB max_connections: 1024 T

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 22:19, Tom Lane wrote: > Alexander Korotkov writes: > > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar > wrote: > >> I would request you guys to re-think it from this perspective to help > ensure that PGSQL can scale well on ARM. > >> s_lock becomes a top-most function and

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Zidenberg, Tsahi
> On 01/12/2020, 16:59, "Alexander Korotkov" wrote: > On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar wrote: > > FWIW, here is an earlier discussion on the same (also added the >> proposal author here) : Thanks for looping me in! >> >> > https://www.postgresql.org/message-id/flat/

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Alexander Korotkov
On Tue, Dec 1, 2020 at 7:57 PM Zidenberg, Tsahi wrote: > > On 01/12/2020, 16:59, "Alexander Korotkov" wrote: > > On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar > > wrote: > > > FWIW, here is an earlier discussion on the same (also added the > >> proposal author here) : > > Thanks for loop

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Tom Lane
Alexander Korotkov writes: > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar wrote: >> I would request you guys to re-think it from this perspective to help ensure >> that PGSQL can scale well on ARM. >> s_lock becomes a top-most function and LSE is not a universal solution but >> CAS surely help

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Alexander Korotkov
On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar wrote: > I would request you guys to re-think it from this perspective to help ensure > that PGSQL can scale well on ARM. > s_lock becomes a top-most function and LSE is not a universal solution but > CAS surely helps ease the main bottleneck. CAS p

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 20:25, Alexander Korotkov wrote: > On Tue, Dec 1, 2020 at 3:44 PM Krunal Bauskar > wrote: > > I have completed benchmarking with lse. > > > > Graph attached. > > Thank you for benchmarking. > > Now I agree with this comment by Tom Lane > > > In general, I'm pretty skeptical

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Alexander Korotkov
On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar wrote: > On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar wrote: > > What I meant was outline-atomics support was added in GCC-9.4 and was made > > default in gcc-10. > > LSE support is present for quite some time. > > FWIW, here is an earlier discussion

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Alexander Korotkov
On Tue, Dec 1, 2020 at 3:44 PM Krunal Bauskar wrote: > I have completed benchmarking with lse. > > Graph attached. Thank you for benchmarking. Now I agree with this comment by Tom Lane > In general, I'm pretty skeptical of *all* the results posted so far on > this thread, because everybody seem

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 02:16, Alexander Korotkov wrote: > On Mon, Nov 30, 2020 at 9:21 PM Tom Lane wrote: > > Alexander Korotkov writes: > > > I tend to think that LSE is enabled by default in Apple's clang based > > > on your previous message[1]. In order to dispel the doubts could you > > > p

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Amit Khandekar
On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar wrote: > What I meant was outline-atomics support was added in GCC-9.4 and was made > default in gcc-10. > LSE support is present for quite some time. FWIW, here is an earlier discussion on the same (also added the proposal author here) : https://www.

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 15:16, Alexander Korotkov wrote: > On Tue, Dec 1, 2020 at 6:26 AM Krunal Bauskar > wrote: > > On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov > wrote: > >> BTW, how do you get that required gcc version is 9.4? I've managed to > >> use LSE with gcc 9.3. > > > > Did they ba

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Alexander Korotkov
On Tue, Dec 1, 2020 at 9:01 AM Tom Lane wrote: > I did what I could in this department. It's late and I'm not going to > have time to run read/write benchmarks before bed, but here are some > results for the "pgbench -S" cases. I tried to match your testing > choices, but could not entirely: > >

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Alexander Korotkov
On Tue, Dec 1, 2020 at 6:26 AM Krunal Bauskar wrote: > On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov wrote: >> BTW, how do you get that required gcc version is 9.4? I've managed to >> use LSE with gcc 9.3. > > Did they backported it to 9.3? > I am just looking at the gcc guide. > https://gcc.g

Re: Improving spin-lock implementation on ARM.

2020-11-30 Thread Tom Lane
Alexander Korotkov writes: > 2) None of the patches considered in this thread give a clear > advantage for PostgreSQL built with LSE. Yeah, I think so. > To further confirm this let's wait for Kunpeng 920 tests by Krunal > Bauskar and Amit Khandekar. Also it would be nice if someone will run >

Re: Improving spin-lock implementation on ARM.

2020-11-30 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov wrote: > On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar > wrote: > > 3. Problem with GCC approach is still a lot of distro don't support gcc > 9.4 as default. > > To use this approach: > > * PGSQL will have to roll out its packages using gc

Re: Improving spin-lock implementation on ARM.

2020-11-30 Thread Alexander Korotkov
On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar wrote: > 3. Problem with GCC approach is still a lot of distro don't support gcc 9.4 > as default. > To use this approach: > * PGSQL will have to roll out its packages using gcc-9.4+ only so that > they are compatible with all aarch64 machin

Re: Improving spin-lock implementation on ARM.

2020-11-30 Thread Alexander Korotkov
On Mon, Nov 30, 2020 at 9:21 PM Tom Lane wrote: > Alexander Korotkov writes: > > I tend to think that LSE is enabled by default in Apple's clang based > > on your previous message[1]. In order to dispel the doubts could you > > please provide assembly of SpinLockAcquire for following clang > > o

Re: Improving spin-lock implementation on ARM.

2020-11-30 Thread Tom Lane
Alexander Korotkov writes: > I tend to think that LSE is enabled by default in Apple's clang based > on your previous message[1]. In order to dispel the doubts could you > please provide assembly of SpinLockAcquire for following clang > options. > "-O2" > "-O2 -march=armv8-a+lse" > "-O2 -march=ar

Re: Improving spin-lock implementation on ARM.

2020-11-30 Thread Alexander Korotkov
On Mon, Nov 30, 2020 at 9:08 AM Tom Lane wrote: > Krunal Bauskar writes: > > On Mon, 30 Nov 2020 at 10:14, Tom Lane wrote: > >> The results I posted at [1] seem to contradict this for Apple's new > >> machines. > > > For the results you saw on Mac-Mini was LSE enabled by default. > > Hmm, I don'

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Alexander Korotkov
On Mon, Nov 30, 2020 at 9:20 AM Krunal Bauskar wrote: > Some of us may be surprised by the fact that enabling lse is causing > regression (1816 -> 892 or 714 -> 610) with HEAD itself. > While lse is meant to improve the performance. This, unfortunately, is not > always the case at-least based on

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Krunal Bauskar
On Mon, 30 Nov 2020 at 11:38, Tom Lane wrote: > Krunal Bauskar writes: > > On Mon, 30 Nov 2020 at 10:14, Tom Lane wrote: > >> The results I posted at [1] seem to contradict this for Apple's new > >> machines. > > > For the results you saw on Mac-Mini was LSE enabled by default. > > Hmm, I don't

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Tom Lane
Krunal Bauskar writes: > On Mon, 30 Nov 2020 at 10:14, Tom Lane wrote: >> The results I posted at [1] seem to contradict this for Apple's new >> machines. > For the results you saw on Mac-Mini was LSE enabled by default. Hmm, I don't know how to get Apple's clang to admit what its default setti
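One way to see what a compiler's default settings are is to test, at compile time, the ACLE feature macro `__ARM_FEATURE_ATOMICS`, which GCC and clang define when LSE atomics are enabled for the target (for example with `-march=armv8-a+lse`). The sketch below is only an illustration and is not from the thread; the function name `demo_built_with_lse` is hypothetical.

```c
/*
 * Hypothetical helper, not part of PostgreSQL: reports whether this
 * translation unit was compiled with LSE atomics enabled.
 * __ARM_FEATURE_ATOMICS is the ACLE macro for the Armv8.1-A
 * Large System Extensions; it is absent on other targets.
 */
static int
demo_built_with_lse(void)
{
#ifdef __ARM_FEATURE_ATOMICS
    return 1;   /* LSE instructions (e.g. swpal, casa) may be emitted inline */
#else
    return 0;   /* expect ldxr/stxr loops or outline-atomics helper calls */
#endif
}
```

Building the same file once with plain `-O2` and once with `-O2 -march=armv8-a+lse` and comparing the result shows whether LSE is part of the compiler's default. Note that this reflects compile-time settings only; with outline atomics the choice can also be made at run time.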

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Krunal Bauskar
On Mon, 30 Nov 2020 at 10:14, Tom Lane wrote: > Krunal Bauskar writes: > > So given all the permutations and combinations, I think we could approach > > the problem as follows: > > > * Enable use of CAS as it is known to have optimal performance (vs TAS) > > The results I posted at [1] seem to c

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Tom Lane
Krunal Bauskar writes: > So given all the permutations and combinations, I think we could approach > the problem as follows: > * Enable use of CAS as it is known to have optimal performance (vs TAS) The results I posted at [1] seem to contradict this for Apple's new machines. In general, I'm pr

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Krunal Bauskar
On Sun, 29 Nov 2020 at 22:23, Alexander Korotkov wrote: > On Sat, Nov 28, 2020 at 1:31 PM Alexander Korotkov > wrote: > > I guess that might depend on the implementation of CAS and TAS. I bet > > usage of CAS in spinlock gives advantage when ldxr/stxr are used, but > > not when swpal/casa are u

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Alexander Korotkov
On Thu, Nov 26, 2020 at 7:35 AM Krunal Bauskar wrote: > * x86 uses optimized xchg operation. > ARM too started supporting it (using Large System Extension) with > ARM-v8.1 but since it not supported with ARM-v8, GCC default tends > to roll more generic load-store assembly code. > > * gcc-9.4

Re: Improving spin-lock implementation on ARM.

2020-11-28 Thread Alexander Korotkov
On Sat, Nov 28, 2020 at 5:36 AM Tom Lane wrote: > So at least on Apple's hardware, it seems like the CAS > implementation might be a shade faster when uncontended, > but it's very clearly worse when there is contention for > the spinlock. That's interesting, because the argument > that CAS should

Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Tom Lane
I wrote: > It might be that this hardware is capable of showing a difference with a > better-tuned pgbench test, but with an untuned pgbench run, we just aren't > sufficiently sensitive to the spinlock properties. (Which I guess is good > news, really.) It occurred to me that if we don't insist o

Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Tom Lane
Peter Eisentraut writes: > I tried this on a M1 MacBook Air. I cannot reproduce these results. > The unpatched numbers are about in the neighborhood of what you showed, > but the patched numbers are only about a few percent better, not the > 1.5x or 2x change that you showed. After redoing th

Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Peter Eisentraut
On 2020-11-26 23:55, Tom Lane wrote: ... and, after retrieving my jaw from the floor, I present the attached. Apple's chips evidently like this style of spinlock a LOT better. The difference is so remarkable that I wonder if I made a mistake somewhere. Can anyone else replicate these results?

Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Alexander Korotkov
On Fri, Nov 27, 2020 at 11:55 AM Michael Paquier wrote: > Not planning to buy one here, anything I have read on that tells that > it is worth a performance study. Another interesting area for experiments is AWS graviton2 instances. Specification says it supports arm v8.2, so it should have swpal/

Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Michael Paquier
On Fri, Nov 27, 2020 at 02:50:30AM -0500, Tom Lane wrote: > Yeah, that wasn't making sense to me either. The most likely explanation > seems to be that I messed up the test somehow ... but I don't see where. > So, again, I'm wondering if anyone else can replicate or refute this. I do find your re

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Tom Lane
Alexander Korotkov writes: > On Fri, Nov 27, 2020 at 1:55 AM Tom Lane wrote: >> ... and, after retrieving my jaw from the floor, I present the >> attached. Apple's chips evidently like this style of spinlock a LOT >> better. The difference is so remarkable that I wonder if I made a >> mistake s

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Alexander Korotkov
On Fri, Nov 27, 2020 at 2:20 AM Tom Lane wrote: > Alexander Korotkov writes: > > On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas wrote: > >> Is there some official ARM documentation, like a programmer's reference > >> manual or something like that, that would show a reference > >> implementat

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Alexander Korotkov
On Fri, Nov 27, 2020 at 1:55 AM Tom Lane wrote: > > Krunal Bauskar writes: > > On Thu, 26 Nov 2020 at 10:50, Tom Lane wrote: > >> Also, exactly what hardware/software platform were these curves > >> obtained on? > > > Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for > >

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Tom Lane
Alexander Korotkov writes: > On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas wrote: >> Is there some official ARM documentation, like a programmer's reference >> manual or something like that, that would show a reference >> implementation of a spinlock on ARM? It would be good to refer to an >

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Tom Lane
Krunal Bauskar writes: > On Thu, 26 Nov 2020 at 10:50, Tom Lane wrote: >> Also, exactly what hardware/software platform were these curves >> obtained on? > Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for > server and 8 for client) [2 numa nodes] > Storage: 3.2 TB NVMe

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Krunal Bauskar
On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas wrote: > On 26/11/2020 06:30, Krunal Bauskar wrote: > > Improving spin-lock implementation on ARM. > > > > > > * Spin-Lock is known to have a signif

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Alexander Korotkov
On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas wrote: > On 26/11/2020 06:30, Krunal Bauskar wrote: > > Improving spin-lock implementation on ARM. > > > > > > * Spin-Lock is known to have a signif

Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Heikki Linnakangas
On 26/11/2020 06:30, Krunal Bauskar wrote: Improving spin-lock implementation on ARM. * Spin-Lock is known to have a significant effect on performance   with increasing scalability. * Existing Spin-Lock implementation for ARM is sub

Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Amit Khandekar
On Thu, 26 Nov 2020 at 10:55, Krunal Bauskar wrote: > Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for > server and 8 for client) [2 numa nodes] > Storage: 3.2 TB NVMe SSD > OS: CentOS Linux release 7.6 > PGSQL: baseline = Release Tag 13.1 > Invocation suite: > https://

Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
On Thu, 26 Nov 2020 at 10:50, Tom Lane wrote: > Michael Paquier writes: > > On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote: > >> (Thanks to Amit Khandekar for rigorously performance testing this patch > >> with different combinations). > > > For the simple-update and tpcb-like gr

Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Tom Lane
Michael Paquier writes: > On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote: >> (Thanks to Amit Khandekar for rigorously performance testing this patch >> with different combinations). > For the simple-update and tpcb-like graphs, do you have any actual > numbers to share between 128

Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
scalability   baseline           patched
              update    tpcb     update    tpcb
128           107932    78554    108081    78569
2568

Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Michael Paquier
On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote: > (Thanks to Amit Khandekar for rigorously performance testing this patch > with different combinations). For the simple-update and tpcb-like graphs, do you have any actual numbers to share between 128 and 1024 connections? The blue

Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
Improving spin-lock implementation on ARM. * Spin-Lock is known to have a significant effect on performance with increasing scalability. * Existing Spin-Lock implementation for ARM is sub-optimal due to use of TAS (test and swap
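For readers unfamiliar with the two acquisition styles debated throughout this thread, the following is a minimal sketch using GCC/clang builtins. It is not the patch under discussion: the type `demo_slock_t` and the functions `demo_tas`, `demo_cas` and `demo_unlock` are hypothetical names chosen for illustration, and the memory orderings shown are an assumption.

```c
#include <stdbool.h>

/* Hypothetical lock word for illustration: 0 = free, 1 = held. */
typedef int demo_slock_t;

/*
 * TAS style: unconditionally swap 1 into the lock word and return the
 * previous value. Nonzero means the lock was already held.
 */
static inline int
demo_tas(volatile demo_slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

/*
 * CAS style: store 1 only if the lock word is currently 0.
 * Returns 0 on success and nonzero if the lock was already held,
 * mirroring the TAS convention above.
 */
static inline int
demo_cas(volatile demo_slock_t *lock)
{
    demo_slock_t expected = 0;

    return !__atomic_compare_exchange_n(lock, &expected, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
}

/* Release the lock by writing 0 with release semantics. */
static inline void
demo_unlock(volatile demo_slock_t *lock)
{
    __sync_lock_release(lock);
}
```

Which instructions these builtins compile to depends on the target options: without LSE they typically become ldxr/stxr retry loops, and with LSE they can become single instructions such as swpal or casa. That difference is why the benchmark results in the thread vary by hardware and build flags.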