Re: Improving spin-lock implementation on ARM.

2020-11-30 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov 
wrote:

> On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar 
> wrote:
> > 3. Problem with GCC approach is still a lot of distro don't support gcc
> 9.4 as default.
> > To use this approach:
> > * PGSQL will have to roll out its packages using gcc-9.4+ only so
> that they are compatible with all aarch64 machines
> > * but this continues to affect all other users who tend to build
> pgsql using standard distro based compiler. (unless they upgrade compiler).
>
> I think if a user, who runs PostgreSQL on a multicore machine with
> high-concurrent load, can take care about installing the appropriate
> package/using the appropriate compiler (especially if we publish
> explicit instructions for that).  In the same way such advanced users
> tune Linux kernel etc.
>
> BTW, how do you get that required gcc version is 9.4?  I've managed to
> use LSE with gcc 9.3.
>

Did they backported it to 9.3?
I am just looking at the gcc guide.
https://gcc.gnu.org/gcc-9/changes.html

GCC 9.4Target Specific ChangesAArch64

   - The option -moutline-atomics has been added to aid deployment of the
   Large System Extensions (LSE)



>
> --
> Regards,
> Alexander Korotkov
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 15:16, Alexander Korotkov 
wrote:

> On Tue, Dec 1, 2020 at 6:26 AM Krunal Bauskar 
> wrote:
> > On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov 
> wrote:
> >> BTW, how do you get that required gcc version is 9.4?  I've managed to
> >> use LSE with gcc 9.3.
> >
> > Did they backported it to 9.3?
> > I am just looking at the gcc guide.
> > https://gcc.gnu.org/gcc-9/changes.html
> >
> > GCC 9.4
> >
> > Target Specific Changes
> >
> > AArch64
> >
> > The option -moutline-atomics has been added to aid deployment of the
> Large System Extensions (LSE)
>
> No, you've misread this release notes item.  This item relates to a
> new option for LSE support.  See the rest of the description.
>

Some confusion here.

What I meant was outline-atomics support was added in GCC-9.4 and was made
default in gcc-10.
LSE support is present for quite some time.

In one of my up-thread reply  I mentioned about the outline atomic that
would help us enable lse if underline architecture support it.


2. As you pointed out LSE is enabled starting only with arm-v8.1 but not
all aarch64 tag machines are arm-v8.1 compatible.
This means we would need a separate package or a more optimal way would
be to compile pgsql with gcc-9.4 (or gcc-10.x (default)) with
-moutline-atomics that would emit both traditional and lse code and
flow would dynamically select depending on the target machine
(I have blogged about it in MySQL context
https://mysqlonarm.github.io/ARM-LSE-and-MySQL/)


Problem with this approach is it would need gcc-9.4+ (I also ready
gcc-9.3.1 would do) or gcc10.x but most of the distro doesn't support it so
user using old compiler especially
default one shipped with OS would not be able to take advantage of this.


>
> "When the option is specified code is emitted to detect the presence
> of LSE instructions at runtime and use them for standard atomic
> operations. For more information please refer to the documentation."
>
> LSE support itself was introduced in gcc 6.
> https://gcc.gnu.org/gcc-6/changes.html
>  * The ARMv8.1-A architecture and the Large System Extensions are now
> supported. They can be used by specifying the -march=armv8.1-a option.
> Additionally, the +lse option extension can be used in a similar
> fashion to other option extensions. The Large System Extensions
> introduce new instructions that are used in the implementation of
> atomic operations.
>
> But, -moutline-atomics looks very interesting itself.  I think we
> should add this option to our arm64 packages if we haven't already.
>
> --
> Regards,
> Alexander Korotkov
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 02:16, Alexander Korotkov 
wrote:

> On Mon, Nov 30, 2020 at 9:21 PM Tom Lane  wrote:
> > Alexander Korotkov  writes:
> > > I tend to think that LSE is enabled by default in Apple's clang based
> > > on your previous message[1].  In order to dispel the doubts could you
> > > please provide assembly of SpinLockAcquire for following clang
> > > options.
> > > "-O2"
> > > "-O2 -march=armv8-a+lse"
> > > "-O2 -march=armv8-a"
> >
> > Huh.  Those options make exactly zero difference to the code generated
> > for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread,
> > for either the HEAD definition of TAS() or the CAS patch's version.
> >
> > So now I'm at a loss as to the reason for the performance difference
> > I got.  -march=armv8-a+lse does make a difference to code generation
> > someplace, because the overall size of the postgres executable changes
> > by 16kB or so.  One might argue that the performance difference is due
> > to better code elsewhere than the spinlocks ... but the test I'm running
> > is basically just
> >
> > while (count-- > 0)
> > {
> > XLogGetLastRemovedSegno();
> >
> > CHECK_FOR_INTERRUPTS();
> > }
> >
> > so it's hard to see where a non-spinlock-related code change would come
> > in.  That loop itself definitely generates the same code either way.
> >
> > I did find this interesting output from "clang -v":
> >
> > -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8
> -target-feature +neon -target-feature +crc -target-feature +crypto
> -target-feature +fullfp16 -target-feature +ras -target-feature +lse
> -target-feature +rdm -target-feature +rcpc -target-feature +zcm
> -target-feature +zcz -target-feature +sha2 -target-feature +aes
> >
> > whereas adding -march=armv8-a+lse changes that to just
> >
> > -target-cpu vortex -target-feature +neon -target-feature +lse
> -target-feature +zcm -target-feature +zcz
> >
> > On the whole, that would make one think that -march=armv8-a+lse
> > should produce worse code than the default settings.
>
> Great, thanks.
>
> So, I think the following hypothesis isn't disproved yet.
> 1) On ARM with LSE support, PostgreSQL built with LSE is faster than
> PostgreSQL built without LSE.  Even if the latter is patched with
> anything considered in this thread.
> 2) None of the patches considered in this thread give a clear
> advantage for PostgreSQL built with LSE.
>
> To further confirm this let's wait for Kunpeng 920 tests by Krunal
> Bauskar and Amit Khandekar.  Also it would be nice if someone will run
> benchmarks similar to [1] on Apple M1.
>

--

I have completed benchmarking with lse.

Graph attached.

* For me, lse failed to perform. With lse enabled performance with higher
scalability was infact less than baseline (with TAS).
  [This is true with and without patch (cas)]. So lse is unable to push
things.

* Graph traces all 4 lines (baseline (tas), baseline-patched (cas),
baseline-lse (tas+lse), baseline-patched-lse (cas+lse))

- for select, there is no clear winner but baseline-patched-lse (cas+lse)
perform better than others (2-4%).

- for update/tpcb like baseline-patched (cas) continue to outperform all
the 3 options (10-40%).
  In fact, with lse enabled a regression (compared to baseline/head) is
observed.

[I am not surprised since MySQL showed the same behavior with lse of-course
with a lesser percentage].

I have repeated the test 2 times to confirm the behavior.
Also, to reduce noise I normally run 4 rounds discarding 1st round and
taking the median of the last 3 rounds.

In all, I see consistent numbers.
conf:
https://github.com/mysqlonarm/benchmark-suites/blob/master/pgsql-pbench/conf/pgsql.cnf/postgresql.conf

---

>From all the observations till now following points are clear:

* Patch doesn't seem to make a significant difference with lse enabled.
  Ir-respective of the machine (Kunpeng, Graviton2, M1). [infact introduces
regression with Kunpeng].

* Patch helps generate significant +ve difference in performace with lse
disabled.
  Observed and Confirmed with both Kunpeng, Graviton2.
  (not possible to disable lse on m1).

* Does the patch makes things worse with lse enabled.
  * Graviton2: No (based on Alexander graph)
  * M1: No (based on Tom's

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 20:25, Alexander Korotkov 
wrote:

> On Tue, Dec 1, 2020 at 3:44 PM Krunal Bauskar 
> wrote:
> > I have completed benchmarking with lse.
> >
> > Graph attached.
>
> Thank you for benchmarking.
>
> Now I agree with this comment by Tom Lane
>
> > In general, I'm pretty skeptical of *all* the results posted so far on
> > this thread, because everybody seems to be testing exactly one machine.
> > If there's one thing that it's safe to assume about ARM, it's that
> > there are a lot of different implementations; and this area seems very
> > very likely to differ across implementations.
>
> Different ARM implementations look too different.  As you pointed out,
> LSE is enabled in gcc-10 by default.  I doubt we can accept a patch,
> which gives benefits for specific platform and only when the compiler
> isn't very modern.  Also, we didn't cover all ARM planforms.  Given
> they are so different, we can't guarantee that patch doesn't cause
> regression of some ARM.  Additionally, the effect of the CAS patch
> even for Kunpeng seems modest.  It makes the drop off of TPS more
> smooth, but it doesn't change the trend.
>

There are 2 parts:

** Does CAS patch help scale PGSQL. Yes.*
** Is LSE beneficial for all architectures. Probably No.*

The patch addresses only the former one which is true for all cases.
(Enabling LSE should be an independent process).

gcc-10 made it default but when I read [1] it quotes that canonical decided
to remove it as default
as part of* Ubuntu-20.04 which means LSE has not proven the test of
canonical (probably).*
Also, most of the distro has not yet started shipping GCC-10 which is way
far before it makes it to all distro.

So if we keep the LSE effect aside and just look at the patch from
performance improvement it surely helps
achieve a good gain. I see an improvement in the range of 10-40%.
Amit during his independent testing also observed the gain in the same
range and your testing with G-2 has re-attested the same point.
Pardon me if this is modest as per pgsql standards.

With 1024 scalability PGSQL on other arches (beyond ARM) struggle to scale
so there is something more
inherent that needs to be addressed from a generic perspective.

Also, the said patch non-only helps pgbench kind of workload but other
workloads too.

--

I would request you guys to re-think it from this perspective to help
ensure that PGSQL can scale well on ARM.
s_lock becomes a top-most function and LSE is not a universal solution but
CAS surely helps ease the main bottleneck.

And surely let me know if more data is needed.

Link:
[1]:
https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com

> --
> Regards,
> Alexander Korotkov
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Krunal Bauskar
On Tue, 1 Dec 2020 at 22:19, Tom Lane  wrote:

> Alexander Korotkov  writes:
> > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar 
> wrote:
> >> I would request you guys to re-think it from this perspective to help
> ensure that PGSQL can scale well on ARM.
> >> s_lock becomes a top-most function and LSE is not a universal solution
> but CAS surely helps ease the main bottleneck.
>
> > CAS patch isn't proven to be a universal solution as well.  We have
> > tested the patch on just a few processors, and Tom has seen the
> > regression [1].  The benchmark used by Tom was artificial, but the
> > results may be relevant for some real-life workload.
>
> Yeah.  I think that the main conclusion from what we've seen here is
> that on smaller machines like M1, a standard pgbench benchmark just
> isn't capable of driving PG into serious spinlock contention.  (That
> reflects very well on the work various people have done over the years
> to get rid of spinlock contention, because ten or so years ago it was
> a huge problem on this size of machine.  But evidently, not any more.)
> Per the results others have posted, nowadays you need dozens of cores
> and hundreds of client threads to measure any such issue with pgbench.
>
> So that is why I experimented with a special test that does nothing
> except pound on one spinlock.  Sure it's artificial, but if you want
> to see the effects of different spinlock implementations then it's
> just too hard to get any results with pgbench's regular scripts.
>
> And that's why it disturbs me that the CAS-spinlock patch showed up
> worse in that environment.  The fact that it's not visible in the
> regular pgbench test just means that the effect is too small to
> measure in that test.  But in a test where we *can* measure an effect,
> it's not looking good.
>
> It would be interesting to see some results from the same test I did
> on other processors.  I suspect the results would look a lot different
> from mine ... but we won't know unless someone does it.  Or, if someone
> wants to propose some other test case, let's have a look.
>
> > I'm expressing just my personal opinion, other committers can have
> > different opinions.  I don't particularly think this topic is
> > necessarily a non-starter.  But I do think that given ambiguity we've
> > observed in the benchmark, much more research is needed to push this
> > topic forward.
>
> Yeah.  I'm not here to say "do nothing".  But I think we need results
> from more machines and more test cases to convince ourselves whether
> there's a consistent, worthwhile win from any specific patch.
>

I think there is
*an ambiguity with lse and that has been the*
*source of some confusion* so let's make another attempt to
understand all the observations and then define the next steps.

-


*1. CAS patch (applied on the baseline)*   - Kunpeng: 10-45% improvement
observed [1]
   - Graviton2: 30-50% improvement observed [2]
   - M1: Only select results are available cas continue to maintain a
marginal gain but not significant. [3]
 [inline with what we observed with Kunpeng and Graviton2 for select
results too].


*2. Let's ignore CAS for a sec and just think of LSE independently*   -
Kunpeng: regression observed
   - Graviton2: gain observed
   - M1: regression observed
 [while lse probably is default explicitly enabling it with +lse causes
regression on the head itself [4].
  client=2/4: 1816/714  vs   892/610]

   There is enough reason not to immediately consider enabling LSE given
its unable to perform consistently on all hardware.
-

With those 2 aspects clear let's evaluate what options we have in hand


*1. Enable CAS approach*   *- What we gain:* pgsql scale on
Kunpeng/Graviton2
 (m1 awaiting read-write result but may marginally scale  [[5]: "but
the patched numbers are only about a few percent better"])
   *- What we lose:* Nothing for now.


*2. LSE:*   *- What we gain: *Scaled workload with Graviton2
  * - What we lose:* regression on M1 and Kunpeng.

Let's think of both approaches independently.

- Enabling CAS would help us scale on all hardware (Kunpeng/Graviton2/M1)
- Enabling LSE would help us scale only on some but regress on others.
  [LSE could be considered in the future once it stabilizes and all
hardware adapts to it]

---

*Let me know what do you think about this analysis and any specific
direction that we should consider to help move forward.*

--

Re: Improving spin-lock implementation on ARM.

2020-12-03 Thread Krunal Bauskar
Any updates or further inputs on this.

On Wed, 2 Dec 2020 at 09:27, Krunal Bauskar  wrote:

>
>
> On Tue, 1 Dec 2020 at 22:19, Tom Lane  wrote:
>
>> Alexander Korotkov  writes:
>> > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar 
>> wrote:
>> >> I would request you guys to re-think it from this perspective to help
>> ensure that PGSQL can scale well on ARM.
>> >> s_lock becomes a top-most function and LSE is not a universal solution
>> but CAS surely helps ease the main bottleneck.
>>
>> > CAS patch isn't proven to be a universal solution as well.  We have
>> > tested the patch on just a few processors, and Tom has seen the
>> > regression [1].  The benchmark used by Tom was artificial, but the
>> > results may be relevant for some real-life workload.
>>
>> Yeah.  I think that the main conclusion from what we've seen here is
>> that on smaller machines like M1, a standard pgbench benchmark just
>> isn't capable of driving PG into serious spinlock contention.  (That
>> reflects very well on the work various people have done over the years
>> to get rid of spinlock contention, because ten or so years ago it was
>> a huge problem on this size of machine.  But evidently, not any more.)
>> Per the results others have posted, nowadays you need dozens of cores
>> and hundreds of client threads to measure any such issue with pgbench.
>>
>> So that is why I experimented with a special test that does nothing
>> except pound on one spinlock.  Sure it's artificial, but if you want
>> to see the effects of different spinlock implementations then it's
>> just too hard to get any results with pgbench's regular scripts.
>>
>> And that's why it disturbs me that the CAS-spinlock patch showed up
>> worse in that environment.  The fact that it's not visible in the
>> regular pgbench test just means that the effect is too small to
>> measure in that test.  But in a test where we *can* measure an effect,
>> it's not looking good.
>>
>> It would be interesting to see some results from the same test I did
>> on other processors.  I suspect the results would look a lot different
>> from mine ... but we won't know unless someone does it.  Or, if someone
>> wants to propose some other test case, let's have a look.
>>
>> > I'm expressing just my personal opinion, other committers can have
>> > different opinions.  I don't particularly think this topic is
>> > necessarily a non-starter.  But I do think that given ambiguity we've
>> > observed in the benchmark, much more research is needed to push this
>> > topic forward.
>>
>> Yeah.  I'm not here to say "do nothing".  But I think we need results
>> from more machines and more test cases to convince ourselves whether
>> there's a consistent, worthwhile win from any specific patch.
>>
>
> I think there is
> *an ambiguity with lse and that has been the*
> *source of some confusion* so let's make another attempt to
> understand all the observations and then define the next steps.
>
> -
>
>
> *1. CAS patch (applied on the baseline)*   - Kunpeng: 10-45% improvement
> observed [1]
>- Graviton2: 30-50% improvement observed [2]
>- M1: Only select results are available cas continue to maintain a
> marginal gain but not significant. [3]
>  [inline with what we observed with Kunpeng and Graviton2 for select
> results too].
>
>
> *2. Let's ignore CAS for a sec and just think of LSE independently*   -
> Kunpeng: regression observed
>- Graviton2: gain observed
>- M1: regression observed
>  [while lse probably is default explicitly enabling it with +lse
> causes regression on the head itself [4].
>   client=2/4: 1816/714  vs   892/610]
>
>There is enough reason not to immediately consider enabling LSE given
> its unable to perform consistently on all hardware.
> -
>
> With those 2 aspects clear let's evaluate what options we have in hand
>
>
> *1. Enable CAS approach*   *- What we gain:* pgsql scale on
> Kunpeng/Graviton2
>  (m1 awaiting read-write result but may marginally scale  [[5]: "but
> the patched numbers are only about a few percent better"])
>*- What we lose:* Nothing for now.
>
>
> *2. LSE:*   *- What we gain: *Scaled workload with Graviton2
>   * - What we lose:* regression on M1 and Kunpeng.
>
> Let's think of both approaches

Re: Improving spin-lock implementation on ARM.

2020-12-08 Thread Krunal Bauskar
On Thu, 3 Dec 2020 at 21:32, Tom Lane  wrote:

> Krunal Bauskar  writes:
> > Any updates or further inputs on this.
>
> As far as LSE goes: my take is that tampering with the
> compiler/platform's default optimization options requires *very*
> strong evidence, which we have not got and likely won't get.  Users
> who are building for specific hardware can choose to supply custom
> CFLAGS, of course.  But we shouldn't presume to do that for them,
> because we don't know what they are building for, or with what.
>
> I'm very willing to consider the CAS spinlock patch, but it still
> feels like there's not enough evidence to show that it's a universal
> win.  The way to move forward on that is to collect more measurements
> on additional ARM-based platforms.  And I continue to think that
> pgbench is only a very crude tool for testing spinlock performance;
> we should look at other tests.
>

Thanks Tom.

Given pg-bench limited option I decided to try things with sysbench to
expose
the real contention using zipfian type (zipfian pattern causes part of the
database
to get updated there-by exposing main contention point).


*Baseline for 256 threads update-index use-case:*
-   44.24%174935  postgres postgres [.] s_lock
transactions:
transactions:5587105 (92988.40 per sec.)

*Patched for 256 threads update-index use-case:*
 0.02%80  postgres  postgres  [.] s_lock
transactions:
transactions:10288781 (171305.24 per sec.)

*perf diff*

* 0.02%+44.22%  postgres [.] s_lock*


As we see from the above result s_lock is exposing major contention that
could be relaxed using the
said cas patch. Performance improvement in range of 80% is observed.

Taking this guideline we decided to run it for all scalability for update
and non-update use-case.
Check the attached graph. Consistent improvement is observed.

I presume this should help re-establish that for major contention cases
existing tas approach will always give up.

---

Unfortunately, I don't have access to different ARM arch except for Kunpeng
or Graviton2 where
we have already proved the value of the patch.
[ref: Apple M1 as per your evaluation patch doesn't show regression for
select. Maybe if possible can you try update scenarios too].

Do you know anyone from the community who has access to other ARM arches we
can request them to evaluate?
But since it is has proven on 2 independent ARM arch I am pretty confident
it will scale with other ARM arches too.


>
> From a system structural standpoint, I seriously dislike that lwlock.c
> patch: putting machine-specific variant implementations into that file
> seems like a disaster for maintainability.  So it would need to show a
> very significant gain across a range of hardware before I'd want to
> consider adopting it ... and it has not shown that.
>
> regards, tom lane
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-12-10 Thread Krunal Bauskar
On Tue, 8 Dec 2020 at 14:33, Krunal Bauskar  wrote:

>
>
> On Thu, 3 Dec 2020 at 21:32, Tom Lane  wrote:
>
>> Krunal Bauskar  writes:
>> > Any updates or further inputs on this.
>>
>> As far as LSE goes: my take is that tampering with the
>> compiler/platform's default optimization options requires *very*
>> strong evidence, which we have not got and likely won't get.  Users
>> who are building for specific hardware can choose to supply custom
>> CFLAGS, of course.  But we shouldn't presume to do that for them,
>> because we don't know what they are building for, or with what.
>>
>> I'm very willing to consider the CAS spinlock patch, but it still
>> feels like there's not enough evidence to show that it's a universal
>> win.  The way to move forward on that is to collect more measurements
>> on additional ARM-based platforms.  And I continue to think that
>> pgbench is only a very crude tool for testing spinlock performance;
>> we should look at other tests.
>>
>
> Thanks Tom.
>
> Given pg-bench limited option I decided to try things with sysbench to
> expose
> the real contention using zipfian type (zipfian pattern causes part of the
> database
> to get updated there-by exposing main contention point).
>
>
> 
> *Baseline for 256 threads update-index use-case:*
> -   44.24%174935  postgres postgres [.] s_lock
> transactions:
> transactions:5587105 (92988.40 per sec.)
>
> *Patched for 256 threads update-index use-case:*
>  0.02%80  postgres  postgres  [.] s_lock
> transactions:
> transactions:10288781 (171305.24 per sec.)
>
> *perf diff*
>
> * 0.02%+44.22%  postgres [.] s_lock*
> 
>
> As we see from the above result s_lock is exposing major contention that
> could be relaxed using the
> said cas patch. Performance improvement in range of 80% is observed.
>
> Taking this guideline we decided to run it for all scalability for update
> and non-update use-case.
> Check the attached graph. Consistent improvement is observed.
>
> I presume this should help re-establish that for major contention cases
> existing tas approach will always give up.
>
>
> ---
>
> Unfortunately, I don't have access to different ARM arch except for
> Kunpeng or Graviton2 where
> we have already proved the value of the patch.
> [ref: Apple M1 as per your evaluation patch doesn't show regression for
> select. Maybe if possible can you try update scenarios too].
>
> Do you know anyone from the community who has access to other ARM arches
> we can request them to evaluate?
> But since it is has proven on 2 independent ARM arch I am pretty confident
> it will scale with other ARM arches too.
>
>

Any direction on how we can proceed on this?

* We have tested it with both cloud vendors that provide ARM instances.
* We have tested it with Apple M1 (partially at-least)
* Ampere use to provide instance on packet.com but now it is an evaluation
program only.

No other active arm instance offering a cloud provider.

Given our evaluation so far has proven to be +ve can we think of
considering it on basis of the available
data which is quite encouraging with 80% improvement seen for heavy
contention use-cases.



>
>> From a system structural standpoint, I seriously dislike that lwlock.c
>> patch: putting machine-specific variant implementations into that file
>> seems like a disaster for maintainability.  So it would need to show a
>> very significant gain across a range of hardware before I'd want to
>> consider adopting it ... and it has not shown that.
>>
>> regards, tom lane
>>
>
>
> --
> Regards,
> Krunal Bauskar
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-12-14 Thread Krunal Bauskar
Wondering if we can take this to completion (any idea what more we could
do?).

On Thu, 10 Dec 2020 at 14:48, Krunal Bauskar 
wrote:

>
> On Tue, 8 Dec 2020 at 14:33, Krunal Bauskar 
> wrote:
>
>>
>>
>> On Thu, 3 Dec 2020 at 21:32, Tom Lane  wrote:
>>
>>> Krunal Bauskar  writes:
>>> > Any updates or further inputs on this.
>>>
>>> As far as LSE goes: my take is that tampering with the
>>> compiler/platform's default optimization options requires *very*
>>> strong evidence, which we have not got and likely won't get.  Users
>>> who are building for specific hardware can choose to supply custom
>>> CFLAGS, of course.  But we shouldn't presume to do that for them,
>>> because we don't know what they are building for, or with what.
>>>
>>> I'm very willing to consider the CAS spinlock patch, but it still
>>> feels like there's not enough evidence to show that it's a universal
>>> win.  The way to move forward on that is to collect more measurements
>>> on additional ARM-based platforms.  And I continue to think that
>>> pgbench is only a very crude tool for testing spinlock performance;
>>> we should look at other tests.
>>>
>>
>> Thanks Tom.
>>
>> Given pg-bench limited option I decided to try things with sysbench to
>> expose
>> the real contention using zipfian type (zipfian pattern causes part of
>> the database
>> to get updated there-by exposing main contention point).
>>
>>
>> 
>> *Baseline for 256 threads update-index use-case:*
>> -   44.24%174935  postgres postgres [.]
>> s_lock
>> transactions:
>> transactions:5587105 (92988.40 per sec.)
>>
>> *Patched for 256 threads update-index use-case:*
>>  0.02%80  postgres  postgres  [.] s_lock
>> transactions:
>> transactions:10288781 (171305.24 per sec.)
>>
>> *perf diff*
>>
>> * 0.02%+44.22%  postgres [.] s_lock*
>> 
>>
>> As we see from the above result s_lock is exposing major contention that
>> could be relaxed using the
>> said cas patch. Performance improvement in range of 80% is observed.
>>
>> Taking this guideline we decided to run it for all scalability for update
>> and non-update use-case.
>> Check the attached graph. Consistent improvement is observed.
>>
>> I presume this should help re-establish that for major contention cases
>> existing tas approach will always give up.
>>
>>
>> ---
>>
>> Unfortunately, I don't have access to different ARM arch except for
>> Kunpeng or Graviton2 where
>> we have already proved the value of the patch.
>> [ref: Apple M1 as per your evaluation patch doesn't show regression for
>> select. Maybe if possible can you try update scenarios too].
>>
>> Do you know anyone from the community who has access to other ARM arches
>> we can request them to evaluate?
>> But since it is has proven on 2 independent ARM arch I am pretty
>> confident it will scale with other ARM arches too.
>>
>>
>
> Any direction on how we can proceed on this?
>
> * We have tested it with both cloud vendors that provide ARM instances.
> * We have tested it with Apple M1 (partially at-least)
> * Ampere use to provide instance on packet.com but now it is an
> evaluation program only.
>
> No other active arm instance offering a cloud provider.
>
> Given our evaluation so far has proven to be +ve can we think of
> considering it on basis of the available
> data which is quite encouraging with 80% improvement seen for heavy
> contention use-cases.
>
>
>
>>
>>> From a system structural standpoint, I seriously dislike that lwlock.c
>>> patch: putting machine-specific variant implementations into that file
>>> seems like a disaster for maintainability.  So it would need to show a
>>> very significant gain across a range of hardware before I'd want to
>>> consider adopting it ... and it has not shown that.
>>>
>>> regards, tom lane
>>>
>>
>>
>> --
>> Regards,
>> Krunal Bauskar
>>
>
>
> --
> Regards,
> Krunal Bauskar
>


-- 
Regards,
Krunal Bauskar


Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
Improving spin-lock implementation on ARM.


* Spin-Lock is known to have a significant effect on performance
  with increasing scalability.

* Existing Spin-Lock implementation for ARM is sub-optimal due to
  use of TAS (test and swap)

* TAS is implemented on ARM as load-store so even if the lock is not free,
  store operation will execute to replace the same value.
  This redundant operation (mainly store) is costly.

* CAS is implemented on ARM as load-check-store-check that means if the
  lock is not free, check operation, post-load will cause the loop to
  return there-by saving on costlier store operation. [1]

* x86 uses optimized xchg operation.
  ARM too started supporting it (using Large System Extension) with
  ARM-v8.1 but since it not supported with ARM-v8, GCC default tends
  to roll more generic load-store assembly code.

* gcc-9.4+ onwards there is support for outline-atomics that could emit
  both the variants of the code (load-store and cas/swp) and based on
  underlying supported architecture proper variant it used but still a lot
  of distros don't support GCC-9.4 as the default compiler.

* In light of this, we would like to propose a CAS-based approach based on
  our local testing has shown improvement in the range of 10-40%.
  (attaching graph).

* Patch enables CAS based approach if the CAS is supported depending on
  existing compiled flag HAVE_GCC__ATOMIC_INT32_CAS

(Thanks to Amit Khandekar for rigorously performance testing this patch
 with different combinations).

[1]: https://godbolt.org/z/jqbEsa

P.S: Sorry if I missed any standard pgsql protocol since I am just starting
with pgsql.

---
Krunal Bauskar
#mysqlonarm
Huawei Technologies
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6..940fdcd 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -321,7 +321,24 @@ tas(volatile slock_t *lock)
  * than other widths.
  */
 #if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC__SYNC_INT32_TAS
+
+#ifdef HAVE_GCC__ATOMIC_INT32_CAS
+/* just reusing the same flag to avoid re-declaration of default tas functions below */
+#define HAS_TEST_AND_SET
+
+#define TAS(lock) cas(lock)
+typedef int slock_t;
+
+static __inline__ int
+cas(volatile slock_t *lock)
+{
+	slock_t expected = 0;
+	return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
+false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
+}
+
+#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE);
+#elif HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
 
 #define TAS(lock) tas(lock)


Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
scalability   baseline   patched
---- --
  updatetpcb  update tpcb
--
128   107932 78554  108081 78569
25682877 64682101543 73774
51255174 46494  77886 61105
1024   32267 27020 33170 30597

configuration:
https://github.com/mysqlonarm/benchmark-suites/blob/master/pgsql-pbench/conf/pgsql.cnf/postgresql.conf


On Thu, 26 Nov 2020 at 10:36, Michael Paquier  wrote:

> On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:
> > (Thanks to Amit Khandekar for rigorously performance testing this patch
> >  with different combinations).
>
> For the simple-update and tpcb-like graphs, do you have any actual
> numbers to share between 128 and 1024 connections?  The blue lines
> look like they are missing some measurements in-between, so it is hard
> to tell if this is an actual improvement or just some lack of data.
> --
> Michael
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
On Thu, 26 Nov 2020 at 10:50, Tom Lane  wrote:

> Michael Paquier  writes:
> > On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:
> >> (Thanks to Amit Khandekar for rigorously performance testing this patch
> >> with different combinations).
>
> > For the simple-update and tpcb-like graphs, do you have any actual
> > numbers to share between 128 and 1024 connections?
>
> Also, exactly what hardware/software platform were these curves
> obtained on?
>

Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
server and 8 for client) [2 numa nodes]
Storage: 3.2 TB NVMe SSD
OS: CentOS Linux release 7.6
PGSQL: baseline = Release Tag 13.1
Invocation suite:
https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses
pgbench)



>         regards, tom lane
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-11-26 Thread Krunal Bauskar
On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas  wrote:

> On 26/11/2020 06:30, Krunal Bauskar wrote:
> > Improving spin-lock implementation on ARM.
> > 
> >
> > * Spin-Lock is known to have a significant effect on performance
> >with increasing scalability.
> >
> > * Existing Spin-Lock implementation for ARM is sub-optimal due to
> >use of TAS (test and swap)
> >
> > * TAS is implemented on ARM as load-store so even if the lock is not
> free,
> >store operation will execute to replace the same value.
> >This redundant operation (mainly store) is costly.
> >
> > * CAS is implemented on ARM as load-check-store-check that means if the
> >lock is not free, check operation, post-load will cause the loop to
> >return there-by saving on costlier store operation. [1]
>
> Can you add some code comments to explain that why CAS is cheaper than
> TAS on ARM?
>

1. As Alexey too pointed out in followup email
CAS = load value -> check if the value is expected -> if yes then replace
(store new value) -> else exit/break
TAS = load value -> store new value

This means each TAS would be converted to 2 operations that are LOAD and
STORE and of-course
STORE is costlier in terms of latency. CAS ensures optimization in this
regard with an early check.

Let's look at some micro-benchmarking. I implemented a simple spin-loop
using both approaches and
as expected with increase scalability, CAS continues to out-perform TAS by
a margin up to 50%.

 TAS 
Running 128 parallelism
Elapsed time: 1.34271 s
Running 256 parallelism
Elapsed time: 3.6487 s
Running 512 parallelism
Elapsed time: 11.3725 s
Running 1024 parallelism
Elapsed time: 43.5376 s
 CAS 
Running 128 parallelism
Elapsed time: 1.00131 s
Running 256 parallelism
Elapsed time: 2.53202 s
Running 512 parallelism
Elapsed time: 7.66829 s
Running 1024 parallelism
Elapsed time: 22.6294 s

This could be also observed from the perf profiling

TAS:
 15.57 │44:   ldxr  w0, [x19]
 83.93 │  stxr  w1, w21, [x19]

CAS:
 81.29 │58: ↓ b.ne  cc

  9.86 │cc:   ldaxr w0, [x22]
  8.84 │  cmp   w0, #0x0
   │↑ b.ne  58
   │  stxr  w1, w20, [x22]

*In TAS: STORE is pretty costly.*

2.  I have added the needed comment in the patch. Updated patch attached.

--
Thanks for taking look at this and surely let me know if any more info is
needed.


>
> Is there some official ARM documentation, like a programmer's reference
> manual or something like that, that would show a reference
> implementation of a spinlock on ARM? It would be good to refer to an
> authoritative source on this.
>
> - Heikki
>


-- 
Regards,
Krunal Bauskar
#include 
#include 
#include 
#include 
#include 

#define likely(x)   __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

const size_t data_block_size = 64*1024;
char data_block[64*1024];

size_t k_iter=100;
unsigned long global_crcval = 0;

/*  spin-lock  */
uint32_t lock_unit;

void lock()
{
  while (true) {
uint32_t expected = 0;
#ifdef CAS
if (!__atomic_compare_exchange_n(&lock_unit, &expected, 1, false,
 __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
#elif TAS
if (__sync_lock_test_and_set(&lock_unit, 1)) {
#endif
  /* Spin-and-Retry as the lock is acquired by some other thread */
  __asm__ __volatile__("" ::: "memory");
  continue;
}
break;
  }
}

void unlock()
{
#ifdef CAS
  __atomic_store_n(&lock_unit, 0, __ATOMIC_RELEASE);
#else
  __sync_lock_release(&lock_unit);
#endif
}

/*  workload  */
void workload_execute()
{
  for (size_t i = 0; i < k_iter; ++i) {
lock();
/* Each thread try to take lock -> execute critical section -> unlock */
memset(data_block, rand() % 255, data_block_size);
unsigned long crcval = 0;
crc32(crcval, (const unsigned char *)data_block, data_block_size);
global_crcval += crcval;
unlock();
  }
}

int main(int argc, char *argv[]) {
  if (argc != 2) {
std::cerr << "usage:  " << std::endl;
return 1;
  }
  size_t num_of_threads = atol(argv[1]);

  std::thread* handles[num_of_threads];

  auto start = std::chrono::high_resolution_clock::now();

  for (size_t i = 0; i < num_of_threads; ++i) {
handles[i] = new std::thread(workload_execute);
  }
  for (size_t i = 0; i < num_of_threads; ++i) {
handles[i]->join();
  }

  auto finish = std::chrono::high_resolution_clock::now();

  for (size_t i = 0; i < num_of_threads; ++i) {
delete handles[i];
  }

  std::chrono::duration elapsed = finish - start;
  std::cout << "Elapsed time: " << elapsed.count() << " 

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Krunal Bauskar
On Sun, 29 Nov 2020 at 22:23, Alexander Korotkov 
wrote:

> On Sat, Nov 28, 2020 at 1:31 PM Alexander Korotkov 
> wrote:
> > I guess that might depend on the implementation of CAS and TAS.  I bet
> > usage of CAS in spinlock gives advantage when ldxr/stxr are used, but
> > not when swpal/casa are used.  I found out that I can force clang to
> > use swpal/casa by setting "-march=armv8-a+lse".  I'm going to make
> > some experiments on a multicore AWS graviton2 instance with different
> > atomic implementation.
>
> I've made some benchmarks on c6gd.16xlarge ec2 instance with graviton2
> processor of 64 virtual CPUs (graphs and raw results are attached).
> I've analyzed two patches: spinlock using cas by Krunal Bauskar, and
> my implementation of lwlock using lwrex/strex.  My arm lwlock patch
> has the same idea as my previous patch for power: we can put lwlock
> attempt logic between lwrex and strex.  In spite of my previous power
> patch, the arm patch doesn't contain assembly: instead I've used
> C-wrappers over lwrex/strex.
>
> The first series of experiments I've made using standard compiling
> options.  So, LSE instructions from ARM v8.1 weren't used.  Atomics
> were implemented using lwrex/strex pair.
>
> In the read-only benchmark, both spinlock (cas-spinlock graph) and
> lwlock (ldrew-strex-lwlock graph) patches give observable performance
> gain of similar value.   However, performance of combination of these
> patches (ldrew-strex-lwlock-cas-spinlock graph) is close to
> performance of unpatched version.  That could be counterintuitive, but
> I've rechecked that multiple times.
>
> In the read-write benchmark, both spinlock and lwlock patches give
> more significant performance gain, and lwlock patch gives more effect
> than spinlock patch.  Noticeable, that combination of patches now
> gives some cumulative effect instead of counterintuitive slowdown.
>
> Then I've tried to compile postgres with LSE instruction using
> "-march=armv8-a+lse" flag with clang (graphs with -lse suffix).  The
> effect of LSE is HUGE!!!  Unpatched version with LSE is times faster
> than any version without LSE on high concurrency.  In the both
> read-only and read-write benchmarks spinlock patch doesn't show any
> significant difference.  The lwlock patch shows a great slowdown with
> LSE.  Noticeable, in read-write benchmark, lwlock patch shows worse
> results than unpatched version without LSE.  Probably, combining
> different atomics implementations isn't a good idea.
>
> It seems that ARM Kunpeng 920 should support ARM v8.1.  I wonder if
> the published benchmarks results were made with LSE.  I suspect that
> it was not.  It would be nice to repeat the same benchmarks with LSE.
> I'd like to ask Krunal Bauskar and Amit Khandekar to repeat these
> benchmarks with LSE.
>
> My preliminary conclusions are so:
> 1) Since the effect of LSE is so huge, we should advise users of
> multicore ARM servers to compile PostgreSQL with LSE support.  We
> probably should provide separate packaging for ARM v8.1 and higher
> (packages for ARM v8 are still needed for raspberry etc).
> 2) It seems that atomics in ARM v8.1 becomes very similar to x86
> atomics, and it doesn't need special optimizations.  And I think ARM
> v8 processors don't have so many cores and aren't so heavily used in
> high-concurrent environments.  So, special optimizations for ARM v8
> probably aren't worth it.
>

Thanks for the detailed results.

1. Results we shared are w/o lse enabled so using traditional store/load
approach.

2. As you pointed out LSE is enabled starting only with arm-v8.1 but not
all aarch64 tag machines are arm-v8.1 compatible.
This means we would need a separate package or a more optimal way would
be to compile pgsql with gcc-9.4 (or gcc-10.x (default)) with
-moutline-atomics that would emit both traditional and lse code and
flow would dynamically select depending on the target machine
(I have blogged about it in MySQL context
https://mysqlonarm.github.io/ARM-LSE-and-MySQL/)

3. Problem with GCC approach is still a lot of distro don't support gcc 9.4
as default.
To use this approach:
* PGSQL will have to roll out its packages using gcc-9.4+ only so that
they are compatible with all aarch64 machines
* but this continues to affect all other users who tend to build pgsql
using standard distro based compiler. (unless they upgrade compiler).



So given all the permutations and combinations, I think we could approach
the problem as follows:

* Enable use of CAS as it is known to have optimal performance (vs TAS)

* Even with LSE enabled, CAS to continue to perform (on par or marginally
better

Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Krunal Bauskar
On Mon, 30 Nov 2020 at 10:14, Tom Lane  wrote:

> Krunal Bauskar  writes:
> > So given all the permutations and combinations, I think we could approach
> > the problem as follows:
>
> > * Enable use of CAS as it is known to have optimal performance (vs TAS)
>
> The results I posted at [1] seem to contradict this for Apple's new
> machines.
>
> In general, I'm pretty skeptical of *all* the results posted so far on
> this thread, because everybody seems to be testing exactly one machine.
> If there's one thing that it's safe to assume about ARM, it's that
> there are a lot of different implementations; and this area seems very
> very likely to differ across implementations.
>

For the results you saw on Mac-Mini was LSE enabled by default.
Maybe clang does it on mac by default to harvest performance.
If that is the case then your results are still inline with what Alexander
posted.

On other hand, Peter saw some % improvement on M1 MacBook.

--

* Can you please check/confirm the LSE enable status on MAC (default).
* Also, if possible run some quick micro-benchmark to help understand how
TAS and CAS perform
on MAC. It would be a big surprise to learn that M1 can execute TAS better
than CAS (without lse).
* I would also suggest if possible try with higher scalability (more than 4
to check if with increase scalability CAS out-perform).

Results I have seen to date (in different contexts too), CAS has
outperformed on Kunpeng, Graviton2, (Other ARM arch), ... (matches with
theory too unless LSE is enabled).
If M1 goes the other way around that would be a big surprise for the arm
community too.



>
> I don't have a big problem with catering for a few different spinlock
> implementations on ARM ... but it's sure unclear how we could decide
> which one to use.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/741389.1606530...@sss.pgh.pa.us
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-11-29 Thread Krunal Bauskar
On Mon, 30 Nov 2020 at 11:38, Tom Lane  wrote:

> Krunal Bauskar  writes:
> > On Mon, 30 Nov 2020 at 10:14, Tom Lane  wrote:
> >> The results I posted at [1] seem to contradict this for Apple's new
> >> machines.
>
> > For the results you saw on Mac-Mini was LSE enabled by default.
>
> Hmm, I don't know how to get Apple's clang to admit what its default
> settings are ... anybody?
>
> However, it does accept "-march=armv8-a+lse", and that seems to
> not be the default, because I get different results from my spinlock-
> pounding test than I did yesterday.  Abbreviating into a table:
>
> --- CFLAGS=-O2 ---  --- CFLAGS="-O2
> -march=armv8-a+lse" ---
>
> TPS HEADCAS patch   HEADCAS patch
>
> clients=1   2127217426122722
> clients=2   1816859 892 950
> clients=4   714 519 610 468
> clients=8   -   -   108 185
>

Thanks for trying this Tom.

-

Some of us may be surprised by the fact that enabling lse is causing
regression (1816 -> 892 or 714 -> 610) with HEAD itself.
While lse is meant to improve the performance. This, unfortunately, is not
always the case at-least based on my previous experience with LSE.too.

I am still wondering why CAS is slower than TAS on M1. What is special on
M1 that other ARM archs has not picked up.

Tom, Sorry to bother you again but this is arising a lot of curiosity about
M1.
Whenever you get time can do some micro-benchmarking on M1 (to understand
TAS vs CAS).
Also, if you can share assembly code is emitted for the TAS vs CAS.


>
> Unfortunately, that still doesn't lead me to think that either LSE
> or CAS are net wins on this hardware.  It's quite clear that LSE
> makes the uncontended case a good bit faster, but the contended case
> is a lot worse, so is that really a tradeoff we want?
>
> > * I would also suggest if possible try with higher scalability (more
> than 4
> > to check if with increase scalability CAS out-perform).
>
> As I said yesterday, running more than 4 processes is just going
> to bring the low-performance cores into the equation, which is likely
> to swamp any interesting comparison.  I did run the test with "-c 8"
> today, as shown in the right-hand columns, and the results seem
> to bear that out.
>
> regards, tom lane
>


-- 
Regards,
Krunal Bauskar