Hi,

On 6/25/24 12:04, Tomas Vondra wrote:


On 6/24/24 17:05, Robert Haas wrote:
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.
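
(For reference, the relevant definitions in lwlock.h currently look like this;
the patch presumably just bumps the LOG2 value from 4 to 6:)

    /* number of partitions the shared lock tables are divided into */
    #define LOG2_NUM_LOCK_PARTITIONS  4
    #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)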

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.


Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not
affect that. Or if it does, it's likely a reasonable trade off as it
reduces the risk of lock contention.

That being said, I don't recall benchmarking this patch in isolation,
only with the other patches. Maybe I should do that ...

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these pieces as if in a
hash table, and in each of those 16-element pieces we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
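
Roughly speaking, each PGPROC would carry something like this (the names are
just illustrative, not necessarily what the patch uses):

    #define FP_LOCK_SLOTS_PER_GROUP     16
    #define FP_LOCK_GROUPS_PER_BACKEND  64

    typedef struct FastPathGroup
    {
        uint64  lockBits;                        /* lock modes, a few bits per slot */
        Oid     relId[FP_LOCK_SLOTS_PER_GROUP];  /* relation OIDs, InvalidOid = free */
    } FastPathGroup;

    /* in PGPROC: one group per "piece", each OID hashes to exactly one group */
    FastPathGroup fpGroups[FP_LOCK_GROUPS_PER_BACKEND];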

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.


OK. I didn't realize global variables start out zeroed.


I haven't fixed this yet, but it's pretty clear the "init" flag is not really needed, because the existing code even got the memset() wrong:

memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseInitialized));

This only resets one byte of the array, yet it still worked correctly.
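
Presumably the intent was to clear the whole counts array, i.e. something like

    memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));

but as you say, with a zero-initialized global both the memset and the flag
can simply go away.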

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)


Yeah, definitely needs comment explaining this.

I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.
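
If we wanted something a little less ad hoc than two arbitrary primes, a
simple multiplicative (Fibonacci-style) hash would be about as cheap and
mixes the bits a bit better. A sketch, assuming the group count stays a
power of two (this is just one option, not what the patch currently does):

    /* map a relation OID to one of the fast-path groups */
    #define FAST_PATH_LOCK_REL_GROUP(rel) \
        ((((uint64) (rel) * UINT64CONST(0x9E3779B97F4A7C15)) >> 32) & \
         (FP_LOCK_GROUPS_PER_BACKEND - 1))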

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.


I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me of how
set-associative caches work (e.g. with cachelines in CPU caches). It's
essentially a 16-way set-associative cache: each entry maps to one group,
and may occupy any of the 16 slots within that group.

I must have been reading some papers in this area shortly before the PoC
patch, and the idea came from there, probably. Which is good, because it
means it's a well-understood and widely-used approach.
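
So the lookup stays the same linear search we do now, just restricted to the
16 slots of the group the OID hashes to - roughly (again, illustrative names
matching the sketch above):

    static int
    FastPathGroupSearch(PGPROC *proc, Oid relid)
    {
        uint32          group = FAST_PATH_LOCK_REL_GROUP(relid);
        FastPathGroup  *g = &proc->fpGroups[group];

        for (int i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
        {
            if (g->relId[i] == relid)
                return i;       /* slot index within the group */
        }
        return -1;              /* no slot - use the shared lock table */
    }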

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.


That's an excellent question. I don't know.

I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
even with a modest number of partitions. When I was thinking about using
a higher value, my main concern was that it'd make the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.
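
Just to spell out the arithmetic behind that ~4.5kB:

    1 group   = 16 slots * 4B (Oid) + 8B (uint64 bitmap) =   72B
    64 groups = 64 * 72B                                  = 4608B (~4.5kB)

so 1024 groups would be ~72kB per backend, and 65536 groups ~4.5MB.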

But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK the L3 is shared by all the threads
running on the CPU.

I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.


I finally got to do those experiments. The scripts and results (both raw and summarized) are too big to attach here, but everything is available at

    https://github.com/tvondra/scalability-tests

The initial patch used 64 groups (which means 1024 fast-path slots). I ran the tests with 0, 1, 8, 32, 128, 512 and 1024 groups (so up to 16k fast-path locks). I thought about testing with ~64k groups too, but I didn't go for that extreme value because I don't quite see the point.

It would only matter for cases with a truly extreme number of partitions (64k groups is ~1M fast-path slots), and just creating enough partitions would take a lot of time. Moreover, with that many partitions we seem to run into various other bottlenecks, so improving this alone does not make such setups really practical. And it's so slow that the benchmark results would be somewhat bogus anyway.

Because if we achieve 50 tps with 1000 partitions, does it really matter if a patch changes that to 25 or 100 tps? I doubt it, especially if going to 100 partitions gives you 2000 tps. Now imagine you have 10k or 100k partitions - how fast is that going to be?

So I think stopping at 1024 groups is sensible, and if there are some inefficiencies / costs, I'd expect those to gradually show up even at those lower sizes.

But if you look at the results, for example from the "join" test:

  https://github.com/tvondra/scalability-tests/blob/main/join.pdf

there's no such negative effect. The table shows results for different combinations of parameters, with the first group of columns being regular glibc and the second one glibc with tuning (see [1] for details). The values are for different numbers of fast-path groups (0 means the patch was not applied).

And the color scale shows the impact of increasing the number of groups. So for example when a column for "32 groups" says 150%, it means going from 8 to 32 groups improved throughput to 1.5x. As usual, green is "good" and red is "bad".

But if you look at the tables, there's very little change - most of the values are close to 100%. This might seem a bit strange, considering the promise of these patches is to improve throughput, and "no change" is an absence of that. But that's because the tables illustrate the effect of changing the group count with the other parameters fixed - they never compare runs with and without the glibc tuning, and that's an important part of the improvement. Doing the pivot table a bit differently would still show a substantial 2-3x improvement.

There's a fair amount of noise - especially for the rpi5 machines (not the right hw for sensitive benchmarks), but also on some i5/xeon runs. I attribute this to only doing one short run (10s) for each combination of parameters. I'll do more runs next time.

Anyway, I think these results show a couple things:

1) There's no systemic negative effect of increasing the number of groups. We could go with 32k or 64k groups, and it doesn't seem like there would be a problem.

2) But there's not much point in doing that, because we run into various other bottlenecks well before having that many locks. Judging by the results, it doesn't seem that going beyond 32 or 64 groups would give us much.

3) The memory allocation caching (be it the mempool patch, or the glibc tuning like in this test round) is a crucial piece for this. Not doing that means some tests get no improvement at all, or a much smaller one.

4) The increase of NUM_LOCK_PARTITIONS has very limited effect, or perhaps even no effect at all.


Based on this, my plan is to polish the patch adding fast-path groups, with either 32 or 64 groups, which seem to be reasonable values. Then in the future, if/when the other bottlenecks get addressed, we can rethink and increase this.

This however reminds me that all those machines are pretty small. Which is good for showing it doesn't break existing/smaller systems, but the initial goal of the patch was to improve behavior on big boxes. I don't have access to the original box at the moment, so if someone could provide access to one of those big epyc/xeon machines with 100+ cores for a couple of days, that would be helpful.


That being said, I think it's pretty clear how serious the issue with memory allocation overhead can be, especially in cases where the existing memory context caching is ineffective (like for the btree palloc). I'm not sure what to do about that. The mempool patch shared in this thread does the trick, but it's fairly complex/invasive. I still like it, but maybe doing something with the glibc tuning would be enough - it's not as effective, but 80% of the improvement is better than no improvement.
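
(To be clear, by "glibc tuning" I mean nudging malloc so it does not keep
returning memory to the kernel only to request it back right away - the exact
settings are in [1], but the general shape is either exporting the
MALLOC_TOP_PAD_ / MALLOC_TRIM_THRESHOLD_ environment variables before
starting the server, or calling mallopt() early on, e.g.:

    #include <malloc.h>

    /* illustrative values - keep ~64MB cached instead of trimming eagerly */
    mallopt(M_TOP_PAD, 64 * 1024 * 1024);
    mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);

with the usual caveat that these are glibc-specific knobs.)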



regards


[1] https://www.postgresql.org/message-id/0da51f67-c88b-497e-bb89-d5139309e...@enterprisedb.com

--
Tomas Vondra

