Hi,

On 6/25/24 12:04, Tomas Vondra wrote:


On 6/24/24 17:05, Robert Haas wrote:
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.
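
(For reference, the relevant definitions in lwlock.h currently look like this;
the patch presumably just bumps the LOG2 value from 4 to 6:)

    /* number of partitions the shared lock tables are divided into */
    #define LOG2_NUM_LOCK_PARTITIONS  4
    #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)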

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.


Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not
affect that. Or if it does, it's likely a reasonable trade off as it
reduces the risk of lock contention.

That being said, I don't recall benchmarking this patch in isolation,
only with the other patches. Maybe I should do that ...

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these pieces as if in a
hash table, and in each of those 16-element pieces we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
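
Roughly speaking, each PGPROC would carry something like this (the names are
just illustrative, not necessarily what the patch uses):

    #define FP_LOCK_SLOTS_PER_GROUP     16
    #define FP_LOCK_GROUPS_PER_BACKEND  64

    typedef struct FastPathGroup
    {
        uint64  lockBits;                        /* lock modes, a few bits per slot */
        Oid     relId[FP_LOCK_SLOTS_PER_GROUP];  /* relation OIDs, InvalidOid = free */
    } FastPathGroup;

    /* in PGPROC: one group per "piece", each OID hashes to exactly one group */
    FastPathGroup fpGroups[FP_LOCK_GROUPS_PER_BACKEND];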

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.


OK. I didn't realize global variables start out zeroed.


I haven't fixed this yet, but it's pretty clear the "init" flag is not really needed, because the existing code even got the memset() wrong:

memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseInitialized));

This only resets one byte of the array, yet it still worked correctly.
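
Presumably the intent was to clear the whole counts array, i.e. something like

    memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));

but as you say, with a zero-initialized global both the memset and the flag
can simply go away.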

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)


Yeah, definitely needs comment explaining this.

I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.
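
If we wanted something a little less ad hoc than two arbitrary primes, a
simple multiplicative (Fibonacci-style) hash would be about as cheap and
mixes the bits a bit better. A sketch, assuming the group count stays a
power of two (this is just one option, not what the patch currently does):

    /* map a relation OID to one of the fast-path groups */
    #define FAST_PATH_LOCK_REL_GROUP(rel) \
        ((((uint64) (rel) * UINT64CONST(0x9E3779B97F4A7C15)) >> 32) & \
         (FP_LOCK_GROUPS_PER_BACKEND - 1))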

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.


I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me of how
set-associative caches work (e.g. with cachelines in CPU caches). It's
essentially a 16-way set-associative cache: each entry maps to one group,
and may occupy any of the 16 slots within that group.

I must have been reading some papers in this area shortly before the PoC
patch, and the idea came from there, probably. Which is good, because it
means it's a well-understood and widely-used approach.
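
So the lookup stays the same linear search we do now, just restricted to the
16 slots of the group the OID hashes to - roughly (again, illustrative names
matching the sketch above):

    static int
    FastPathGroupSearch(PGPROC *proc, Oid relid)
    {
        uint32          group = FAST_PATH_LOCK_REL_GROUP(relid);
        FastPathGroup  *g = &proc->fpGroups[group];

        for (int i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
        {
            if (g->relId[i] == relid)
                return i;       /* slot index within the group */
        }
        return -1;              /* no slot - use the shared lock table */
    }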

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.


That's an excellent question. I don't know.

I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
even with a modest number of partitions. When I was thinking about using
a higher value, my main concern was that it'd make the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.
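
Just to spell out the arithmetic behind that ~4.5kB:

    1 group   = 16 slots * 4B (Oid) + 8B (uint64 bitmap) =   72B
    64 groups = 64 * 72B                                  = 4608B (~4.5kB)

so 1024 groups would be ~72kB per backend, and 65536 groups ~4.5MB.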

But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK the L3 is shared by all the threads
running on the CPU.

I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.


I finally got to do those experiments. The scripts and results (both raw and summarized) are too big to attach here, but everything is available at

    https://github.com/tvondra/scalability-tests

The initial patch used 64 groups (which means 1024 fast-path slots). I ran the tests with 0, 1, 8, 32, 128, 512 and 1024 groups (so up to 16k fast-path locks). I thought about testing with ~64k groups too, but I didn't go for that extreme value because I don't quite see the point.

It would only matter for cases with a truly extreme number of partitions (64k groups is ~1M fast-path slots), and just creating enough partitions would take a lot of time. Moreover, with that many partitions we seem to run into various other bottlenecks, so improving this alone does not make such setups really practical. And it's so slow that the benchmark results would be somewhat bogus anyway.

Because if we achieve 50 tps with 1000 partitions, does it really matter if a patch changes that to 25 or 100 tps? I doubt it, especially if going to 100 partitions gives you 2000 tps. Now imagine you have 10k or 100k partitions - how fast is that going to be?

So I think stopping at 1024 groups is sensible, and if there are some inefficiencies / costs, I'd expect those to gradually show up even at those lower sizes.

But if you look at the results, for example from the "join" test:

  https://github.com/tvondra/scalability-tests/blob/main/join.pdf

there's no such negative effect. The table shows results for different combinations of parameters, with the first group of columns being regular glibc and the second one glibc with tuning (see [1] for details). The values are for different numbers of fast-path groups (0 means the patch was not applied).

And the color scale shows the impact of increasing the number of groups. So for example when a column for "32 groups" says 150%, it means going from 8 to 32 groups improved throughput to 1.5x. As usual, green is "good" and red is "bad".

But if you look at the tables, there's very little change - most of the values are close to 100%. This might seem a bit strange, considering the promise of these patches is to improve throughput, and "no change" is an absence of that. But that's because the tables illustrate the effect of changing the group count with the other parameters fixed - they never compare runs with and without the glibc tuning, and that's an important part of the improvement. Doing the pivot table a bit differently would still show a substantial 2-3x improvement.

There's a fair amount of noise - especially for the rpi5 machines (not the right hw for sensitive benchmarks), but also on some i5/xeon runs. I attribute this to only doing one short run (10s) for each combination of parameters. I'll do more runs next time.

Anyway, I think these results show a couple things:

1) There's no systemic negative effect of increasing the number of groups. We could go with 32k or 64k groups, and it doesn't seem like there would be a problem.

2) But there's not much point in doing that, because we run into various other bottlenecks well before having that many locks. Judging by the results, it doesn't seem that going beyond 32 or 64 groups would give us much.

3) The memory allocation caching (be it the mempool patch, or the glibc tuning like in this test round) is a crucial piece for this. Not doing that means some tests get no improvement at all, or a much smaller one.

4) The increase of NUM_LOCK_PARTITIONS has very limited effect, or perhaps even no effect at all.


Based on this, my plan is to polish the patch adding fast-path groups, with either 32 or 64 groups, which seem to be reasonable values. Then in the future, if/when the other bottlenecks get addressed, we can rethink and increase this.

This however reminds me that all those machines are pretty small. Which is good for showing it doesn't break existing/smaller systems, but the initial goal of the patch was to improve behavior on big boxes. I don't have access to the original box at the moment, so if someone could provide access to one of those big epyc/xeon machines with 100+ cores for a couple of days, that would be helpful.


That being said, I think it's pretty clear how serious the issue with memory allocation overhead can be, especially in cases where the existing memory context caching is ineffective (like for the btree palloc). I'm not sure what to do about that. The mempool patch shared in this thread does the trick, but it's fairly complex/invasive. I still like it, but maybe doing something with the glibc tuning would be enough - it's not as effective, but 80% of the improvement is better than no improvement.
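
(To be clear, by "glibc tuning" I mean nudging malloc so it does not keep
returning memory to the kernel only to request it back right away - the exact
settings are in [1], but the general shape is either exporting the
MALLOC_TOP_PAD_ / MALLOC_TRIM_THRESHOLD_ environment variables before
starting the server, or calling mallopt() early on, e.g.:

    #include <malloc.h>

    /* illustrative values - keep ~64MB cached instead of trimming eagerly */
    mallopt(M_TOP_PAD, 64 * 1024 * 1024);
    mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);

with the usual caveat that these are glibc-specific knobs.)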



regards


[1] https://www.postgresql.org/message-id/0da51f67-c88b-497e-bb89-d5139309e...@enterprisedb.com

--
Tomas Vondra

