Hi,

On 2025-07-18 22:48:00 +0200, Tomas Vondra wrote:
> On 7/18/25 18:46, Andres Freund wrote:
> >> For a read-write pgbench I however saw some strange drops/increases of
> >> throughput. I suspect this might be due to some thinko in the clocksweep
> >> partitioning, but I'll need to take a closer look.
> > 
> > Was that with pinning etc enabled or not?
> > 
> 
> IIRC it was with everything enabled, except for numa_procs_pin (which
> pins backends to NUMA nodes). I found that to actually harm performance
> in some of the tests (even just read-only ones), resulting in uneven
> usage of cores and lower throughput.

FWIW, I really doubt that something like numa_procs_pin is viable outside of
very narrow niches until we have a *lot* more infrastructure in place. PG
would need to be threaded, we'd need a separation between threads and
connections, and we'd need an executor that allows switching from working on
one query to working on another.


> > The hardest thing probably is to make the logic for when to check foreign
> > clock sweeps cheap enough.
> > 
> > One way would be to do it whenever a sweep wraps around, that'd probably
> > amortize the cost sufficiently, and I don't think it'd be too imprecise, as
> > we'd have processed that set of buffers in a row without partitioning as
> > well. But it'd probably be too coarse when determining for how long to use a
> > foreign sweep instance. But we probably could address that by rechecking the
> > balance more frequently when using a foreign partition.
> > 
> 
> What do you mean by "it"?

it := Considering switching back from using a "foreign" clock sweep instance
whenever the sweep wraps around.


> What would happen after a sweep wraps around?

The scenario I'm worried about is this:

1) a bunch of backends read buffers on numa node A, using the local clock
   sweep instance

2) due to all of that activity, the clock sweep advances much faster than the
   clock sweep for numa node B

3) the clock sweep on A wraps around, we discover the imbalance, and all the
   backends switch to scanning on numa node B, moving that clock sweep ahead
   much more aggressively

4) clock sweep on B wraps around, there's imbalance the other way round now,
   so they all switch back to A



> > Another way would be to have bgwriter manage this. Whenever it detects that
> > one ring is too far ahead, it could set a "avoid this partition" bit, which
> > would trigger backends that natively use that partition to switch to foreign
> > partitions that don't currently have that bit set.  I suspect there's a
> > problem with that approach though; I worry that the amount of time that
> > bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
> > much imbalance.
> > 
> 
> I'm afraid having hard "avoid" flags would lead to sudden and unexpected
> changes in performance as we enable/disable partitions. I think a good
> solution should "smooth it out" somehow, e.g. by not having a true/false
> flag, but having some sort of "preference" factor with values between
> (0.0, 1.0) which says how much we should use that partition.

Yea, I think that's a fair worry.


> I was imagining something like this:
> 
> Say we know the number of buffers allocated for each partition (in the
> last round), and we (or rather the BgBufferSync) calculate:
> 
>     coefficient = 1.0 - (nallocated_partition / nallocated)
> 
> and then use that to "correct" which partition to allocate buffers from.
> Or maybe just watch how far from the "fair share" we were in the last
> interval, and gradually increase/decrease the "partition preference"
> which would say how often we need to "steal" from other partitions.
> 
> E.g. we find nallocated_partition is 2x the fair share, i.e.
> 
>    nallocated_partition / (nallocated / nparts) = 2.0
> 
> Then we say 25% of the time look at some other partition, to "cut" the
> imbalance in half. And then repeat that in the next cycle, etc.
> 
> So a process would look at its "home partition" by default, but it'd
> "roll a die" first, and if above the calculated probability it'd pick
> some other partition instead (this would need to be done so that it gets
> balanced overall).

That does sound reasonable.


> If the bgwriter interval is too long, maybe the recalculation could be
> triggered regularly after any of the clocksweeps wraps around, or after
> some number of allocations, or something like that.

I'm pretty sure bgwriter doesn't run often enough, or predictably enough,
for that.

Greetings,

Andres Freund
