Andres Freund wrote on 2021-08-05 23:12:
Hi,

On 2021-08-05 12:27:49 -0400, John Naylor wrote:
On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <and...@anarazel.de> wrote:
> On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> > typedef struct cc_bucket
> > {
> >   uint32 hashes[4];
> >   catctup *ct[4];
> >   dlist_head chain;
> > } cc_bucket;
>
> I'm not convinced that the above is the right idea though. Even if the
> hash matches, you're still going to need to fetch at least
> catctup->keys[0] from a separate cacheline to be able to return the
> cache entry.

I see your point. It doesn't make sense to inline only part of the
information needed.

At least not for the unconditionally needed information.


Although I'm guessing inlining just two values in the 4-key case wouldn't
buy much.

Not so sure about that. I'd guess that two key comparisons take more cycles
than a cacheline fetch of the further keys (perhaps not if we had inlined
key comparisons). I.e. I'd expect out-of-order + speculative execution to
hide the latency of fetching the second cacheline for later key values.


> If we stuffed four values into one bucket we could potentially SIMD the
> hash and Datum comparisons ;)

;-) That's an interesting future direction to consider when we support
building with x86-64-v2. It'd be pretty easy to compare a vector of hashes
and quickly get the array index for the key comparisons (ignoring for the
moment how to handle the rare case of multiple identical hashes).
However, we currently don't memcmp() the Datums and instead call an
"eqfast" function, so I don't see how that part would work in a vector
setting.

It definitely couldn't work unconditionally - we have to deal with text and
oidvector comparisons, after all. But we could use it for the other types.
However, looking at the syscaches, I think it'd not very often be
applicable for caches with enough columns.
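
For concreteness, here is a minimal sketch of the hash-vector probe
described above, written with SSE2 intrinsics (already part of baseline
x86-64); the function name and the four-wide bucket layout are
illustrative only:

#include <emmintrin.h>          /* SSE2 intrinsics */
#include <stdint.h>

typedef uint32_t uint32;

/* Return the index of the first bucket slot whose stored hash equals
 * the probe hash, or -1 if none matches. */
static inline int
bucket_find_hash(const uint32 hashes[4], uint32 hash)
{
    __m128i probe  = _mm_set1_epi32((int) hash);
    __m128i stored = _mm_loadu_si128((const __m128i *) hashes);
    int     mask   = _mm_movemask_epi8(_mm_cmpeq_epi32(stored, probe));

    if (mask == 0)
        return -1;
    /* each matching 32-bit lane sets four bits in the byte-level mask */
    return __builtin_ctz(mask) / 4;
}

As noted, only the hash probe vectorizes this easily; the key comparisons
themselves would still go through the per-type "eqfast" functions.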


I have wondered before whether we should have syscache definitions generate
code specific to each syscache definition. I do think that'd give a good
bit of a performance boost. But I don't see a trivial way to get there
without notational overhead.

We could define syscaches in a header using a macro that's defined
differently in syscache.c than everywhere else. The header would declare a
set of functions for each syscache; syscache.c would define them to call an
always_inline function with the relevant constants.
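
A self-contained sketch of that arrangement, with hypothetical names
(SYSCACHE, SearchSysCacheAmOid; the cache ids are invented) and a fixed
two-key signature for brevity:

#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;
typedef struct HeapTupleData *HeapTuple;

/* The worker: every generated wrapper passes compile-time constants,
 * so the compiler can specialize the body per syscache. */
static inline __attribute__((always_inline)) HeapTuple
SearchCatCacheInternal(int cacheId, int nkeys, Datum key1, Datum key2)
{
    (void) cacheId; (void) nkeys; (void) key1; (void) key2;
    return NULL;                /* the actual lookup is elided */
}

/* In the header, SYSCACHE() expands to a declaration everywhere except
 * syscache.c, where it expands to a definition calling the worker. */
#ifdef SYSCACHE_C
#define SYSCACHE(Name, id, nkeys) \
    HeapTuple SearchSysCache##Name(Datum key1, Datum key2) \
    { return SearchCatCacheInternal((id), (nkeys), key1, key2); }
#else
#define SYSCACHE(Name, id, nkeys) \
    extern HeapTuple SearchSysCache##Name(Datum key1, Datum key2);
#endif

SYSCACHE(AmOid, 3, 1)
SYSCACHE(TypeNameNsp, 61, 2)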

Or perhaps we should move syscache definitions into the pg_*.h headers, and
generate the relevant code as part of their processing. That seems like it
could be nice from a modularity POV alone. And it could do better than the
current approach, because we could hardcode the types of the columns in the
syscache definition without increasing notational overhead.

Why not use simplehash or something like that? Open-addressing schemes
show superior cache locality.

It could be combined: use hashValue as the key in simplehash, and a dlist
for handling hashValue collisions. 99.99% of the time there will be no
collision on hashValue itself, so a lookup will almost always take two or
three memory accesses: one or two to find the dlist_head in the simplehash
by hashValue, and then one more to fetch the first element of the dlist.

And the code would remain almost the same; only the "bucket" search would
change a bit.
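
For illustration, roughly what that could look like with the existing
simplehash.h template; the cc_slot name and layout are hypothetical:

/* one slot per distinct hashValue; colliding entries hang off the dlist */
typedef struct cc_slot
{
    uint32      hashValue;      /* the catcache hash, used as the key */
    char        status;         /* per-slot state required by simplehash */
    dlist_head  entries;        /* all catctups sharing this hashValue */
} cc_slot;

#define SH_PREFIX cc
#define SH_ELEMENT_TYPE cc_slot
#define SH_KEY_TYPE uint32
#define SH_KEY hashValue
#define SH_HASH_KEY(tb, key) (key)      /* hashValue is already a hash */
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

With 8-byte pointers such a slot is 24 bytes (4 for the hash, 1 for the
status byte plus padding, 16 for the dlist_head), which is where the
24-byte figure below comes from. One caveat: simplehash relocates entries
when it grows, so the dlist back-links into a slot would need care.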

(And I'd recommend using a lower fill factor for this simplehash: at most
0.85.)

Well, a simplehash entry will be 24 bytes this way. If the simplehash
template supported external key/element storage, the entry could be shrunk
to 16 bytes, and syscache entries would not need a dlist_node. (But it
doesn't support that at the moment.)

And a custom open-addressing table could get down to 8 bytes per element:
- an element is a pointer to the entry,
- a missing entry is encoded as NULL,
- with a fill factor as low as 0.66 there will be few collisions, so a
  non-empty slot will usually be the matching entry, and the memory lookup
  for the key comparison will be amortized free (see the sketch below).
Note that an 8-byte entry at fill factor 0.66 consumes an amortized
12.12 bytes (8 / 0.66), while a 16-byte entry at fill factor 0.99 consumes
an amortized 16.16 bytes (16 / 0.99).
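
A minimal sketch of that probe loop, assuming linear probing and a
power-of-two table; catctup here is pared down to just the stored hash:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t uint32;

typedef struct catctup
{
    uint32      hash;       /* saved hashValue; real entries carry more */
} catctup;

/* slots: array of (mask + 1) catctup pointers, NULL meaning empty */
static catctup *
cc_lookup(catctup **slots, uint32 mask, uint32 hash)
{
    for (uint32 i = hash & mask;; i = (i + 1) & mask)
    {
        catctup    *ct = slots[i];

        if (ct == NULL)
            return NULL;    /* reached an empty slot: not present */
        if (ct->hash == hash)
            return ct;      /* at 0.66 fill this is usually the first
                             * probe, so dereferencing the entry costs
                             * what the caller's key check needs anyway */
    }
}

(A full implementation would let the caller resume the probe if the key
comparison fails, and would handle deletion tombstones.)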

regards,
Yura Sokolov

y.soko...@postgrespro.ru
funny.fal...@gmail.com

