Hi,

I thought it might be interesting to revive this thread, because the
improvements I saw from Thomas's work, and even simple prefetching of
bucket headers during the in-memory probe phase (to isolate the effect
of prefetching), still look substantial. Here are some results for
prefetching in the probe phase only, on Thomas's last benchmark query
(an in-memory self join):
*Task clock*: -25.6%
*Page faults*: -21.46%
*Cycles*: -17.39%
*L1 dcache loads*: -13.78%
*L1 dcache load misses*: -30.1%
*LLC loads*: -36.7%
*LLC load misses*: -55.1%
*dTLB loads*: -13.77%
*dTLB misses*: +0.5%
*Cache references*: -9.5%
*Cache misses*: -7.9%
*IPC*: -6.4%

So I thought it might be worth taking another look at this, even if we
avoid the major architectural changes in the hash join executor that the
more advanced techniques require. It will take a lot of performance
benchmarking to prove the improvements, but I think it is doable to
prove (or disprove) what we can gain with minimal architectural changes.

Also, about the Linux experience: it concerned prefetching during list
traversal (pointer chasing; see the Linux thread
<https://lwn.net/Articles/444346/>). On Intel, prefetch(NULL) was harmful
when traversing short lists, where the end of the list is reached very
often (as in chained hash tables). The same effect is noticeable in
Postgres: if we try to prefetch during the intra-bucket scan, performance
is about the same or even worse.

Thoughts?
