Re: [PR] Fast path for joins with distinct values in build side [datafusion]

via GitHub Sat, 24 May 2025 03:05:56 -0700


Dandandan commented on PR #16153:
URL: https://github.com/apache/datafusion/pull/16153#issuecomment-2906697087


   > > This optimization is neat and already covers the common case of joins on primary keys. I think we can further optimize the join hash table - even for cases where _some_ keys might have chains. Instead of looking for a 0 value in the `next` vector, we can encode whether there is a next value in the top bit of the current slot - thus saving a lookup in the `next` vector on every probe that has at least a single match.
   > > I don't know how well this plays with the streaming join hash map though =)
   > 
   > That sounds like a neat thing to try! Another (smaller) optimization I can think of is to encode the hashmap and next list with `u32` indices / offsets if possible (so it fits more easily in CPU cache by halving the data).
   
   Filed https://github.com/apache/datafusion/issues/16179
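
For reference, a rough sketch of the two ideas quoted above (top-bit "has next" flag plus `u32` slots). This is not DataFusion's actual join hash map: `ChainedTable`, `heads`, `HAS_NEXT` and `INDEX_MASK` are made-up names, and a plain `std::collections::HashMap` keyed on the join key stands in for the real hash table. It assumes build-side row indices fit in 31 bits.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Top bit of a slot: set when another row with the same key follows in `next`.
const HAS_NEXT: u32 = 1 << 31;
/// Low 31 bits of a slot: the build-side row index.
const INDEX_MASK: u32 = HAS_NEXT - 1;

/// Hypothetical chained join table (illustration only).
struct ChainedTable {
    /// key -> tagged slot for the most recently inserted row with that key
    heads: HashMap<u64, u32>,
    /// next[row] = tagged slot of the previous row with the same key.
    /// Only read when the slot that led to `row` had HAS_NEXT set.
    next: Vec<u32>,
}

impl ChainedTable {
    fn build(keys: &[u64]) -> Self {
        assert!(keys.len() <= INDEX_MASK as usize, "row index must fit in 31 bits");
        let mut heads = HashMap::with_capacity(keys.len());
        let mut next = vec![0u32; keys.len()];
        for (row, &key) in keys.iter().enumerate() {
            let row = row as u32;
            match heads.entry(key) {
                Entry::Occupied(mut e) => {
                    // Chain to the old head (already tagged) and flag the new head.
                    next[row as usize] = *e.get();
                    e.insert(row | HAS_NEXT);
                }
                Entry::Vacant(e) => {
                    // First row for this key: no flag, so probes never touch `next`.
                    e.insert(row);
                }
            }
        }
        Self { heads, next }
    }

    /// Returns the matching build-side rows and how many `next` reads were needed.
    fn probe(&self, key: u64) -> (Vec<u32>, usize) {
        let mut rows = Vec::new();
        let mut next_reads = 0;
        let Some(&head) = self.heads.get(&key) else {
            return (rows, next_reads);
        };
        let mut slot = head;
        loop {
            let row = slot & INDEX_MASK;
            rows.push(row);
            if slot & HAS_NEXT == 0 {
                break; // end of chain known from the flag, no sentinel lookup
            }
            slot = self.next[row as usize];
            next_reads += 1;
        }
        (rows, next_reads)
    }
}
```

With a 0 sentinel in `next`, a probe that hits a unique key still has to read `next` once to learn the chain is over; here that read is skipped, and a chain of n rows costs n - 1 reads of `next` instead of n. The price is one bit of index range per slot.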


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

