Re: [PR] Refactor hash join dynamic filtering for progressive bounds application [datafusion]

via GitHub Wed, 17 Sep 2025 22:37:09 -0700


adriangb commented on PR #17632:
URL: https://github.com/apache/datafusion/pull/17632#issuecomment-3305497988


   > > > It shouldn't be too bad: without filters you'd still have to run the hash function once in RepartitionExec and another hash function in HashJoinExec. So we're running 2 times instead of 3 _for rows that match the filter_. For rows that are pruned we run it 1 time instead of 2. And that's only until all of the build sides are done, then we may run it 0 times.
   >
   > > The `hash(...) % n != partition_id` portion of the filter gets added for each build partition, right? If that's the case then in the worst case we're running it up to `N` times just for the dynamic filter?
   >
   > Is `N` the number of partitions? Yeah that's a good point. That could be a problem. It seems obvious that there should be a way to avoid re-evaluating an expression within a single `evaluate` call by building an evaluation DAG instead of a tree but you're right that doesn't currently exist.
   
   Simple solution: use a single `CASE hash(col) % n_part WHEN 0 THEN ... WHEN 1 THEN ... ELSE true END` expression. Then the hash is evaluated only once per `evaluate` call instead of once per build partition. I'll push this tomorrow.
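
A rough sketch of why the `CASE` form helps, in plain Python rather than DataFusion code. Everything here is an assumption made for illustration: the names `CountingHash`, `filter_per_partition` and `filter_case` are hypothetical, the hash is a stand-in, and each per-partition term is assumed to have the shape `hash(col) % n != partition_id OR col BETWEEN lo AND hi`. A `None` entry stands for a build partition that has not reported bounds yet.

```python
from typing import Optional, Sequence, Tuple

Bounds = Optional[Tuple[int, int]]  # (lo, hi), or None if that build side is not done


class CountingHash:
    """Stand-in hash function that records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, value: int) -> int:
        self.calls += 1
        # Arbitrary multiplicative hash; NOT the hash DataFusion uses.
        return (value * 2654435761) & 0xFFFFFFFF


def filter_per_partition(value: int, bounds: Sequence[Bounds], h: CountingHash) -> bool:
    """One term per completed build partition, ANDed together.

    Every term re-evaluates the hash, mimicking an expression tree
    with no sharing of common subexpressions.
    """
    n = len(bounds)
    terms = [
        (h(value) % n != pid) or (b[0] <= value <= b[1])
        for pid, b in enumerate(bounds)
        if b is not None
    ]
    return all(terms)


def filter_case(value: int, bounds: Sequence[Bounds], h: CountingHash) -> bool:
    """CASE-style form: hash once, then pick the branch for that partition."""
    n = len(bounds)
    pid = h(value) % n          # the hash is evaluated exactly once
    b = bounds[pid]
    if b is None:               # plays the role of `ELSE true`
        return True
    return b[0] <= value <= b[1]
```

Under these assumptions both functions keep or prune the same rows; the first calls the hash once per completed build partition for each row, while the second calls it once per row regardless of `N`.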


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

