Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-08-05 Thread via GitHub
xudong963 commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3155072658 I'll have a look tomorrow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specif

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-08-01 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2248281203 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -734,6 +734,366 @@ async fn test_topk_dynamic_filter_pushdown() { ); } +#[tokio:

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-08-01 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2248281203 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -734,6 +734,366 @@ async fn test_topk_dynamic_filter_pushdown() { ); } +#[tokio:

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-08-01 Thread via GitHub
zhuqi-lucas commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2247959415 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -734,6 +734,366 @@ async fn test_topk_dynamic_filter_pushdown() { ); } +#[tok

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
alamb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141417070 Thanks @adriangb -- I will put this on my list of PRs to review more carefully in the next few days -- This is an automated message from the Apache Git Service. To respond to the me

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141365277 Personally I would like to move forward with this and then make another PR to push down a reference to the entire hash table. -- This is an automated message from the Apache Git S

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141364005 Here it is in action: ```sql COPY ( with data as ( select unnest(generate_series(1, )) as id ) select id, (id

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141237953 Guess it just doesn't make a difference then. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the UR

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
alamb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141204546 It did run > Benchmark tpch_mem_sf1.json I can run whatever benchmark you want - just let me know -- This is an automated message from the Apache Git Service. To re

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141186574 Might be worth running TPCH or another benchmark with joins? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
alamb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141175283 🤖: Benchmark completed Details ``` Comparing HEAD and hash-join-pushdown Benchmark clickbench_extended.json

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2246192857 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -1039,12 +1196,50 @@ async fn collect_left_input( let data = JoinLeftData::new( hashmap,

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
alamb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141078662 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3140948881 @alamb could I ask you to kick off some benchmarks? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3140961969 I think I've addressed all of the feedback and rebased on main / changes broken out into other PRs. @Dandandan @xudong963 I've tagged you both for review. @alamb would you m

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2246082774 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2246051561 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2245591958 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-31 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2245195106 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -1039,12 +1196,50 @@ async fn collect_left_input( let data = JoinLeftData::new( hashmap,

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-04 Thread via GitHub
LiaCastaneda commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2184901077 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -1039,12 +1196,50 @@ async fn collect_left_input( let data = JoinLeftData::new( hashma

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-01 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3026127559 Btw here's an article that explains how DuckDB does join filter pushdown. It sounds like they _only_ push down min/max filters: https://duckdb.org/2024/09/09/announcing-duckdb-110.h

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-07-01 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3025944992 I've pulled out part of this PR, the part about pushing filters down through HashJoinExec plus some new changes to the filter pushdown APIs into https://github.com/apache/datafusion

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3006600650 @alamb I'd be interested to see what benchmarks say if you don't mind kicking them off? -- This is an automated message from the Apache Git Service. To respond to the message, ple

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2164715273 ## datafusion/physical-plan/src/filter_pushdown.rs: ## @@ -353,6 +353,18 @@ impl FilterDescription { } } +pub fn with_child_pushdown( Review

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-25 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3004662324 @Dandandan any chance you'd be willing to contribute your implementation of sharing `Arc` so we use something we know is working / I don't have to re-invent the wheel? I think you c

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-24 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3001813655 I'll add that :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-24 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3001863494 > I was originally planning on keeping this PR smaller but it's been growing so I might as well add the Arc :) Feel free to PR it however you like ;) -- This is an automa

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-24 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3001717244 To share some experience, we recently added some similar pushdown for HashJoinExec (at Coralogix) using sharing of `Arc` / comparing column hashes and it is seems so far very effec

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-24 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2164716153 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -666,10 +679,25 @@ impl DisplayAs for HashJoinExec { .map(|(c1, c2)| format!("({c1}

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-24 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2164714496 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -433,6 +433,117 @@ async fn test_topk_dynamic_filter_pushdown() { ); } +#[tokio:

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-24 Thread via GitHub
xudong963 commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2162984236 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -433,6 +433,117 @@ async fn test_topk_dynamic_filter_pushdown() { ); } +#[tokio

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-23 Thread via GitHub
xudong963 commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2162979272 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -433,6 +433,117 @@ async fn test_topk_dynamic_filter_pushdown() { ); } +#[tokio

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-20 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2991563630 > It's hard to say generally, but a hashtable lookup which fits into cache on a `u64` key can be really fast. I guess only benchmarks can tell. But I still think the scalar bo

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-20 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2991531016 > > I think it makes sense to only filter on the shared hashmap and not bothering with the min/max values - creating hashes and doing a single table lookup is quite fast, so I thin

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2989099023 > I think it also makes sense to also thing about a heuristic we want to use to use this pushdown only when we think it might be useful - e.g. the left side is much smaller than the

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2985988638 > I think it makes sense to only filter on the shared hashmap and not bothering with the min/max values - creating hashes and doing a single table lookup is quite fast, so I think w

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2985881381 > > I think doing only the lookup is preferable above also computing / checking the bounds, I think the latter might create more overhead > > My thought was that for some cas

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155602331 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection,

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155025406 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155006506 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2154970369 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984788592 > I think doing only the lookup is preferable above also computing / checking the bounds, I think the latter might create more overhead My thought was that for some cases the

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984782413 Sorry, misclicked a button. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spec

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984776797 I tink we should also consider a heuristic for not evaluating the filter if it's not useful. Also I think doing only the lookup is preferable above also computing / checking

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan closed pull request #16445: Add dynamic filter (bounds) pushdown to HashJoinExec URL: https://github.com/apache/datafusion/pull/16445 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2154953337 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection,

[PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb opened a new pull request, #16445: URL: https://github.com/apache/datafusion/pull/16445 Part of #7955. My goal here is to lay the groundwork for pushing down joins. I am only implementing bounds pushdown because I am sure that is cheap and it will probably be quite effecti