xudong963 commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3155072658
I'll have a look tomorrow
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specif
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2248281203
##
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##
@@ -734,6 +734,366 @@ async fn test_topk_dynamic_filter_pushdown() {
);
}
+#[tokio:
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2248281203
##
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##
@@ -734,6 +734,366 @@ async fn test_topk_dynamic_filter_pushdown() {
);
}
+#[tokio:
zhuqi-lucas commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2247959415
##
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##
@@ -734,6 +734,366 @@ async fn test_topk_dynamic_filter_pushdown() {
);
}
+#[tok
alamb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141417070
Thanks @adriangb -- I will put this on my list of PRs to review more
carefully in the next few days
--
This is an automated message from the Apache Git Service.
To respond to the me
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141365277
Personally I would like to move forward with this and then make another PR
to push down a reference to the entire hash table.
--
This is an automated message from the Apache Git S
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141364005
Here it is in action:
```sql
COPY (
with data as (
select unnest(generate_series(1, )) as id
)
select
id,
(id
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141237953
Guess it just doesn't make a difference then. Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
UR
alamb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141204546
It did run
> Benchmark tpch_mem_sf1.json
I can run whatever benchmark you want - just let me know
--
This is an automated message from the Apache Git Service.
To re
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141186574
Might be worth running TPCH or another benchmark with joins?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
alamb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141175283
🤖: Benchmark completed
Details
```
Comparing HEAD and hash-join-pushdown
Benchmark clickbench_extended.json
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2246192857
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -1039,12 +1196,50 @@ async fn collect_left_input(
let data = JoinLeftData::new(
hashmap,
alamb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3141078662
🤖 `./gh_compare_branch.sh` [Benchmark
Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh)
Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3140948881
@alamb could I ask you to kick off some benchmarks?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3140961969
I think I've addressed all of the feedback and rebased on main / changes
broken out into other PRs. @Dandandan @xudong963 I've tagged you both for
review.
@alamb would you m
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2246082774
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec {
try_embed_projection(projection, s
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2246051561
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec {
try_embed_projection(projection, s
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2245591958
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec {
try_embed_projection(projection, s
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2245195106
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -1039,12 +1196,50 @@ async fn collect_left_input(
let data = JoinLeftData::new(
hashmap,
LiaCastaneda commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2184901077
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -1039,12 +1196,50 @@ async fn collect_left_input(
let data = JoinLeftData::new(
hashma
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3026127559
Btw here's an article that explains how DuckDB does join filter pushdown. It
sounds like they _only_ push down min/max filters:
https://duckdb.org/2024/09/09/announcing-duckdb-110.h
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3025944992
I've pulled out part of this PR, the part about pushing filters down through
HashJoinExec plus some new changes to the filter pushdown APIs into
https://github.com/apache/datafusion
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3006600650
@alamb I'd be interested to see what benchmarks say if you don't mind
kicking them off?
--
This is an automated message from the Apache Git Service.
To respond to the message, ple
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2164715273
##
datafusion/physical-plan/src/filter_pushdown.rs:
##
@@ -353,6 +353,18 @@ impl FilterDescription {
}
}
+pub fn with_child_pushdown(
Review
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3004662324
@Dandandan any chance you'd be willing to contribute your implementation of
sharing `Arc` so we use something we know is working / I don't have
to re-invent the wheel? I think you c
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3001813655
I'll add that :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment
Dandandan commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3001863494
> I was originally planning on keeping this PR smaller but it's been growing
so I might as well add the Arc :)
Feel free to PR it however you like ;)
--
This is an automa
Dandandan commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3001717244
To share some experience, we recently added some similar pushdown for
HashJoinExec (at Coralogix) using sharing of `Arc` / comparing
column hashes and it is seems so far very effec
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2164716153
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -666,10 +679,25 @@ impl DisplayAs for HashJoinExec {
.map(|(c1, c2)| format!("({c1}
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2164714496
##
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##
@@ -433,6 +433,117 @@ async fn test_topk_dynamic_filter_pushdown() {
);
}
+#[tokio:
xudong963 commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2162984236
##
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##
@@ -433,6 +433,117 @@ async fn test_topk_dynamic_filter_pushdown() {
);
}
+#[tokio
xudong963 commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2162979272
##
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs:
##
@@ -433,6 +433,117 @@ async fn test_topk_dynamic_filter_pushdown() {
);
}
+#[tokio
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2991563630
> It's hard to say generally, but a hashtable lookup which fits into cache
on a `u64` key can be really fast.
I guess only benchmarks can tell. But I still think the scalar bo
Dandandan commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2991531016
> > I think it makes sense to only filter on the shared hashmap and not
bothering with the min/max values - creating hashes and doing a single table
lookup is quite fast, so I thin
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2989099023
> I think it also makes sense to also thing about a heuristic we want to use
to use this pushdown only when we think it might be useful - e.g. the left side
is much smaller than the
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2985988638
> I think it makes sense to only filter on the shared hashmap and not
bothering with the min/max values - creating hashes and doing a single table
lookup is quite fast, so I think w
Dandandan commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2985881381
> > I think doing only the lookup is preferable above also computing /
checking the bounds, I think the latter might create more overhead
>
> My thought was that for some cas
Dandandan commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155602331
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec {
try_embed_projection(projection,
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155025406
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec {
try_embed_projection(projection, s
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155006506
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec {
try_embed_projection(projection, s
adriangb commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2154970369
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec {
try_embed_projection(projection, s
adriangb commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984788592
> I think doing only the lookup is preferable above also computing /
checking the bounds, I think the latter might create more overhead
My thought was that for some cases the
Dandandan commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984782413
Sorry, misclicked a button.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the spec
Dandandan commented on PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984776797
I tink we should also consider a heuristic for not evaluating the filter if
it's not useful.
Also I think doing only the lookup is preferable above also computing /
checking
Dandandan closed pull request #16445: Add dynamic filter (bounds) pushdown to
HashJoinExec
URL: https://github.com/apache/datafusion/pull/16445
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the sp
Dandandan commented on code in PR #16445:
URL: https://github.com/apache/datafusion/pull/16445#discussion_r2154953337
##
datafusion/physical-plan/src/joins/hash_join.rs:
##
@@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec {
try_embed_projection(projection,
adriangb opened a new pull request, #16445:
URL: https://github.com/apache/datafusion/pull/16445
Part of #7955.
My goal here is to lay the groundwork for pushing down joins.
I am only implementing bounds pushdown because I am sure that is cheap and
it will probably be quite effecti
47 matches
Mail list logo