github-actions[bot] closed pull request #15981: Optimize hash partitioning for
cache friendliness
URL: https://github.com/apache/datafusion/pull/15981
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to
github-actions[bot] commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-3071664035
Thank you for your contribution. Unfortunately, this pull request is stale
because it has been open 60 days with no activity. Please remove the stale
label or comment or
alamb commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2867285850
> I think something like that is done already in the "convert to state"
logic - it will dynamically decide to skip aggregating once it sees that the
group vs input rows ratio is small.
Dandandan commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2863265862
I think something like that is done already in the "convert to state" logic
- it will dynamically decide to skip aggregating once it sees that the group vs
input rows ratio is smal
ctsk commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2863005079
Since partition does not appear to be a limiting factor in aggregations, I
wonder if it makes sense to investigate a lower-quality pre-aggregation (i.e.
let more tuples pass to the fina
alamb commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2860307673
🤖: Benchmark completed
Details
```
Comparing HEAD and experiment_repartition-optimization
Benchmark clickbench_extended.json
--
```
alamb commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2860166649
🤖 `./gh_compare_branch.sh` [Benchmark
Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh)
Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun
Dandandan commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2859776487
Nice, that seems like a great result!
I think the main improvement after this would be using the
`take_in` API you proposed in arrow-rs (mainly to avoid `concat`)
ctsk commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2859359952
I've run clickbench_partitioned and tpch_mem10 on a machine with 16 cores.
The clickbench results are pretty much the same, tpch_mem10 ran significantly
faster.
data
Dandandan commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2859289524
nice, could you share some perf numbers of this approach?
ctsk commented on PR #15981:
URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2859158487
Another tried-and-true strategy for this kind of problem is to partition in
multiple stages: Instead of having a "wide" fanout partitioning to, for
instance 256 partitions, it is prefer
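
The multi-stage idea in the comment above can be illustrated with a small sketch. The comment is cut off, so this is not the PR's code and not necessarily what the author went on to recommend: the 16 x 16 split, the use of raw `u64` hashes, and the function names `partition_by_bits` and `two_stage_partition` are all assumptions made for illustration. Each pass writes to only 16 output buffers at a time, yet the two passes together produce the same 256 partitions as a single wide pass.

```rust
/// Hypothetical helper: scatter `hashes` into `fanout` buckets, choosing the
/// bucket from the hash bits starting at bit `shift`.
/// `fanout` must be a power of two so the bucket index is a simple mask.
fn partition_by_bits(hashes: &[u64], shift: u32, fanout: usize) -> Vec<Vec<u64>> {
    assert!(fanout.is_power_of_two());
    let mask = (fanout as u64) - 1;
    let mut buckets = vec![Vec::new(); fanout];
    for &h in hashes {
        buckets[((h >> shift) & mask) as usize].push(h);
    }
    buckets
}

/// Hypothetical two-stage partitioning into 256 partitions:
/// stage 1 splits 16 ways on bits 4..8, stage 2 splits each coarse
/// bucket 16 ways on bits 0..4. The final index is `hash & 255`.
fn two_stage_partition(hashes: &[u64]) -> Vec<Vec<u64>> {
    const STAGE_FANOUT: usize = 16; // assumed per-stage fanout
    let mut out = Vec::with_capacity(STAGE_FANOUT * STAGE_FANOUT);
    for coarse in partition_by_bits(hashes, 4, STAGE_FANOUT) {
        // Each coarse bucket is refined independently with a narrow fanout.
        out.extend(partition_by_bits(&coarse, 0, STAGE_FANOUT));
    }
    out
}
```

Calling `two_stage_partition(&hashes)` returns 256 buckets in the same order, and with the same contents, as `partition_by_bits(&hashes, 0, 256)`. The trade-off is an extra pass over the data in exchange for fewer simultaneously active write targets per pass.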