Re: [I] Optimize the join operators [datafusion]

2025-07-17 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3086584860 > > > > Updated: our benchmark uses the DataFusion internal source instead of datafusion-python; I am not sure if it will make a difference. > > > > > >

Re: [I] Optimize the join operators [datafusion]

2025-07-17 Thread via GitHub
UBarney commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3084255495 > > > Updated: our benchmark uses the DataFusion internal source instead of datafusion-python; I am not sure if it will make a difference. > > > > > > Th

Re: [I] Optimize the join operators [datafusion]

2025-07-17 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3084151418 > > Updated: our benchmark uses the DataFusion internal source instead of datafusion-python; I am not sure if it will make a difference. > > The results a

Re: [I] Optimize the join operators [datafusion]

2025-07-17 Thread via GitHub
UBarney commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3084107356 > Updated: our benchmark uses the DataFusion internal source instead of datafusion-python; I am not sure if it will make a difference. The results are similar

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082712617 Updated Parquet result from my local machine using the 1e8 dataset; it is even faster: ```rust ./bench.sh run h2o_medium_join_parquet *** DataFusi

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082400648 > > > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: > > > > > > > > > [@MrPowers](https://github.com/

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
UBarney commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082362965 > > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: > > > > > > [@MrPowers](https://github.com/MrPowers) I

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082347855 @mrpowers-wb I submitted the PR for the h2o benchmark to support the Parquet format in DataFusion, but it is blocked by the falsa join dataset generation; details: https://github.com/a

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082269096 > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: > > [@MrPowers](https://github.com/MrPowers) I am using t

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
UBarney commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082082344 > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: @MrPowers I am using the **1e8** dataset. ``` target

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
MrPowers commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3080025409 @UBarney - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: ![Image](https://github.com/user-attachments/assets/f96b5301-1986-4824-a715-3e2a53895ca8)

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
UBarney commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3078931663 > Thanks [@nuno-faria](https://github.com/nuno-faria) that's a great insight (for TPC-H / very nested joins we probably should implement a smarter join order algorithm). >

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3078275180 > [@zhuqi-lucas](https://github.com/zhuqi-lucas) - these benchmarks use Parquet files, see the querybench repo for the code: https://github.com/MrPowers/querybench. I think

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
mrpowers-wb commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3078052376 @zhuqi-lucas - these benchmarks use Parquet files, see the querybench repo for the code: https://github.com/MrPowers/querybench. I think Parquet is a lot better for these b

Re: [I] Optimize the join operators [datafusion]

2025-07-15 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3074225566 > DataFusion is underperforming the Polars streaming engine on some localhost join queries (1e8 rows of data on a Macbook M3 with 16GB of RAM): > > https://private-use

Re: [I] Optimize the join operators [datafusion]

2025-07-15 Thread via GitHub
Dandandan commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3073881761 Thanks @nuno-faria that's a great insight (for TPC-H / very nested joins we probably should implement a smarter join order algorithm). For h2o joins however, it seems it

Re: [I] Optimize the join operators [datafusion]

2025-07-15 Thread via GitHub
nuno-faria commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3073363029 I have also been looking at join performance and I think the main limitation is the order, followed by the lack of join parameterization. In TPC-H, 6 queries use a bad join

Re: [I] Optimize the join operators [datafusion]

2025-07-11 Thread via GitHub
Dandandan commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063915741 Also, referencing the direct indexing / perfect hash join here. I think that should be relatively simple to implement. https://github.com/duckdb/duckdb/pull/1959 #816
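
To illustrate the direct indexing idea mentioned above, here is a minimal sketch in plain Rust. It is not DataFusion or DuckDB code: the function names `build_direct_index` and `probe_direct_index`, the `max_range` cutoff, and the restriction to unique `i64` build keys are all assumptions made for the example. The idea is that when build-side keys fall in a small, dense integer range, `key - min` can be used as an array index, so no hashing or collision handling is needed.

```rust
/// Build side: try to create a direct-index table for integer join keys.
/// Returns `None` (meaning "fall back to a regular hash join") if the input
/// is empty, the key range exceeds `max_range`, or a key is duplicated.
/// All names and the fallback policy here are hypothetical.
fn build_direct_index(
    build_keys: &[i64],
    max_range: usize,
) -> Option<(i64, Vec<Option<usize>>)> {
    let min = *build_keys.iter().min()?;
    let max = *build_keys.iter().max()?;
    // Use i128 so that `max - min + 1` cannot overflow.
    let range = (max as i128 - min as i128 + 1) as u128;
    if range > max_range as u128 {
        return None;
    }
    let mut table = vec![None; range as usize];
    for (row, &key) in build_keys.iter().enumerate() {
        let slot = (key as i128 - min as i128) as usize;
        if table[slot].is_some() {
            return None; // duplicate key: this sketch only handles unique keys
        }
        table[slot] = Some(row);
    }
    Some((min, table))
}

/// Probe side: one subtraction and one array lookup per probe key.
/// Returns (build_row, probe_row) pairs for matching keys.
fn probe_direct_index(
    min: i64,
    table: &[Option<usize>],
    probe_keys: &[i64],
) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    for (probe_row, &key) in probe_keys.iter().enumerate() {
        let offset = key as i128 - min as i128;
        if offset >= 0 && (offset as u128) < table.len() as u128 {
            if let Some(build_row) = table[offset as usize] {
                matches.push((build_row, probe_row));
            }
        }
    }
    matches
}
```

For example, building over keys `[10, 12, 11]` gives `min = 10` and a three-slot table; probing with `[11, 9, 12, 13]` matches only the first and third probe rows. A real implementation would also need to handle duplicate build keys, nulls, and non-integer key types.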

Re: [I] Optimize the join operators [datafusion]

2025-07-11 Thread via GitHub
jonathanc-n commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063872693 Another thing we can do is hash it once and use parts of the hash at a time during `RepartitionExec` and building the hashtable. This is made even better with having to do a
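
A minimal sketch of the "hash once, use parts of the hash" idea from the comment above, in plain Rust. It does not use DataFusion's `RepartitionExec` or its hash table; the helper names (`hash_key`, `partition_of`, `bucket_of`, `repartition`), the use of the standard library's `DefaultHasher`, and the power-of-two partition and bucket counts are assumptions for illustration. The upper bits of the 64-bit hash choose the partition and the lower bits choose the hash table bucket, so the two decisions use disjoint bits and the key is hashed only once.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a join key exactly once; the result is carried along with the row.
fn hash_key(key: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Upper `partition_bits` bits select the partition
/// (there are 2^partition_bits partitions).
fn partition_of(hash: u64, partition_bits: u32) -> usize {
    if partition_bits == 0 {
        0 // a shift by 64 would overflow, so handle the single-partition case
    } else {
        (hash >> (64 - partition_bits)) as usize
    }
}

/// Lower `bucket_bits` bits select the bucket inside a partition's table
/// (assumes bucket_bits < 64).
fn bucket_of(hash: u64, bucket_bits: u32) -> usize {
    (hash & ((1u64 << bucket_bits) - 1)) as usize
}

/// Repartition keys, keeping each precomputed hash next to its key so the
/// build phase can call `bucket_of` without hashing again.
fn repartition(keys: &[i64], partition_bits: u32) -> Vec<Vec<(u64, i64)>> {
    let mut partitions = vec![Vec::new(); 1usize << partition_bits];
    for &key in keys {
        let hash = hash_key(key);
        partitions[partition_of(hash, partition_bits)].push((hash, key));
    }
    partitions
}
```

Taking the partition from the high bits and the bucket from the low bits matters: if both used the same low bits, every key in a given partition would land in the same small set of buckets.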

Re: [I] Optimize the join operators [datafusion]

2025-07-11 Thread via GitHub
Dandandan commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063754335 > > is it using a hash table or open addressing (df doesn't have the latter) > > [@XiangpengHao](https://github.com/XiangpengHao) has mentioned several times that we thi

Re: [I] Optimize the join operators [datafusion]

2025-07-11 Thread via GitHub
alamb commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063749417 > is it using a hash table or open addressing (df doesn't have the latter) @XiangpengHao has mentioned several times that we think DuckDB uses radix trees (which work l

Re: [I] Optimize the join operators [datafusion]

2025-07-11 Thread via GitHub
Dandandan commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063067577 Besides profiling, I would suggest researching how the other engines run the join and extracting some high-level learnings from it: * is it using a hash t

Re: [I] Optimize the join operators [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3050155137 Ohh, I missed the part in the documentation which said the h2o_small_join could be downloaded; I thought the h2o_small had it all 😆

Re: [I] Optimize the join operators [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3049026189 > I'm having trouble running flamegraphs on these h2o queries, not sure how to get these csv files to work with them. Anybody know how? I think you can generate the data us

Re: [I] Optimize the join operators [datafusion]

2025-07-07 Thread via GitHub
comphead commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3046954409 > I'm having trouble running flamegraphs on these h2o queries, not sure how to get these csv files to work with them. Anybody know how? If you run them locally you can t

Re: [I] Optimize the join operators [datafusion]

2025-07-07 Thread via GitHub
jonathanc-n commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3046943380 I'm having trouble running flamegraphs on these h2o queries, not sure how to get these csv files to work with them. Anybody know how?

Re: [I] Optimize the join operators [datafusion]

2025-07-07 Thread via GitHub
alamb commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3046572751 FYI @jonathanc-n @UBarney @comphead