zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3086584860
> > > > Updated: our benchmark uses the DataFusion internal source
rather than datafusion-python; I am not sure if it will make a
difference.
> > >
> > >
UBarney commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3084255495
> > > Updated: our benchmark uses the DataFusion internal source
rather than datafusion-python; I am not sure if it will make a
difference.
> >
> >
> > Th
zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3084151418
> > Updated: our benchmark uses the DataFusion internal source rather
than datafusion-python; I am not sure if it will make a difference.
>
> The results a
UBarney commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3084107356
> Updated: our benchmark uses the DataFusion internal source rather
than datafusion-python; I am not sure if it will make a difference.
The results are similar
zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082712617
Updated Parquet result from my local machine using the 1e8 dataset; it is even faster:
```rust
./bench.sh run h2o_medium_join_parquet
***
DataFusi
zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082400648
> > > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join
results on my M3 Macbook with 16GB of RAM:
> > >
> > >
> > > [@MrPowers](https://github.com/
UBarney commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082362965
> > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results
on my M3 Macbook with 16GB of RAM:
> >
> >
> > [@MrPowers](https://github.com/MrPowers) I
zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082347855
@mrpowers-wb
I submitted a PR for the h2o benchmark to support the Parquet format in DataFusion,
but it is blocked by falsa join dataset generation. Details:
https://github.com/a
zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082269096
> > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results
on my M3 Macbook with 16GB of RAM:
>
> [@MrPowers](https://github.com/MrPowers) I am using t
UBarney commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082082344
> [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on
my M3 Macbook with 16GB of RAM:
@MrPowers I am using the **1e8** dataset.
```
target
MrPowers commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3080025409
@UBarney - here are the 1e7 join results on my M3 Macbook with 16GB of RAM:

UBarney commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3078931663
> Thanks [@nuno-faria](https://github.com/nuno-faria) that's a great insight
(for TPC-H / very nested joins we probably should implement a smarter join
order algorithm).
>
zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3078275180
> [@zhuqi-lucas](https://github.com/zhuqi-lucas) - these benchmarks use
Parquet files, see the querybench repo for the code:
https://github.com/MrPowers/querybench. I think
mrpowers-wb commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3078052376
@zhuqi-lucas - these benchmarks use Parquet files, see the querybench repo
for the code: https://github.com/MrPowers/querybench. I think Parquet is a lot
better for these b
zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3074225566
> DataFusion is underperforming the Polars streaming engine on some
localhost join queries (1e8 rows of data on a Macbook M3 with 16GB of RAM):
>
> https://private-use
Dandandan commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3073881761
Thanks @nuno-faria that's a great insight (for TPC-H / very nested joins we
probably should implement a smarter join order algorithm).
For h2o joins however, it seems it
nuno-faria commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3073363029
I have also been looking at join performance and I think the main limitation
is the join order, followed by the lack of join parameterization.
In TPC-H, 6 queries use a bad join
Dandandan commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063915741
Also, referencing the direct indexing / perfect hash join here.
I think that should be relatively simple to implement.
https://github.com/duckdb/duckdb/pull/1959
#816
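
To illustrate the direct indexing idea mentioned above, here is a minimal sketch in Rust. It assumes unique `i64` build-side keys that fall in a small, dense range, so a plain array indexed by `key - min` can stand in for a hash table. The `DirectIndex` type, its methods, and the `max_slots` limit are hypothetical names for illustration only; this is not DataFusion's or DuckDB's actual implementation.

```rust
/// Direct-index ("perfect hash") lookup table for unique integer join keys.
/// Hypothetical illustration; not DataFusion's or DuckDB's actual code.
struct DirectIndex {
    /// Smallest build-side key; slot `i` holds the row for key `min + i`.
    min: i64,
    /// Build-side row index per key, or `None` if the key is absent.
    slots: Vec<Option<usize>>,
}

impl DirectIndex {
    /// Returns `None` when direct indexing does not apply (empty input,
    /// key range wider than `max_slots`, or duplicate keys), so the
    /// caller can fall back to an ordinary hash join.
    fn build(keys: &[i64], max_slots: usize) -> Option<Self> {
        let min = *keys.iter().min()?;
        let max = *keys.iter().max()?;
        // Widen to i128 so `max - min + 1` cannot overflow.
        let range = (max as i128 - min as i128 + 1) as u128;
        if range > max_slots as u128 {
            return None;
        }
        let mut slots: Vec<Option<usize>> = vec![None; range as usize];
        for (row, &key) in keys.iter().enumerate() {
            let slot = &mut slots[(key as i128 - min as i128) as usize];
            if slot.is_some() {
                return None; // duplicate key: one slot per key is not enough
            }
            *slot = Some(row);
        }
        Some(Self { min, slots })
    }

    /// Probe with a single subtraction and bounds check; no hashing.
    fn probe(&self, key: i64) -> Option<usize> {
        let offset = key as i128 - self.min as i128;
        if offset < 0 || offset >= self.slots.len() as i128 {
            return None;
        }
        self.slots[offset as usize]
    }
}
```

The trade-off in this sketch is memory versus speed: the table costs one slot per value in the key range, which is why `build` refuses ranges above `max_slots`.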
jonathanc-n commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063872693
Another thing we can do is hash it once and use parts of the hash at a time
during `RepartitionExec` and building the hashtable. This is made even better
with having to do a
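
The hash-once idea in this comment can be sketched roughly as follows, in Rust. The assumption here is a 64-bit hash computed a single time per key, with the upper bits choosing the partition and the lower bits choosing the hash-table bucket. All function names are hypothetical, `DefaultHasher` is a stand-in for whatever hash function the engine uses, and this is not how `RepartitionExec` is actually implemented.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a join key once; the result is carried along with the row.
fn hash_key(key: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Pick a partition from the upper 32 bits of the hash.
fn partition_of(hash: u64, num_partitions: usize) -> usize {
    ((hash >> 32) as usize) % num_partitions
}

/// Pick a bucket from the low bits; `num_buckets` must be a power of two.
fn bucket_of(hash: u64, num_buckets: usize) -> usize {
    debug_assert!(num_buckets.is_power_of_two());
    (hash as usize) & (num_buckets - 1)
}

/// Repartition step: hash each key once and keep the hash next to it.
fn repartition(keys: &[i64], num_partitions: usize) -> Vec<Vec<(i64, u64)>> {
    let mut parts: Vec<Vec<(i64, u64)>> = vec![Vec::new(); num_partitions];
    for &key in keys {
        let hash = hash_key(key);
        parts[partition_of(hash, num_partitions)].push((key, hash));
    }
    parts
}

/// Build step: reuse the stored hash instead of hashing the key again.
fn build_buckets(rows: &[(i64, u64)], num_buckets: usize) -> Vec<Vec<i64>> {
    let mut buckets: Vec<Vec<i64>> = vec![Vec::new(); num_buckets];
    for &(key, hash) in rows {
        buckets[bucket_of(hash, num_buckets)].push(key);
    }
    buckets
}
```

Using different bits for the two steps matters: if both used the same low bits, every row in a partition would share those bits and the rows would cluster into a few buckets.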
Dandandan commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063754335
> > is it using a hash table or open addressing (df doesn't have the latter)
>
> [@XiangpengHao](https://github.com/XiangpengHao) has mentioned several
times that we thi
alamb commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063749417
> is it using a hash table or open addressing (df doesn't have the latter)
@XiangpengHao has mentioned several times that we think DuckDB uses radix
trees (which work l
Dandandan commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3063067577
Besides profiling, I would like to suggest researching how the other engines
run the join and extracting some high-level learnings from it:
* is it using a hash t
jonathanc-n commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3050155137
Ohh, I missed the part in the documentation saying that h2o_small_join
can be downloaded; I thought h2o_small had it all 😆
--
This is an automated message from the
alamb commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3049026189
> I'm having trouble running flamegraphs on these h2o queries, not sure how
to get these CSV files to work with them. Anybody know how?
I think you can generate the data us
comphead commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3046954409
> I'm having trouble running flamegraphs on these h2o queries, not sure how
to get these CSV files to work with them. Anybody know how?
if you run them locally you can t
jonathanc-n commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3046943380
I'm having trouble running flamegraphs on these h2o queries, not sure how
to get these CSV files to work with them. Anybody know how?
alamb commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3046572751
FYI @jonathanc-n @UBarney @comphead
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go t