EmeraldShift commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2840412850
> They are currently exploring the possibility of using it alongside
projections (a feature in ClickHouse akin to materialized views) to create
secondary indexes and simila
acking-you commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824964842
> Relevant:
https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization
Thank you so much for sharing this blog link - it's truly an ex
acking-you commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824790924
> I tried the rewrite into a Semi join and indeed it is over 2x slower
(5.3sec vs 12sec)
>
> > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY
"E
Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740754487
Ah, actually, the query given by @xudong963 is slightly off. I think it
should be the following (without the explicit join):
```
> EXPLAIN (WITH ids AS (SELECT row_id,
adriangb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740806655
I think the difference is that DuckDB _dynamically_ pushes down the current
state of the TopK heap into file opening as described in #15037 and implemented
in #15301
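
To make the idea of pushing the current TopK heap state into file opening concrete, here is a minimal sketch. It is not DataFusion or DuckDB code: the function name, the `(min_key, rows)` layout standing in for per-file min statistics, and the sample data are all hypothetical. It only illustrates how a threshold taken from the heap can let a scan skip files that cannot contribute to `ORDER BY key LIMIT k`.

```python
import heapq


def topk_with_dynamic_filter(files, k):
    """Return the k smallest keys across `files` and how many files were opened.

    `files` is a list of (min_key, rows) pairs; min_key is a hypothetical
    stand-in for the minimum-value statistic stored per file.
    """
    heap = []    # max-heap via negated keys; holds the k smallest keys seen so far
    opened = 0
    for min_key, rows in files:
        # Dynamic filter: once the heap is full, its largest key is the threshold.
        # A file whose minimum is not below the threshold cannot change the result.
        if len(heap) == k and min_key >= -heap[0]:
            continue
        opened += 1
        for key in rows:
            if len(heap) < k:
                heapq.heappush(heap, -key)
            elif key < -heap[0]:
                heapq.heapreplace(heap, -key)
    return sorted(-x for x in heap), opened


# Hypothetical data: four "files" with their minimum key and their keys.
files = [(1, [5, 1, 3]), (10, [10, 12]), (2, [2, 9]), (4, [4, 6])]
result, opened = topk_with_dynamic_filter(files, k=3)
print(result, opened)
```

The threshold tightens as better rows arrive, so later files are more likely to be skipped than they would be with a filter fixed at plan time.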
--
This
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740983293
BTW combined with @adriangb's PR here
- https://github.com/apache/datafusion/pull/15301
It will likely go crazy fast
--
This is an automated message from the Apache
Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2741469672
I traced this down to an issue in the planner, which uses
`PartitionMode::Auto` iff stats are collected
(`datafusion.execution.collect_statistics`).
We can however still use
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740980237
> Thanks for checking [@alamb](https://github.com/alamb) !
>
> I think a large portion is spent in the hash join (repartitioning the
right side input) - I think because it r
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740900315
I am not really sure where the time is going
output of explain analyze:
[explain.txt](https://github.com/user-attachments/files/19370532/explain.txt)
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740888007
I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec
vs 12sec)
```sql
> SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY
"Ev
Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740936826
Thanks for checking, @alamb!
I think a large portion is spent in the hash join (repartitioning the right
input) - I think because it runs as `Partitioned` hash join, instea
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740855773
> I did not fully get this part. DF has semi join support and some rewrites
to utilize it in similar cases?
> The query transformation in SQL as given by @xudong963 is optim
Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740739642
> Note that late materialization (the join / semi join rewrite) needs join
operator support that DataFusion doesn't yet have (we could add it but it will
take non-trivial effo
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2718103422
> We can split the idea into the query:
I agree -- this is what I meant by "late materialization". Your example /
explanation is much better than mine, @xudong963
xudong963 commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2718074072
There is a similar thought named `prewhere`:
https://clickhouse.com/docs/sql-reference/statements/select/prewhere.
Even though it aims to filter, the idea is similar, fo
alamb opened a new issue, #15177:
URL: https://github.com/apache/datafusion/issues/15177
### Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/14586
[Comparing ClickBench on DataFusion 45 and DuckDB
(link)](https://be
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717635254
Note that late materialization (the join / semi join rewrite) needs join
operator support that DataFusion doesn't yet have (we could add it, but it will
take non-trivial effort).
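
As a rough illustration of what the join / semi join rewrite means at the SQL level, the sketch below runs a direct TopK query and a late-materialized form on a tiny in-memory SQLite table and checks that they agree. The table `hits_demo`, its columns (`row_id`, `url`, `ts`, `payload`) and the data are hypothetical; they are not the ClickBench schema, and SQLite is used only because it is self-contained. It says nothing about how DataFusion would plan or execute either form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits_demo ("
    "row_id INTEGER PRIMARY KEY, url TEXT, ts INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO hits_demo VALUES (?, ?, ?, ?)",
    [
        (1, "https://google.com/a", 30, "wide-1"),
        (2, "https://example.com/", 10, "wide-2"),
        (3, "https://google.com/b", 20, "wide-3"),
        (4, "https://google.com/c", 40, "wide-4"),
        (5, "https://other.org/", 5, "wide-5"),
    ],
)

# Direct form: every column of every matching row flows through the sort.
direct = conn.execute(
    "SELECT * FROM hits_demo WHERE url LIKE '%google%' ORDER BY ts LIMIT 2"
).fetchall()

# Late-materialized form: find the winning row ids using only the narrow
# columns, then fetch the wide columns for just those rows (a semi join).
late = conn.execute(
    """
    WITH ids AS (
        SELECT row_id FROM hits_demo
        WHERE url LIKE '%google%'
        ORDER BY ts LIMIT 2
    )
    SELECT * FROM hits_demo
    WHERE row_id IN (SELECT row_id FROM ids)
    ORDER BY ts
    """
).fetchall()

print(direct == late)
```

The two forms are only guaranteed to match here because the `ts` values are unique; with ties in the sort key, the inner and outer queries could pick different rows.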
robert3005 commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717583565
There are two optimizations here that go together. If you check the ClickBench
results, DuckDB on its own format is significantly faster than on Parquet. The
two optimizer rules that
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717477809
BTW apparently DuckDB uses the "late materialization" technique with its own
native format. Here is an explain courtesy of Joe Issacs and Robert Kruszewski
```
┌─
alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717443917
-- here is the duckdb plan and it shows what they are doing!
The key is this line:
```
│ Filters: │
│ optional: Dynamic Filter │
│