Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-29 Thread via GitHub
EmeraldShift commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2840412850 > They are currently exploring the possibility of using it alongside projections (a feature in ClickHouse akin to materialized views) to create secondary indexes and simila

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-23 Thread via GitHub
acking-you commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824964842 > Relevant: https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization Thank you so much for sharing this blog link - it’s truly an ex

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-23 Thread via GitHub
acking-you commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824790924 > I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec vs 12sec) > > > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "E

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-05 Thread via GitHub
Dandandan commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740754487 Ah actually, the query given by @xudong963 is slightly off, I think it should be the following (without the explicit join): ``` > EXPLAIN (WITH ids AS (SELECT row_id,

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-04 Thread via GitHub
adriangb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740806655 I think the difference is that DuckDB _dynamically_ pushes down the current state of the TopK heap into file opening as described in #15037 and implemented in #15301
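
To illustrate the dynamic TopK pushdown idea mentioned above, here is a minimal Python sketch. It is not DataFusion's or DuckDB's implementation: the per-file `min` statistic, the `files` layout, and the function name are all hypothetical, and it assumes an ascending `ORDER BY ... LIMIT k` on a single integer column. Once the heap holds `k` rows, its current worst value acts as a filter, and any file whose minimum cannot beat it is skipped without being opened.

```python
import heapq


def topk_with_dynamic_filter(files, k):
    """Return (the k smallest values in ascending order, number of files skipped).

    `files` is a hypothetical list of dicts: {"min": int, "rows": [int, ...]},
    where "min" stands in for per-file min statistics on the sort column.
    """
    heap = []  # max-heap via negation; holds the k smallest values seen so far
    skipped = 0
    for f in files:
        # Dynamic filter: with a full heap, -heap[0] is the current k-th smallest.
        # A file whose minimum is not below it cannot change the result.
        if len(heap) == k and f["min"] >= -heap[0]:
            skipped += 1
            continue
        for v in f["rows"]:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:
                heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap), skipped
```

The threshold tightens as the scan proceeds, so the benefit depends on file order: files read after a good threshold is established are the ones that can be skipped.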

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-22 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740983293 BTW combined with @adriangb's PR here - https://github.com/apache/datafusion/pull/15301 It will likely go crazy fast 🚀

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2741469672 I traced this down to an issue in the planner, which uses `PartitionMode::Auto` iff stats are collected (`datafusion.execution.collect_statistics`) We can however still use

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740980237 > Thanks for checking [@alamb](https://github.com/alamb) ! > > I think a large portion is spent in the hash join (repartitioning the right side input) - I think because it r

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740900315 I am not really sure where the time is going 🤔 output of explain analyze: [explain.txt](https://github.com/user-attachments/files/19370532/explain.txt)

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740888007 I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec vs 12sec) ```sql > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "Ev

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740936826 Thanks for checking @alamb ! I think a large portion is spent in the hash join (repartitioning the right side input) - I think because it runs as `Partitioned` hash join, instea

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740855773 > I did not fully get this part. DF has semi join support and some rewrites to utilize it in similar cases? > The query transformation in SQL as given by @xudong963 is optim

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740739642 > Note that late materialization (the join / semi join rewrite) needs join operator support that DataFusion doesn't yet have (we could add it but it will take non trivial effo

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2718103422 > We can split the idea to the query: I agree -- this is what I meant by "late materialization" . Your example / explanation is much better than mine @xudong963 🙏

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub
xudong963 commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2718074072 There is a similar thought named `prewhere`: https://clickhouse.com/docs/sql-reference/statements/select/prewhere. Even though it aims to filter, the idea is similar, fo

[I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub
alamb opened a new issue, #15177: URL: https://github.com/apache/datafusion/issues/15177 ### Is your feature request related to a problem or challenge? Part of https://github.com/apache/datafusion/issues/14586 [Comparing ClickBench on DataFusion 45 and DuckDB (link)](https://be

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717635254 Note that late materialization (the join / semi join rewrite) needs join operator support that DataFusion doesn't yet have (we could add it but it will take non trivial effort)
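
For readers new to the term, the "late materialization" idea discussed in this thread can be sketched in Python as below. This is only an illustration under stated assumptions, not how DataFusion or DuckDB implement it: the `narrow` tuple layout, the `fetch_wide` callback, and the function name are hypothetical. The first pass touches only the filter and sort columns plus a row id; the wide row is fetched only for the `k` winners.

```python
import heapq


def late_materialized_topk(narrow, fetch_wide, pattern, k):
    """Top-k by event time among rows whose URL contains `pattern`.

    `narrow` is a hypothetical list of (row_id, url, event_time) tuples, i.e.
    only the columns needed to filter and sort. `fetch_wide(row_id)` is a
    hypothetical callback that returns the full row for one id.
    """
    # Pass 1: filter and rank using the narrow columns only.
    candidates = [(t, rid) for rid, url, t in narrow if pattern in url]
    top_ids = [rid for _, rid in heapq.nsmallest(k, candidates)]
    # Pass 2: materialize the wide rows for the k surviving ids only.
    return [fetch_wide(rid) for rid in top_ids]
```

The second pass plays the role of the join (or semi join) back to the full table, which is why the cost of that join decides whether the rewrite is a win.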

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub
robert3005 commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717583565 There's two optimizations here that go together, if you check clickbench results duckdb on their own format is significantly faster than parquet. The two optimizer rules that

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717477809 BTW apparently DuckDB uses the "late materialization" technique with its own native format. Here is an explain courtesy of Joe Issacs and Robert Kruszewski ``` ┌─

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717443917 -- here is the duckdb plan and it shows what they are doing! The key is this line: ``` │ Filters: │ │ optional: Dynamic Filter │ │