valkum opened a new issue, #17446: URL: https://github.com/apache/datafusion/issues/17446
### Describe the bug I encountered a slow query, which I already mentioned in Discord but got no answer. I went ahead and created a repro case with data I can share that showcases this slow query. Plans: ``` DataFrame() +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Projection: df.name, df.group, CASE WHEN array_element(markets, Int64(1)) IS NOT NULL THEN markets ELSE List() END AS CASE WHEN markets[Int64(1)] IS NOT NULL THEN markets ELSE NULL END | | | Aggregate: groupBy=[[df.name, df.group]], aggr=[[array_agg(market) AS markets]] | | | Projection: df.name, df.group, CASE WHEN df.market IS NOT NULL AND df.price IS NOT NULL THEN named_struct(Utf8("market"), CAST(df.market AS Dictionary(UInt16, Utf8)), Utf8("price"), df.price) ELSE Struct({market:,price:}) END AS market | | | TableScan: df projection=[name, group, market, price] | | physical_plan | ProjectionExec: expr=[name@0 as name, group@1 as group, CASE WHEN array_element(markets@2, 1) IS NOT NULL THEN markets@2 END as CASE WHEN markets[Int64(1)] IS NOT NULL THEN markets ELSE NULL END] | | | AggregateExec: mode=FinalPartitioned, gby=[name@0 as name, group@1 as group], aggr=[markets], ordering_mode=Sorted | | | SortExec: expr=[name@0 ASC NULLS LAST, group@1 ASC NULLS LAST], preserve_partitioning=[true] | | | CoalesceBatchesExec: target_batch_size=8192 | | | RepartitionExec: partitioning=Hash([name@0, group@1], 10), input_partitions=10 | | | AggregateExec: mode=Partial, gby=[name@0 as name, group@1 as group], aggr=[markets], ordering_mode=Sorted | | | ProjectionExec: expr=[name@0 as name, group@1 as group, CASE WHEN market@2 IS NOT NULL AND price@3 IS NOT NULL THEN named_struct(market, CAST(market@2 AS Dictionary(UInt16, Utf8)), price, price@3) END as market] | | | DataSourceExec: file_groups={10 groups: [, ...]}, projection=[name, group, market, price], output_ordering=[name@0 ASC NULLS LAST, group@1 ASC NULLS LAST], file_type=parquet | | | | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` Hyperfine reports: ``` Benchmark 1: uv run test_polars.py sample-1m.parquet Time (mean ± σ): 300.2 ms ± 104.0 ms [User: 538.6 ms, System: 140.1 ms] Range (min … max): 249.2 ms … 585.3 ms 10 runs Warning: The first benchmarking run for this command was significantly slower than the rest (585.3 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run. Benchmark 2: uv run test_datafusion.py sample-1m.parquet Time (mean ± σ): 1.320 s ± 0.062 s [User: 4.919 s, System: 0.258 s] Range (min … max): 1.264 s … 1.444 s 10 runs Summary uv run test_polars.py sample-1m.parquet ran 4.40 ± 1.54 times faster than uv run test_datafusion.py sample-1m.parquet ``` ### To Reproduce See https://github.com/valkum/polars-datafusion-comparison/tree/main for full repro case. ### Expected behavior _No response_ ### Additional context I am not sure if a similar file and task is part of the benchmarks. As the 1m sample file is 14MB in size, thus it makes no sense to include such a file in the git repo. A smaller sample file (e.g. 10k) only results a 1.49x difference by Polars. Feel free to store a generated file to include this in the benchmarks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org