adriangb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2855182639
This seems to be working and provides a tangible performance improvement BUT I don't think it's as good as it could be because of the Optimizer ordering. In particular:

```
explain SELECT "EventTime" FROM 'benchmarks/data/hits_partitioned/' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
```

```
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │  SortPreservingMergeExec  │ |
|               | │    --------------------   │ |
|               | │    EventTime ASC NULLS    │ |
|               | │         LASTlimit:        │ |
|               | │             10            │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       SortExec(TopK)      │ |
|               | │    --------------------   │ |
|               | │ EventTime@0 ASC NULLS LAST│ |
|               | │                           │ |
|               | │         limit: 10         │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       ProjectionExec      │ |
|               | │    --------------------   │ |
|               | │    EventTime: EventTime   │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         files: 111        │ |
|               | │      format: parquet      │ |
|               | │                           │ |
|               | │         predicate:        │ |
|               | │ CAST(URL AS Utf8View) LIKE│ |
|               | │     %google% AND true     │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
```

I think the extra `CoalesceBatchesExec` may be hurting perf, but not sure... some more measurement is needed.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org