Dandandan commented on issue #13785: URL: https://github.com/apache/datafusion/issues/13785#issuecomment-2543275736
Hi @TheBuilderJR, thanks for opening the issue. Is there a way we could reproduce your results? Did you compare performance to other engines (e.g. Spark, DuckDB)?

Let me try to address some of it:

> SELECT * FROM table ORDER by timestamp two times, see both times take over 100s

This is an expensive query because it has to:

1. Scan all data / all columns (because it has no column selection, no `WHERE` or `LIMIT`)
2. Order the entire dataset based on timestamp

> But I expected second time to at least be faster

DataFusion is a stateless query engine and won't cache anything, so the second query often doesn't run much faster than the first.

> Ideally first time also utilizes the file statistics to run faster.

I don't think statistics could be used for a query like this.

> I upgraded 3 major version bumps in one go and expected some sort of noticeable improvement.

Recent versions mostly improve aggregation (`GROUP BY`) performance, so you could notice this in queries that use `GROUP BY`. The query in the example depends heavily on Parquet and `ORDER BY` being fast and probably won't see much of an improvement.
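
To illustrate why the missing column selection and `LIMIT` matter, here is a minimal Python sketch. It is not DataFusion code and does not use its API: `rows`, `full_sort`, and `top_k_projected` are hypothetical names, and the in-memory list only stands in for a table. It contrasts sorting every row with every column against keeping only the `k` smallest timestamps over a two-column projection.

```python
import heapq
import random

# Hypothetical in-memory stand-in for a table; this is not DataFusion code.
random.seed(0)
rows = [
    {"timestamp": random.randrange(1_000_000), "id": i, "payload": "x" * 32}
    for i in range(10_000)
]


def full_sort(table):
    """Rough analogue of SELECT * ... ORDER BY timestamp.

    Every row, with every column, is carried through the sort.
    """
    return sorted(table, key=lambda r: r["timestamp"])


def top_k_projected(table, k):
    """Rough analogue of SELECT timestamp, id ... ORDER BY timestamp LIMIT k.

    Only two columns are read, and heapq.nsmallest avoids fully
    ordering rows that cannot end up in the result.
    """
    projected = ((r["timestamp"], r["id"]) for r in table)
    return heapq.nsmallest(k, projected)


everything = full_sort(rows)           # sorts all 10,000 full rows
first_ten = top_k_projected(rows, 10)  # keeps only the 10 smallest
```

The second function does far less work because it never materializes a full ordering and never touches `payload`. Whether a given engine applies the same kind of shortcut depends on the query actually containing a projection or a `LIMIT`, which the query in the issue does not.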