Re: [I] 2gb parquet file takes 100s to process, even on second attempt (on main) [datafusion]

via GitHub Sat, 14 Dec 2024 10:25:40 -0800


Dandandan commented on issue #13785:
URL: https://github.com/apache/datafusion/issues/13785#issuecomment-2543275736


   Hi @TheBuilderJR thanks for opening the issue.
   
   Is there a way we could reproduce your results?
   Did you compare performance to other engines (e.g. Spark, DuckDB)?
   
   Let me try to address some of it:
   
   > SELECT * FROM table ORDER by timestamp two times, see both times take over 100s
   
   This is an expensive query because it has to:
   1. Scan all data / all columns (because it has no column selection, no 
`WHERE` or `LIMIT`)
   2. Order the entire dataset based on timestamp
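
To make the cost of the missing `LIMIT` concrete, here is a minimal Python sketch (plain Python, not DataFusion code). The row layout and the function names `order_by_all` and `order_by_limit` are hypothetical and exist only for illustration. Without a limit, every row has to be sorted and returned; with a limit of k, an engine can in principle keep just the k smallest rows seen so far.

```python
import heapq
import random


def order_by_all(rows):
    # ORDER BY timestamp with no LIMIT: every row is sorted and returned.
    return sorted(rows, key=lambda r: r["timestamp"])


def order_by_limit(rows, k):
    # ORDER BY timestamp LIMIT k: only the k smallest rows need to be kept.
    return heapq.nsmallest(k, rows, key=lambda r: r["timestamp"])


random.seed(0)
# Hypothetical stand-in data; a real table would come from the Parquet file.
rows = [{"timestamp": random.random(), "payload": i} for i in range(100_000)]

full = order_by_all(rows)          # sorts and materializes all 100,000 rows
top10 = order_by_limit(rows, 10)   # same first 10 rows, far less work
```

Both calls agree on the first ten rows, but only the first one has to order and hold the whole dataset, which is the situation the query in the issue is in.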
   
   > But I expected second time to at least be faster
   
   DataFusion is a stateless query engine and does not cache anything between queries, so the second query often doesn't run much faster than the first.
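
Since the engine keeps no state, any reuse of a previous result would have to happen in the calling application. The sketch below is one hedged way to picture that, in plain Python. `CachingRunner` and `fake_engine` are hypothetical names, and `fake_engine` is only a stand-in for an expensive scan and sort, not a DataFusion API.

```python
class CachingRunner:
    """Caches results by SQL text; the engine itself stays stateless."""

    def __init__(self, run_query):
        self._run_query = run_query  # hypothetical callable: sql -> result
        self._cache = {}
        self.engine_calls = 0

    def run(self, sql):
        # Only call the engine when this exact SQL text has not been seen.
        if sql not in self._cache:
            self.engine_calls += 1
            self._cache[sql] = self._run_query(sql)
        return self._cache[sql]


def fake_engine(sql):
    # Stand-in for an expensive scan + sort; not a DataFusion API.
    return sorted([3, 1, 2])


runner = CachingRunner(fake_engine)
first = runner.run("SELECT * FROM t ORDER BY timestamp")
second = runner.run("SELECT * FROM t ORDER BY timestamp")  # served from the cache
```

A cache like this trades memory for time and never notices when the underlying file changes, so it only makes sense when the result is small enough to hold and the data is static.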
   
   > Ideally first time also utilizes the file statistics to run faster.
   
   I don't think statistics could be used for a query like this.
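
One way to see why: Parquet min/max statistics are typically used to skip row groups that cannot match a filter. The sketch below is plain Python, not DataFusion or Parquet library code. `RowGroupStats` and `groups_to_scan` are hypothetical names and the timestamp ranges are made up. With a range filter some row groups can be skipped; with no filter at all, as in `SELECT * ... ORDER BY timestamp`, every row group still has to be read.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    # Hypothetical per-row-group min/max for the timestamp column.
    min_ts: int
    max_ts: int


def groups_to_scan(stats, lo=None, hi=None):
    """Return indices of row groups that might contain rows in [lo, hi].

    With no bounds (no WHERE clause) nothing can be skipped.
    """
    keep = []
    for i, s in enumerate(stats):
        if lo is not None and s.max_ts < lo:
            continue  # whole group lies below the filter range
        if hi is not None and s.min_ts > hi:
            continue  # whole group lies above the filter range
        keep.append(i)
    return keep


stats = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299)]

# WHERE timestamp BETWEEN 120 AND 150: only the middle group can match.
with_filter = groups_to_scan(stats, lo=120, hi=150)

# No WHERE clause: all groups must be scanned.
without_filter = groups_to_scan(stats)
```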
   
   >  I upgraded 3 major version bumps in one go and expected some sort of 
noticeable improvement.
   
   Recent versions mostly improve aggregation (`GROUP BY`) performance, so you would notice the difference in queries that use `GROUP BY`. The query in the example depends heavily on Parquet and `ORDER BY` being fast and probably won't see much of an improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


