acking-you commented on issue #15512:
URL: https://github.com/apache/datafusion/issues/15512#issuecomment-2813004760
I took a brief look at the descriptions of these optimizations. For example, dynamically adapting the `order by limit` execution using statistics is really cool! @alamb

## Idea

But I have some new ideas that seem to be more universally applicable to `order by limit` (q23); rough code sketches are appended at the end of this comment:

```text
┌──────────────────────────────────────┐
│ [Step 1] Filter RowGroups            │
│ - Use Parquet metadata to skip RGs   │
│   WHERE RG.min(EventTime) > cutoff OR│
│   RG lacks URL stats for '%google%'  │
└───────────────────┬──────────────────┘
                    ↓
┌──────────────────────────────────────┐
│ [Step 2] Read EventTime + Filter     │
│ - Scan EventTime column in valid RGs │
│ - Apply URL LIKE '%google%' filter   │
│ - Sort values → Track top 10 rows    │
└───────────────────┬──────────────────┘
                    ↓
┌──────────────────────────────────────┐
│ [Step 3] Record Row Locations        │
│ - Map top 10 EventTime to physical   │
│   positions (RG_ID + Row_Offset)     │
└───────────────────┬──────────────────┘
                    ↓
┌──────────────────────────────────────┐
│ [Step 4] Fetch 10 Rows               │
│ - Directly read rows from Parquet    │
│   via recorded positions (non-seq)   │
└───────────────────┬──────────────────┘
                    ↓
              Final Result
```

## Explore

Currently, q23 takes approximately 6 seconds to execute. I have confirmed that DataFusion does not perform the optimizations above and still scans a very large number of rows and columns.

By the way, is there a convenient way in `datafusion-cli` to view statistics on the number of rows and columns scanned? For now I simply print the batch information inside the `Like` expression; the output seems endless, the amount of data being scanned appears to be very large, and every batch has exactly 105 columns.

## Some concerns

Parquet is composed of RowGroups. Is it difficult to read an individual page? In previous work I have seen an optimization for exactly this scenario (built on DataFusion), but it relied on a custom columnar storage format, which made it easier to implement. On very large datasets (similar to the hits dataset), it brought the `order by limit` query time down to around 2 seconds.

## Summary

Reading the bulk of the data can be deferred until after the `order by` columns have been processed, which is very effective for the `order by limit` scenario. I'm not sure whether DataFusion currently does this.
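For concreteness, here is roughly the shape of the q23 query I'm referring to, driven through the DataFusion Rust API (the parquet path and table name are placeholders of my own). `EXPLAIN ANALYZE` does report per-operator runtime metrics such as `output_rows` (the same statement works in `datafusion-cli` too), though that is not quite the per-column scan breakdown I was asking about above:

```rust
// Minimal sketch using the DataFusion Rust API.
// Assumed Cargo deps: datafusion, tokio (with "macros" and "rt-multi-thread").
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Assumed local copy of the ClickBench file.
    ctx.register_parquet("hits", "hits.parquet", ParquetReadOptions::default())
        .await?;

    // EXPLAIN ANALYZE runs the query and prints per-operator runtime metrics
    // (e.g. output_rows), which at least shows how many rows each operator produced.
    ctx.sql(
        "EXPLAIN ANALYZE \
         SELECT * FROM hits \
         WHERE \"URL\" LIKE '%google%' \
         ORDER BY \"EventTime\" LIMIT 10",
    )
    .await?
    .show()
    .await?;

    Ok(())
}
```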
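Step 1 of the diagram could be prototyped directly against the Parquet footer metadata, something like the sketch below. It is only a sketch: I assume `EventTime` is physically an Int64 column, and the statistics accessor (`min_opt`) is from recent versions of the `parquet` crate; older versions use slightly different method names.

```rust
// Sketch of "[Step 1] Filter RowGroups": keep only row groups whose EventTime
// minimum is not already above the current top-10 cutoff, using footer
// metadata only. Assumed Cargo dep: parquet (with the "arrow" feature).
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::file::statistics::Statistics;

fn prunable_row_groups(
    path: &str,
    sort_col: usize,
    cutoff: i64,
) -> Result<Vec<usize>, Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?;
    let mut keep = Vec::new();
    for (rg_idx, rg) in builder.metadata().row_groups().iter().enumerate() {
        let keep_rg = match rg.column(sort_col).statistics() {
            // Only Int64 statistics are handled in this sketch; a real
            // implementation would cover every physical type.
            Some(Statistics::Int64(s)) => {
                s.min_opt().map(|min| *min <= cutoff).unwrap_or(true)
            }
            // No usable statistics: the row group has to be read.
            _ => true,
        };
        if keep_rg {
            keep.push(rg_idx);
        }
    }
    Ok(keep)
}
```

The surviving indices could then be handed to `ParquetRecordBatchReaderBuilder::with_row_groups` so that only those row groups get decoded.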
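And Steps 2–4 (scan only the sort column, remember where the top 10 rows live, then fetch just those rows with all columns) could be prototyped with the arrow reader's projection and row-selection APIs. Again purely a sketch under simplifying assumptions: the `URL LIKE '%google%'` filter is omitted, `EventTime` is assumed to be Int64, and no row groups are pruned beforehand, so global row indices map directly onto the row selection.

```rust
// Two-pass "read the sort column first, fetch whole rows later" sketch.
// Assumed Cargo deps: arrow, parquet (with the "arrow" feature).
use std::collections::BinaryHeap;
use std::fs::File;

use arrow::array::{Array, Int64Array};
use parquet::arrow::arrow_reader::{
    ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = "hits.parquet"; // assumed local ClickBench file
    let limit = 10usize;

    // ---- Pass 1: scan only the sort column, tracking the `limit` smallest
    // values together with their global row positions.
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?;
    let sort_col = builder
        .parquet_schema()
        .columns()
        .iter()
        .position(|c| c.name() == "EventTime")
        .ok_or("EventTime column not found")?;
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [sort_col]);
    let reader = builder.with_projection(mask).build()?;

    // Max-heap of (value, row index): popping the largest keeps the smallest N.
    let mut heap: BinaryHeap<(i64, usize)> = BinaryHeap::new();
    let mut row_offset = 0usize;
    for batch in reader {
        let batch = batch?;
        let col = batch
            .column(0)
            .as_any()
            .downcast_ref::<Int64Array>()
            .ok_or("this sketch expects an Int64 EventTime column")?;
        for i in 0..col.len() {
            if col.is_valid(i) {
                heap.push((col.value(i), row_offset + i));
                if heap.len() > limit {
                    heap.pop();
                }
            }
        }
        row_offset += batch.num_rows();
    }

    // ---- Pass 2: fetch only the selected rows, all columns, via RowSelection.
    let mut rows: Vec<usize> = heap.into_iter().map(|(_, idx)| idx).collect();
    rows.sort_unstable();
    let mut selectors = Vec::new();
    let mut cursor = 0usize;
    for r in rows {
        if r > cursor {
            selectors.push(RowSelector::skip(r - cursor));
        }
        selectors.push(RowSelector::select(1));
        cursor = r + 1;
    }
    let selection = RowSelection::from(selectors);

    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?
        .with_row_selection(selection)
        .build()?;
    for batch in reader {
        println!("{:?}", batch?); // the (at most) `limit` materialized rows
    }
    Ok(())
}
```

The point of the sketch is only to show that the heavy 105-column decode happens exclusively in Pass 2, on at most 10 rows; everything before that touches one column plus metadata.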