acking-you commented on issue #15512:
URL: https://github.com/apache/datafusion/issues/15512#issuecomment-2813004760

   I briefly looked at the descriptions of these optimizations. For example, the approach of dynamically pruning the `ORDER BY ... LIMIT` process using statistics is really cool! @alamb
   
   ## Idea
   However, I have a new idea that seems more broadly applicable to `ORDER BY ... LIMIT` (q23):
   ```text
   ┌──────────────────────────────────────┐
   │ [Step 1] Filter RowGroups            │
   │ - Use Parquet metadata to skip RGs   │
   │   WHERE RG.min(EventTime) > cutoff OR│
   │   RG lacks URL stats for '%google%'  │
   └───────────────────┬──────────────────┘
                       ↓
   ┌──────────────────────────────────────┐
   │ [Step 2] Read EventTime + Filter     │
   │ - Scan EventTime column in valid RGs │
   │ - Apply URL LIKE '%google%' filter   │
   │ - Sort values → Track top 10 rows    │
   └───────────────────┬──────────────────┘
                       ↓
   ┌──────────────────────────────────────┐
   │ [Step 3] Record Row Locations        │
   │ - Map top 10 EventTime to physical   │
   │   positions (RG_ID + Row_Offset)     │
   └───────────────────┬──────────────────┘
                       ↓
   ┌──────────────────────────────────────┐
   │ [Step 4] Fetch 10 Rows               │
   │ - Directly read rows from Parquet    │
   │   via recorded positions (non-seq)   │
   └───────────────────┬──────────────────┘
                       ↓
                 Final Result
   ```
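   
   For reference, q23 here is presumably the ClickBench query `SELECT * FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10`, which matches the diagram above. Below is a minimal Rust sketch of steps 2-3, assuming the filter verdicts and physical row positions are already available; all names and the `i64` encoding of `EventTime` are illustrative, not DataFusion APIs:
   
   ```rust
   use std::collections::BinaryHeap;
   
   /// Position of a row inside a Parquet file: (row group id, offset within it).
   type RowPos = (usize, usize);
   
   /// Track the k smallest EventTime values together with their physical
   /// positions, touching only the sort column instead of all ~100 columns.
   fn top_k_positions(
       event_times: &[i64],  // EventTime values of rows that passed the filter
       positions: &[RowPos], // physical position of each such row
       k: usize,
   ) -> Vec<RowPos> {
       // Max-heap of (value, position): the largest of the current k sits on
       // top, so it can be evicted as soon as a smaller value arrives.
       let mut heap: BinaryHeap<(i64, RowPos)> = BinaryHeap::new();
       for (&t, &pos) in event_times.iter().zip(positions) {
           if heap.len() < k {
               heap.push((t, pos));
           } else if let Some(&(max_t, _)) = heap.peek() {
               if t < max_t {
                   heap.pop();
                   heap.push((t, pos));
               }
           }
       }
       // Step 4 would then fetch only these positions from the file.
       let mut out: Vec<(i64, RowPos)> = heap.into_vec();
       out.sort();
       out.into_iter().map(|(_, pos)| pos).collect()
   }
   
   fn main() {
       let times = [50, 10, 40, 30, 20];
       let pos: Vec<RowPos> = (0..5).map(|i| (0, i)).collect();
       // The three smallest EventTime values sit at offsets 1, 4, and 3.
       println!("{:?}", top_k_positions(&times, &pos, 3));
   }
   ```
   
   With a heap of size k, each candidate row costs O(log k), and the wide columns are never read for rows that fall out of the top 10.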
   
   ## Explore
   Currently, q23 takes approximately 6 seconds to execute. I have confirmed that DataFusion does not have the aforementioned optimizations and still scans a very large number of rows and columns. By the way, is there a convenient way in `datafusion-cli` to view statistics on the number of rows and columns scanned? For now, I print the batch information directly from inside the `Like` expression, which gives the output below (it seems endless, and the amount of data being scanned appears to be very large, always with exactly 105 columns):
   
   
![Image](https://github.com/user-attachments/assets/d70e8ad8-1d35-46c1-961b-13575bcac2dd)
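   
   One thing that may partially answer my own question: `EXPLAIN ANALYZE` executes the query and annotates every operator with runtime metrics such as `output_rows`. A minimal sketch via the Rust API, where the `hits.parquet` path and the exact q23 text are my assumptions:
   
   ```rust
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       let ctx = SessionContext::new();
       ctx.register_parquet("hits", "hits.parquet", ParquetReadOptions::default())
           .await?;
       // EXPLAIN ANALYZE runs the query and prints the physical plan with
       // per-operator runtime metrics (e.g. output_rows, elapsed_compute),
       // showing how many rows each operator actually produced.
       ctx.sql(
           "EXPLAIN ANALYZE \
            SELECT * FROM hits WHERE URL LIKE '%google%' \
            ORDER BY EventTime LIMIT 10",
       )
       .await?
       .show()
       .await?;
       Ok(())
   }
   ```
   
   The same `EXPLAIN ANALYZE` statement can also be typed directly into `datafusion-cli`.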
   
   ## Some concerns
   Parquet files are composed of RowGroups; how difficult is it to read an individual page? In previous work I saw an optimization for exactly this scenario (built on DataFusion), but it used a custom columnar storage format, which made it easier to implement. There, on very large datasets (similar in scale to the hits dataset), the query time for `ORDER BY ... LIMIT` dropped to around 2 seconds.
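   
   To make the concern concrete: as far as I know, the Rust `parquet` crate does not expose single-page reads directly through its Arrow reader, but `RowSelection` combined with the page index gets close, since pages the selection never touches can be skipped. A minimal sketch of step 4, fetching one row by its recorded position; the path and indices are illustrative:
   
   ```rust
   use std::fs::File;
   use parquet::arrow::arrow_reader::{
       ArrowReaderOptions, ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
   };
   use parquet::arrow::ProjectionMask;
   
   /// Fetch a single row by its recorded (row group, offset) position.
   /// With the page index enabled, whole pages outside the selection
   /// can be skipped instead of decoded.
   fn fetch_row(path: &str, row_group: usize, offset: usize) -> Result<(), Box<dyn std::error::Error>> {
       let file = File::open(path)?;
       let options = ArrowReaderOptions::new().with_page_index(true);
       let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;
   
       // Skip `offset` rows inside the chosen row group, then take one row.
       let selection = RowSelection::from(vec![
           RowSelector::skip(offset),
           RowSelector::select(1),
       ]);
   
       let reader = builder
           .with_row_groups(vec![row_group])
           .with_projection(ProjectionMask::all()) // could be narrowed further
           .with_row_selection(selection)
           .build()?;
   
       for batch in reader {
           let batch = batch?;
           println!("fetched {} row(s) x {} columns", batch.num_rows(), batch.num_columns());
       }
       Ok(())
   }
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Hypothetical position recorded in step 3: row 12345 of row group 7.
       fetch_row("hits.parquet", 7, 12345)
   }
   ```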
   
   ## Summary
   
   Reading the bulk of the data can be deferred by first evaluating only the `ORDER BY` columns, which is very effective for the `ORDER BY ... LIMIT` scenario. I'm not sure whether DataFusion currently does this.

