twitu commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2147977834

   I ran some experiments with different combinations of `ORDER BY` and the 
`repartition_file_scans` configuration value, and the results are a bit 
unintuitive. The full results, instructions, and data are hosted in this 
repo [^1].
   
   The script reads a file and prints the number of rows in the first row 
group. Ideally it should only load the first row group.
   
   However, as the table below shows, even after specifying the sort order of 
the file, the `ORDER BY` query still loads the whole file and does some kind of 
sort operation. For a 100 MB file, memory use reaches roughly 900 MB (944 MB 
in the worst case). Turning on repartitioning improves both wall time and 
memory footprint somewhat.
   
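   ```
       // Imports assumed for this snippet (DataFusion releases with `Expr::Sort`).
       use datafusion::logical_expr::expr::Sort;
       use datafusion::prelude::{col, Expr, ParquetReadOptions};

       // Declare that the file is already sorted by `ts_init`, ascending, nulls first.
       let parquet_options = ParquetReadOptions::<'_> {
           skip_metadata: Some(false),
           file_sort_order: vec![vec![Expr::Sort(Sort {
               expr: Box::new(col("ts_init")),
               asc: true,
               nulls_first: true,
           })]],
           ..Default::default()
       };
   ```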
   
   The best result, though, comes from removing the sorting and turning off 
repartitioning. I believe that configuration loads only the one row group 
required. It would be very helpful to document this interaction.
   
   | order | repartition | wall time (s) | memory (MB) | read in sorted order |
   | -- | -- | -- | -- | -- |
   | true | true | 0.84 | 654 | ✅ |
   | true | false | 1.19 | 944 | ✅ |
   | false | false | 0.21 | 45 | ✅ |
   | false | true | 0.33 | 151 | ❌ |
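
To put the table in relative terms, here is a small standalone sketch that normalizes each run against the fastest one. It uses only the Rust standard library and the four measurements from the table; the names `Run`, `runs`, and `fastest` are made up for this illustration and are not part of DataFusion or the linked repo.

```rust
/// One row of the benchmark table above.
struct Run {
    order: bool,
    repartition: bool,
    wall_time_s: f64,
    memory_mb: f64,
}

/// The four measurements, copied verbatim from the table.
fn runs() -> Vec<Run> {
    vec![
        Run { order: true, repartition: true, wall_time_s: 0.84, memory_mb: 654.0 },
        Run { order: true, repartition: false, wall_time_s: 1.19, memory_mb: 944.0 },
        Run { order: false, repartition: false, wall_time_s: 0.21, memory_mb: 45.0 },
        Run { order: false, repartition: true, wall_time_s: 0.33, memory_mb: 151.0 },
    ]
}

/// Returns the run with the smallest wall time.
fn fastest(runs: &[Run]) -> &Run {
    runs.iter()
        .min_by(|a, b| a.wall_time_s.total_cmp(&b.wall_time_s))
        .expect("at least one run")
}

fn main() {
    let runs = runs();
    let base = fastest(&runs);
    // Print each configuration as a multiple of the fastest run.
    for r in &runs {
        println!(
            "order={:<5} repartition={:<5} time x{:.1} memory x{:.1}",
            r.order,
            r.repartition,
            r.wall_time_s / base.wall_time_s,
            r.memory_mb / base.memory_mb
        );
    }
}
```

With these numbers, the baseline is the run with both `order` and `repartition` off. The slowest run (`order` on, `repartition` off) takes about 5.7 times the wall time and about 21 times the memory of that baseline.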
   
   I'll share more results for reading the whole file.
   
   [^1]: https://github.com/nautechsystems/nautilus_experiments/tree/efficient-query
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

