twitu commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2119764091

   Setting `datafusion.optimizer.repartition_file_scans` to `false`, as shown 
below, fixes the ordering problem. :heavy_check_mark: 
   
   ```rust
       let session_cfg = SessionConfig::new()
           .set_str("datafusion.optimizer.repartition_file_scans", "false");
       let session_ctx = SessionContext::new_with_config(session_cfg);
   ```
   
   However, it's unclear how this setting interacts with other options, or how 
it affects memory and performance. Here's what I have so far:
   
   The data is guaranteed to be sorted by timestamp, which is declared like this:
   
   ```rust
       let parquet_options = ParquetReadOptions::<'_> {
           skip_metadata: Some(false),
           file_sort_order: vec![vec![Expr::Sort(Sort {
               expr: Box::new(col("ts_init")),
               asc: true,
               nulls_first: true,
           })]],
           ..Default::default()
       };
   ```
   
   Given that, there are two approaches to get the row groups/data in order:
   
   * Using an `ORDER BY` clause in the query: 
`session_ctx.sql("SELECT * FROM data ORDER BY ts_init")`. As per our 
[previous discussion], an `ORDER BY` on an already sorted column does not incur 
additional overhead.
   
   * Setting `datafusion.optimizer.repartition_file_scans` to `false` ensures 
that the data is read in sequential order of row groups.
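
To make the difference between the two approaches concrete, here is a toy model in plain Rust. It is not DataFusion code and every function name in it is made up for illustration. It assumes that a repartitioned scan hands contiguous ranges of the file to separate partitions, that those partitions are consumed in no particular order, and that an `ORDER BY` over partitions that are each already sorted can be served by a merge instead of a full sort.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy file: `row_groups` row groups of `rows_per_group` `ts_init` values,
/// sorted across the whole file.
fn make_file(row_groups: usize, rows_per_group: u64) -> Vec<Vec<u64>> {
    (0..row_groups as u64)
        .map(|g| (g * rows_per_group..(g + 1) * rows_per_group).collect())
        .collect()
}

/// Split the file into at most `n` contiguous ranges of row groups, one per
/// partition (an assumed stand-in for a repartitioned file scan).
fn split_into_partitions(file: &[Vec<u64>], n: usize) -> Vec<Vec<u64>> {
    let n = n.max(1);
    let chunk = ((file.len() + n - 1) / n).max(1); // ceiling division
    file.chunks(chunk).map(|c| c.concat()).collect()
}

/// Take one row from each partition in turn: a stand-in for an arbitrary
/// interleaving of partitions that are read in parallel.
fn interleave(parts: &[Vec<u64>]) -> Vec<u64> {
    let longest = parts.iter().map(Vec::len).max().unwrap_or(0);
    let mut out = Vec::new();
    for i in 0..longest {
        for part in parts {
            if let Some(&v) = part.get(i) {
                out.push(v);
            }
        }
    }
    out
}

/// K-way merge of partitions that are each already sorted, using a min-heap
/// that holds one head element per partition.
fn merge_sorted(parts: &[Vec<u64>]) -> Vec<u64> {
    let mut heap = BinaryHeap::new();
    for (p, part) in parts.iter().enumerate() {
        if let Some(&v) = part.first() {
            heap.push(Reverse((v, p, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, p, i))) = heap.pop() {
        out.push(v);
        if let Some(&next) = parts[p].get(i + 1) {
            heap.push(Reverse((next, p, i + 1)));
        }
    }
    out
}

/// True if the values are in non-decreasing order.
fn is_sorted(values: &[u64]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}
```

In this model, reading the row groups sequentially (`file.concat()`) is already sorted and needs no extra step, but only one partition does the reading. Splitting into partitions lets several readers work at once; the interleaved output is then out of order, and `merge_sorted` restores it while holding only one head element per partition in the heap. Whether DataFusion's real plans behave this way for a given query is best checked with `EXPLAIN`.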
   
   It's not clear to me how each option affects the performance and memory 
usage. Do you have any guidance around it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

