twitu commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2119764091
Setting `datafusion.optimizer.repartition_file_scans` to `false`, as shown here,
fixes the issue. :heavy_check_mark:
```rust
let session_cfg = SessionConfig::new()
    .set_str("datafusion.optimizer.repartition_file_scans", "false");
let session_ctx = SessionContext::new_with_config(session_cfg);
```
However, it's unclear to me how this setting interacts with other options and
how it affects memory usage and performance. Here's what I have so far.
The data is guaranteed to be sorted by timestamp, which is declared like this:
```rust
let parquet_options = ParquetReadOptions::<'_> {
    skip_metadata: Some(false),
    file_sort_order: vec![vec![Expr::Sort(Sort {
        expr: Box::new(col("ts_init")),
        asc: true,
        nulls_first: true,
    })]],
    ..Default::default()
};
```
Then there are two approaches to get the row groups/data in order:
* Using an `ORDER BY` clause in the query:
`session_ctx.sql("SELECT * FROM data ORDER BY ts_init")`. From our
[previous discussion], an `ORDER BY` on an already sorted column does not
incur additional overhead.
* Setting `datafusion.optimizer.repartition_file_scans` to `false` ensures
that the data is read in sequential order of row groups.
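To make the difference between the two approaches concrete, here is a rough mental model in plain Rust. It is a toy sketch and not DataFusion's code: `RowGroup`, `read_sequential`, `split_into_partitions`, `interleave` and `merge_sorted` are all made-up names, and the assumption that a repartitioned scan hands contiguous runs of row groups to separate partitions is mine. It only illustrates that a single sequential reader keeps file order for free, while a partitioned read needs an order-preserving merge to get the same result.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy stand-in for a row group: `ts_init` values, already sorted.
type RowGroup = Vec<u64>;

/// Single reader walking the row groups in file order
/// (the analogue of disabling scan repartitioning).
fn read_sequential(row_groups: &[RowGroup]) -> Vec<u64> {
    row_groups.iter().flatten().copied().collect()
}

/// Hypothetical repartitioned scan: contiguous runs of row groups are
/// assigned to up to `n` partitions. Each partition stays sorted internally.
fn split_into_partitions(row_groups: &[RowGroup], n: usize) -> Vec<Vec<u64>> {
    let n = n.max(1);
    let chunk = ((row_groups.len() + n - 1) / n).max(1);
    row_groups
        .chunks(chunk)
        .map(|run| run.iter().flatten().copied().collect())
        .collect()
}

/// Consuming partitions round-robin with no ordering guarantee:
/// the global order is lost.
fn interleave(partitions: &[Vec<u64>]) -> Vec<u64> {
    let longest = partitions.iter().map(Vec::len).max().unwrap_or(0);
    let mut out = Vec::new();
    for i in 0..longest {
        for part in partitions {
            if let Some(&v) = part.get(i) {
                out.push(v);
            }
        }
    }
    out
}

/// Order-preserving k-way merge: keeps one cursor per partition in a
/// min-heap and always emits the smallest pending value.
fn merge_sorted(partitions: &[Vec<u64>]) -> Vec<u64> {
    let mut heap = BinaryHeap::new();
    for (p, part) in partitions.iter().enumerate() {
        if let Some(&v) = part.first() {
            heap.push(Reverse((v, p, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, p, i))) = heap.pop() {
        out.push(v);
        if let Some(&next) = partitions[p].get(i + 1) {
            heap.push(Reverse((next, p, i + 1)));
        }
    }
    out
}
```

In this toy model the sequential reader needs no extra bookkeeping but uses one reader, while the partitioned read can run in parallel and pays for a merge step that tracks one cursor per partition. Whether the real engine's costs look like this is exactly what I'm asking about.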
It's not clear to me how each option affects performance and memory usage.
Do you have any guidance on this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]