alamb commented on issue #10572: URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2119371107
Thank you for the report and the reproducer ❤️

> read row groups in order they were written

This is not my expectation. As an optimization, DataFusion reads row groups in parallel with multiple threads, so they may be read out of order.

To preserve the order of the data, you can do either of the following:

- Set the [configuration](https://datafusion.apache.org/user-guide/configs.html) option `datafusion.optimizer.repartition_file_scans` to `false`.
- Communicate the order of the data in the files using the `CREATE EXTERNAL TABLE .. WITH ORDER` clause, and then explicitly ask for that order in your query.

> read the same values for the same row group even when the file increases in size

> read the same values as the python pyarrow parquet reader

Yes, I agree. These are also my expectations.

Maybe you can try setting `datafusion.optimizer.repartition_file_scans` to `false` and see if that makes the data consistent.
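
As a rough sketch of the two options above (for example in `datafusion-cli`): the table name `readings`, the columns `ts` and `value`, and the path `data.parquet` are hypothetical placeholders, not taken from the reproducer, and the exact clause order may differ between DataFusion versions.

```sql
-- Option 1: turn off repartitioning of file scans for this session,
-- so a single file is not split across multiple scan partitions.
SET datafusion.optimizer.repartition_file_scans = false;

-- Option 2: declare the order the file was written in, then ask for it.
-- Table name, column names, and path are hypothetical placeholders.
CREATE EXTERNAL TABLE readings (
    ts BIGINT,
    value DOUBLE
)
STORED AS PARQUET
WITH ORDER (ts ASC)
LOCATION 'data.parquet';

-- Without an explicit ORDER BY, no output order is guaranteed.
SELECT ts, value FROM readings ORDER BY ts;
```

With the order declared in the table definition, the planner may be able to skip the sort for the final `ORDER BY ts`, but the `ORDER BY` is still what guarantees the output order.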
