Re: [I] Row groups are read out of order or with completely different values [datafusion]

via GitHub Sun, 19 May 2024 14:55:21 -0700


alamb commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2119371107


   Thank you for the report and the reproducer ❤️ 
   
   > read row groups in order they were written
   
   This is not my expectation. 
   
   DataFusion reads row groups in parallel with multiple threads as an 
optimization, so they may be returned out of order. To preserve the order of 
the data, you can either set the 
[configuration](https://datafusion.apache.org/user-guide/configs.html) option 
`datafusion.optimizer.repartition_file_scans` to `false`, or communicate the 
order of the data in the files using the `CREATE EXTERNAL TABLE .. WITH ORDER` 
clause and then explicitly ask for that order in your query.
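   
   As a rough sketch of the two options, assuming the statements are run 
through `datafusion-cli` (the table name `readings`, its columns, and the file 
path are made up for illustration and are not from the report):
   
```sql
-- Option 1: turn off repartitioning of file scans for this session.
SET datafusion.optimizer.repartition_file_scans = false;

-- Option 2: declare the order the file was written in, then ask for it.
-- The table name, columns, and path below are hypothetical.
CREATE EXTERNAL TABLE readings (
    ts BIGINT,
    value DOUBLE
)
STORED AS PARQUET
WITH ORDER (ts ASC)
LOCATION 'readings.parquet';

-- The ORDER BY matches the declared order.
SELECT * FROM readings ORDER BY ts ASC;
```
   
   With the second option the query still has to state the `ORDER BY` 
explicitly; the `WITH ORDER` clause only tells DataFusion how the data in the 
files is already sorted.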
   
   > read the same values for the same row group even when the file increases 
in size
   > read the same values as the python pyarrow parquet reader
   
   Yes, I agree. These are also my expectations.
   
   Maybe you can try setting `datafusion.optimizer.repartition_file_scans` to 
`false` and see if that makes the data consistent.
   


