twitu opened a new issue, #10572: URL: https://github.com/apache/datafusion/issues/10572
### Describe the bug

DataFusion reads row groups out of order, and sometimes returns completely different values for a row group. The data was verified by reading the same files with the Python `pyarrow.parquet` library. The `pyarrow` and `datafusion` readers return the same values when the file has 126 row groups, but completely different values when the file has 127 row groups.

### To Reproduce

https://github.com/nautechsystems/nautilus_experiments/tree/datafusion-bug

The steps, data and results are documented in this repo and branch. The README is reproduced here.

---

Use the Python script to extract row group information from the parquet files using pyarrow.

```bash
pip install -r requirements.txt
python extract_ts_init.py 126-groups.parquet 126-groups-python.csv
python extract_ts_init.py 127-groups.parquet 127-groups-python.csv
```

Run the Rust executable to extract row group information from the parquet files using datafusion.

```bash
cargo run 126-groups.parquet > 126-groups-rust.csv
cargo run 127-groups.parquet > 127-groups-rust.csv
```

Ideally there should be no difference between the Python and Rust CSV files for the same parquet file. With 126 row groups the outputs match, but with 127 row groups Python and Rust give different results. The diffs below show this.

```bash
diff 126-groups-rust.csv 126-groups-python.csv
# no diff
diff 127-groups-rust.csv 127-groups-python.csv
# big diff, things crazy
```

We can also confirm that both files come from the same data source, with just one extra row group: the following command shows that the 127-group Python output has only one extra entry at the end.
```bash
diff 126-groups-python.csv 127-groups-python.csv
```

### Expected behavior

The DataFusion reader should

* read row groups in the order they were written
* read the same values for the same row group even when the file increases in size
* read the same values as the Python `pyarrow` parquet reader

### Additional context

_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
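For triage, it may help to know where the Rust and Python outputs first diverge, not only that they differ. The following is a minimal sketch, not part of the linked repo: `first_divergence` and `read_lines` are hypothetical helpers, and the sketch assumes each CSV has one line per row group, so the returned 0-based line index maps to a row group only if that holds (and shifts by one if the files carry a header line).

```python
import sys
from itertools import zip_longest


def first_divergence(left, right):
    """Return (index, left_line, right_line) for the first differing line.

    Returns None when both sequences are identical. If one side is
    shorter, its missing line is reported as None.
    """
    for index, (a, b) in enumerate(zip_longest(left, right)):
        if a != b:
            return index, a, b
    return None


def read_lines(path):
    """Read a text file into a list of lines without line terminators."""
    with open(path, newline="") as handle:
        return [line.rstrip("\r\n") for line in handle]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python first_divergence.py rust.csv python.csv
    result = first_divergence(read_lines(sys.argv[1]), read_lines(sys.argv[2]))
    if result is None:
        print("no difference")
    else:
        index, rust_line, python_line = result
        print(f"first difference at line index {index}")
        print(f"  left:  {rust_line}")
        print(f"  right: {python_line}")
```

Running it on `127-groups-rust.csv` and `127-groups-python.csv` would show whether the outputs disagree from the first row group onward or only after some point, which narrows down whether the problem is ordering or content.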
