[I] Row groups are read out of order or with completely different values [datafusion]

via GitHub Sat, 18 May 2024 11:36:13 -0700


twitu opened a new issue, #10572:
URL: https://github.com/apache/datafusion/issues/10572


   ### Describe the bug
   
   DataFusion reads row groups out of order, and sometimes returns completely 
different values for a row group. The data was verified by reading the same 
files with the Python `pyarrow.parquet` library.
   
   The `pyarrow` and `datafusion` readers return the same values when the file 
has 126 row groups, but completely different values when the file has 127 
row groups.
   
   
   ### To Reproduce
   
   https://github.com/nautechsystems/nautilus_experiments/tree/datafusion-bug
   
   The steps, data and results are documented in this repo and branch. The 
README is shared here again.
   
   ---
   
   Use the python script to extract row group information from the parquet 
files using pyarrow.
   
   ```bash
   pip install -r requirements.txt
   python extract_ts_init.py 126-groups.parquet 126-groups-python.csv
   python extract_ts_init.py 127-groups.parquet 127-groups-python.csv
   ```
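For readers without the repository at hand, here is a minimal sketch of what a per-row-group extraction with `pyarrow` could look like. It is not the actual `extract_ts_init.py`: the column name `ts_init` is only inferred from the script name, and the CSV layout and the function names `row_group_rows` and `main` are assumptions made for illustration.

```python
import csv


def row_group_rows(parquet_file, column="ts_init"):
    """Yield (row_group_index, num_rows, first_value, last_value) per row group.

    `parquet_file` is expected to behave like pyarrow.parquet.ParquetFile.
    The default column name is an assumption, not taken from the issue.
    """
    for i in range(parquet_file.num_row_groups):
        table = parquet_file.read_row_group(i, columns=[column])
        values = table.column(column).to_pylist()
        first = values[0] if values else None
        last = values[-1] if values else None
        yield i, len(values), first, last


def main(src, dst):
    # pyarrow is a third-party dependency; it is imported here so that the
    # helper above can be exercised without it.
    import pyarrow.parquet as pq

    with open(dst, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["row_group", "num_rows", "first", "last"])
        writer.writerows(row_group_rows(pq.ParquetFile(src)))
```

Calling `main("126-groups.parquet", "126-groups-python.csv")` would then write one CSV line per row group, in the order the row groups appear in the file.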
   
   Run the rust executable to extract row group information from the parquet 
files using datafusion.
   
   ```bash
   cargo run 126-groups.parquet > 126-groups-rust.csv
   cargo run 127-groups.parquet > 127-groups-rust.csv
   ```
   
   The CSV files produced by Python and Rust should be identical for the same 
parquet file. They are identical for the 126-group file, but the 127-group file 
gives different results for Python and Rust.
   
   The comparison below confirms that there is no difference with 126 groups.
   
   ```bash
   diff 126-groups-rust.csv 126-groups-python.csv # no diff
   diff 127-groups-rust.csv 127-groups-python.csv # big diff, values do not match
   ```
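When the diff is large, it can help to locate the first row at which the two outputs diverge. The sketch below does that with the standard library only. The helper name `first_mismatch` is hypothetical, and it assumes both CSV files list one row group per line in the same column layout.

```python
import csv
from itertools import zip_longest


def first_mismatch(path_a, path_b):
    """Return (line_number, row_a, row_b) for the first differing row, or None.

    A missing row (one file shorter than the other) is reported as None
    on the shorter side.
    """
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        pairs = zip_longest(csv.reader(fa), csv.reader(fb))
        for line_number, (row_a, row_b) in enumerate(pairs, start=1):
            if row_a != row_b:
                return line_number, row_a, row_b
    return None
```

For example, `first_mismatch("127-groups-rust.csv", "127-groups-python.csv")` would point at the first row group where the two readers disagree.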
   
   We can also confirm that both files come from the same data source and 
differ by just one extra row group. The following command shows that the 
127-group Python output has only one extra entry at the end.
   
   ```bash
   diff 126-groups-python.csv 127-groups-python.csv
   ```
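The same prefix check can be scripted. This is a sketch under the assumption that each CSV line describes one row group; the helper name `extra_rows` is hypothetical.

```python
import csv


def extra_rows(shorter_path, longer_path):
    """Return the rows appended in longer_path, or None if the prefix differs."""
    with open(shorter_path, newline="") as fs, open(longer_path, newline="") as fl:
        shorter = list(csv.reader(fs))
        longer = list(csv.reader(fl))
    # The longer file must start with exactly the rows of the shorter one.
    if longer[: len(shorter)] != shorter:
        return None
    return longer[len(shorter):]
```

If the two Python outputs really differ by a single row group, `extra_rows("126-groups-python.csv", "127-groups-python.csv")` should return a list with exactly one row.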
   
   
   
   ### Expected behavior
   
   The DataFusion reader should:
   * read row groups in the order they were written
   * read the same values for the same row group even when the file increases 
in size
   * read the same values as the Python `pyarrow` parquet reader
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

