[I] Querying Parquet file specifically with a predicate returns invalid data error but works in other situations [datafusion]

via GitHub Fri, 24 Jan 2025 09:27:13 -0800


senyosimpson opened a new issue, #14281:
URL: https://github.com/apache/datafusion/issues/14281


   ### Describe the bug
   
   When making a query _with a predicate_ against Parquet files generated with 
[parquet-go](https://github.com/parquet-go/parquet-go), DataFusion fails with 
an error saying the data is invalid. Without a predicate, the same query works 
fine.
   
   When using the CLI, I get the error:
   
   ```
   » datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet' where age > 10"
   DataFusion CLI v44.0.0
   Error: External error: Parquet error: External: bad data
   ```
   
   In my application, it is more descriptive, showing:
   
   ```
   ParquetError(External(ProtocolError { kind: InvalidData, message: "cannot convert 2 into TType" }))
   ```
   
   However, the file appears to be intact. The metadata is read and 
interpreted successfully:
   
   ```
   » datafusion-cli --command "describe 'go-parquet-writer/go-testfile.parquet'"
   DataFusion CLI v44.0.0
   +---------------+-------------------------------------+-------------+
   | column_name   | data_type                           | is_nullable |
   +---------------+-------------------------------------+-------------+
   | city          | Utf8View                            | NO          |
   | country       | Utf8View                            | NO          |
   | age           | UInt8                               | NO          |
   | scale         | Int16                               | NO          |
   | status        | UInt32                              | NO          |
   | time_captured | Timestamp(Millisecond, Some("UTC")) | NO          |
   | checked       | Boolean                             | NO          |
   +---------------+-------------------------------------+-------------+
   7 row(s) fetched.
   Elapsed 0.001 seconds.
   ```
   
   When I run the query without a predicate, I get the data back:
   
   ```
   » datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet'"
   DataFusion CLI v44.0.0
   +--------+---------+-----+-------+--------+--------------------------+---------+
   | city   | country | age | scale | status | time_captured            | checked |
   +--------+---------+-----+-------+--------+--------------------------+---------+
   | Madrid | Spain   | 10  | -1    | 12     | 2025-01-24T16:34:00.715Z | false   |
   | Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
   +--------+---------+-----+-------+--------+--------------------------+---------+
   2 row(s) fetched.
   Elapsed 0.002 seconds.
   ```
   
   It even works if I use `ORDER BY` or `GROUP BY`:
   
   ```
   » datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet' ORDER BY age DESC"
   DataFusion CLI v44.0.0
   +--------+---------+-----+-------+--------+--------------------------+---------+
   | city   | country | age | scale | status | time_captured            | checked |
   +--------+---------+-----+-------+--------+--------------------------+---------+
   | Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
   | Madrid | Spain   | 10  | -1    | 12     | 2025-01-24T16:34:00.715Z | false   |
   +--------+---------+-----+-------+--------+--------------------------+---------+
   2 row(s) fetched.
   Elapsed 0.010 seconds.
   
   » datafusion-cli --command "select city, SUM(age) AS age from 'go-parquet-writer/go-testfile.parquet' GROUP BY city"
   DataFusion CLI v44.0.0
   +--------+-----+
   | city   | age |
   +--------+-----+
   | Athens | 32  |
   | Madrid | 10  |
   +--------+-----+
   2 row(s) fetched.
   Elapsed 0.004 seconds.
   ```
   
   Additionally, loading the Parquet file with `PyArrow` and `Pandas` and 
filtering it there works as well.
   
   ### To Reproduce
   
   The issue can be reproduced by creating a Parquet file with the `parquet-go` 
library and attempting to query it with a predicate in the query. To simplify, 
I created a [public repo](https://github.com/senyosimpson/fusion-repro) that 
has code to generate the file and similar examples in the README as shown in 
this report. A test file can be found in 
`go-parquet-writer/go-testfile.parquet`, generated by the Go program in that 
directory.
   
   I've also gone through the effort of trying to achieve the same using 
PyArrow and Pandas (which you'll see in the repo under `pyarrow-ex`) to verify 
the Parquet file is not corrupted in some way. This works as expected.
   
   ### Expected behavior
   
   Parquet files created by `parquet-go` can be queried successfully when 
the query contains a predicate.
   
   ### Additional context
   
   From everything I've gathered, this error likely comes from this 
[conversion 
function](https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/rs/src/protocol/compact.rs#L660-L679).
 However, `0x02` is only accepted when a [collection is being 
parsed](https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/rs/src/protocol/compact.rs#L653-L658).
 Weirdly, I don't have any list/map/set in my schema. I assume this means 
`0x02` is being used to encode something else, but that is beyond my knowledge.
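   To make the suspected failure mode concrete, here is a minimal Python sketch of how I understand the two conversion paths. It is a simplified model for illustration, not the actual Rust code: the names mirror the linked functions, the type table follows the Thrift compact protocol (where `0x01` and `0x02` in a field header carry a boolean value rather than a type), and the assumption that the collection path accepts both `0x01` and `0x02` as a boolean element type is mine.

```python
"""Simplified model of Thrift compact-protocol type-ID decoding (illustrative only)."""

# Type IDs accepted by the general conversion. 0x01 and 0x02 are absent
# because, in field headers, they encode boolean true/false values.
_GENERAL_TYPES = {
    0x00: "Stop",
    0x03: "I08",
    0x04: "I16",
    0x05: "I32",
    0x06: "I64",
    0x07: "Double",
    0x08: "String",
    0x09: "List",
    0x0A: "Set",
    0x0B: "Map",
    0x0C: "Struct",
}

# Assumption: the collection path treats both IDs as a boolean element type.
_COLLECTION_BOOL_IDS = {0x01, 0x02}


def u8_to_type(b: int) -> str:
    """General conversion: rejects any ID that is not in the table."""
    try:
        return _GENERAL_TYPES[b]
    except KeyError:
        raise ValueError(f"cannot convert {b} into TType") from None


def collection_u8_to_type(b: int) -> str:
    """Collection conversion: handles boolean IDs first, then falls back."""
    if b in _COLLECTION_BOOL_IDS:
        return "Bool"
    return u8_to_type(b)
```

   In this model, `collection_u8_to_type(0x02)` returns `"Bool"`, while `u8_to_type(0x02)` raises an error with the same message as in the report. That would be consistent with a `0x02` byte reaching the general path, although I have not confirmed where that happens.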
   
   I went spelunking in the `parquet-go` codebase. The Thrift protocol 
implementation is split among [the compact 
protocol](https://github.com/parquet-go/parquet-go/blob/main/encoding/thrift/compact.go),
 [the Thrift type 
definitions](https://github.com/parquet-go/parquet-go/blob/main/encoding/thrift/thrift.go),
 and [the encoding 
logic](https://github.com/parquet-go/parquet-go/blob/main/encoding/thrift/encode.go).


