parthchandra opened a new issue, #3142:
URL: https://github.com/apache/parquet-java/issues/3142

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The DataFusion Comet project's unit tests use the ExampleParquetWriter to 
create Parquet files - 
https://github.com/apache/datafusion-comet/blob/996362e78d497c02542f1e29dbb7cba3ec16f64c/spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala#L432
   This was inspired by similar unit test code in Spark - 
https://github.com/apache/spark/blob/ece14704cc083f17689d2e0b9ab8e31cf71a7a2d/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala#L871
   
   The output files can contain uint8 and uint16 values that are illegal per 
the spec. For example in this file - 
   
   
[alltypes_extended_plain.parquet.zip](https://github.com/user-attachments/files/18623383/alltypes_extended_plain.parquet.zip)
   
   The columns `_9` and `_10` are `uint_8` and `uint_16` columns and contain 
illegal negative values. 
   
   ```
   {"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, 
"_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, 
"_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, 
"_19": null, "_20": null}
   {"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, 
"_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, 
"_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, 
"_19": null, "_20": null}
   {"_1": true, "_2": 18, "_3": 10002, "_4": 10002, "_5": 10002, "_6": 10002.0, 
"_7": 10002.0, "_8": 
"100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002",
 "_9": -18, "_10": -10002, "_11": -10002, "_12": -10002, "_13": "10002", "_14": 
[50, 50, 50], "_15": 10002, "_16": 10002, "_17": [50, 50, 50, 50, 50, 50, 50, 
50, 50, 50, 50, 50, 50, 50, 50, 50], "_18": 10002, "_19": 10002, "_20": 10002}
   {"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, 
"_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, 
"_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, 
"_19": null, "_20": null}
   {"_1": true, "_2": 20, "_3": 10004, "_4": 10004, "_5": 10004, "_6": 10004.0, 
"_7": 10004.0, "_8": 
"100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004",
 "_9": -20, "_10": -10004, "_11": -10004, "_12": -10004, "_13": "10004", "_14": 
[52, 52, 52], "_15": 10004, "_16": 10004, "_17": [52, 52, 52, 52, 52, 52, 52, 
52, 52, 52, 52, 52, 52, 52, 52, 52], "_18": 10004, "_19": 10004, "_20": 10004}
   {"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, 
"_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, 
"_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, 
"_19": null, "_20": null}
   {"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, 
"_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, 
"_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, 
"_19": null, "_20": null}
   {"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, 
"_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, 
"_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, 
"_19": null, "_20": null}
   {"_1": true, "_2": 24, "_3": 10008, "_4": 10008, "_5": 10008, "_6": 10008.0, 
"_7": 10008.0, "_8": 
"100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008",
 "_9": -24, "_10": -10008, "_11": -10008, "_12": -10008, "_13": "10008", "_14": 
[56, 56, 56], "_15": 10008, "_16": 10008, "_17": [56, 56, 56, 56, 56, 56, 56, 
56, 56, 56, 56, 56, 56, 56, 56, 56], "_18": 10008, "_19": 10008, "_20": 10008}
   {"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, 
"_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, 
"_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, 
"_19": null, "_20": null}
   ```
   
   Taking as an example the first non-null value in the `uint_8` column, the bit 
pattern written to the file is `0xffffffee`, which is read as a negative value 
(`-18`) and is illegal for an unsigned int. 
   
   The value originates in this line - 
https://github.com/apache/datafusion-comet/blob/996362e78d497c02542f1e29dbb7cba3ec16f64c/spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala#L520
 
   where a negative value is cast to a byte and then written to Parquet.  The 
Parquet writer needs to cast correctly to a larger type before writing to the 
file. 
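
   As a rough illustration of the cast described above, the following Java 
sketch contrasts sign extension (what an implicit byte-to-int conversion does) 
with zero extension. The class and method names are hypothetical and are not 
taken from parquet-java or Comet; whether the fix belongs in the writer or in 
the test data is not decided here.

```java
final class UnsignedWidening {
    // Sign extension: the implicit byte-to-int conversion copies the sign bit upward.
    static int signExtend(byte b) {
        return b;
    }

    // Zero extension: keep only the low 8 bits, so the result is always in [0, 255].
    static int zeroExtend(byte b) {
        return Byte.toUnsignedInt(b);
    }

    // Zero extension for 16-bit values: the result is always in [0, 65535].
    static int zeroExtend(short s) {
        return Short.toUnsignedInt(s);
    }

    public static void main(String[] args) {
        byte b = (byte) -18; // a negative value cast to a byte, as in the test code
        // prints "sign-extended: 0xffffffee (-18)"
        System.out.printf("sign-extended: 0x%08x (%d)%n", signExtend(b), signExtend(b));
        // prints "zero-extended: 0x000000ee (238)"
        System.out.printf("zero-extended: 0x%08x (%d)%n", zeroExtend(b), zeroExtend(b));
    }
}
```

   The sign-extended result matches the `0xffffffee` pattern reported above, 
while the zero-extended result stays within the range of an unsigned 8-bit 
value.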
   
   The values written can be read by the parquet-java reader, but other 
implementations are free to return an error or null for such values, which is 
not desirable. 
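
   To make the reader-side concern concrete, here is a minimal sketch of the 
kind of range check a strict reader might apply to the INT32 physical value of 
a `uint_8` or `uint_16` column. This is an assumption for illustration only; 
the names are hypothetical and it does not describe the behaviour of any 
particular Parquet implementation.

```java
import java.util.OptionalInt;

final class UnsignedRangeCheck {
    // Largest value representable in an unsigned integer of the given bit width (8 or 16).
    static int maxUnsigned(int bitWidth) {
        return (1 << bitWidth) - 1;
    }

    // Returns the value if it is legal for the unsigned bit width, otherwise empty
    // (a strict reader might surface the empty case as null or as an error).
    static OptionalInt checkUnsigned(int physicalValue, int bitWidth) {
        if (physicalValue < 0 || physicalValue > maxUnsigned(bitWidth)) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(physicalValue);
    }

    public static void main(String[] args) {
        int stored = 0xffffffee; // bit pattern reported in this issue
        System.out.println(stored);                   // prints -18
        System.out.println(checkUnsigned(stored, 8)); // prints OptionalInt.empty
        System.out.println(checkUnsigned(0xee, 8));   // prints OptionalInt[238]
    }
}
```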
   
   ### Component(s)
   
   Core



