alamb commented on issue #75: URL: https://github.com/apache/parquet-testing/issues/75#issuecomment-2847815424
I think https://github.com/apache/parquet-testing/pull/76 is close to mergeable (thank you @emkornfield for the review).

However, as @neilechao pointed out on a recent Parquet call, even after https://github.com/apache/parquet-testing/pull/76 there are still no example Parquet files that have variant values. As described above, it is important to have actual Parquet files too.

The reason I didn't add Parquet files is that I could not figure out how to create a properly annotated Parquet file with Apache Spark.

Perhaps we could use the parquet-cpp library now that https://github.com/apache/arrow/pull/45375 (also from @neilechao) has merged 🤔

# What I have tried so far

I tried to get Spark to write a Parquet file with correct annotations using the regen.py script from https://github.com/apache/parquet-testing/pull/76:

```python
spark.sql("SELECT * FROM output").repartition(1).write.parquet('variant.parquet')
```

This results in this file (needed `.txt` extension to upload to GitHub): [part-00000-6df14c7c-3b09-4678-b311-0ac6199a7857-c000.snappy.parquet.txt](https://github.com/user-attachments/files/20016594/part-00000-6df14c7c-3b09-4678-b311-0ac6199a7857-c000.snappy.parquet.txt)

This file's schema doesn't have the [Variant logical annotation (source)](https://github.com/apache/parquet-format/blob/3ce0760933b875bc8a11f5be0b883cd107b95b43/src/main/thrift/parquet.thrift#L406-L413), which we can see from the schema dump:

```shell
parquet-dump-schema variant.parquet/part-00000-855cbfbf-1e1b-4557-87bb-b6e83aa5fb9c-c000.snappy.parquet
```

```
required group field_id=-1 spark_schema {
  optional binary field_id=-1 name (String);
  optional group field_id=-1 variant_col {  <-- this field should have the Variant logical type annotation
    required binary field_id=-1 value;
    required binary field_id=-1 metadata;
  }
  optional binary field_id=-1 json_col (String);
}
```

It DOES have some Spark-specific metadata, which I think is how Spark detects that the column contains a variant:

Key:
`org.apache.spark.sql.parquet.row.metadata`

Value:

```json
{
  "type":"struct",
  "fields":[
    {"name":"name","type":"string","nullable":true,"metadata":{"__CHAR_VARCHAR_TYPE_STRING":"varchar(2000)"}},
    {"name":"variant_col","type":"variant","nullable":true,"metadata":{}},
    {"name":"json_col","type":"string","nullable":true,"metadata":{}}
  ]
}
```
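
To illustrate how a reader could pick up the variant column from that Spark-specific metadata alone, here is a minimal sketch. It assumes the value of the `org.apache.spark.sql.parquet.row.metadata` key has already been extracted from the file footer as a string. `variant_columns` is a hypothetical helper (not part of Spark or parquet-testing), and it only looks at top-level fields. Whether Spark itself detects variants this way is my guess, as noted above.

```python
import json

# Value of the `org.apache.spark.sql.parquet.row.metadata` key,
# copied from the file described in this comment.
ROW_METADATA = """
{
  "type":"struct",
  "fields":[
    {"name":"name","type":"string","nullable":true,"metadata":{"__CHAR_VARCHAR_TYPE_STRING":"varchar(2000)"}},
    {"name":"variant_col","type":"variant","nullable":true,"metadata":{}},
    {"name":"json_col","type":"string","nullable":true,"metadata":{}}
  ]
}
"""


def variant_columns(row_metadata):
    """Return the names of top-level fields whose Spark type is "variant".

    Hypothetical helper for illustration only. Nested types appear as JSON
    objects rather than strings, so they never compare equal to "variant"
    and are skipped.
    """
    schema = json.loads(row_metadata)
    return [
        field["name"]
        for field in schema.get("fields", [])
        if field.get("type") == "variant"
    ]


print(variant_columns(ROW_METADATA))  # prints ['variant_col']
```

This only shows that the information lives in the Spark key-value metadata. It does not make the file usable by readers that rely on the Variant logical type annotation in the Parquet schema itself.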
