alamb commented on issue #75:
URL: https://github.com/apache/parquet-testing/issues/75#issuecomment-2847815424

   I think https://github.com/apache/parquet-testing/pull/76 is close to mergeable (thank you @emkornfield for the review).
   
   However, as @neilechao pointed out on a recent Parquet call, even after https://github.com/apache/parquet-testing/pull/76 there are still no example Parquet files that contain Variant values. As described above, it is important to have actual Parquet files too.
   
   The reason I didn't add Parquet files is that I could not figure out how to create a properly annotated Parquet file with Apache Spark. Perhaps we could use the parquet-cpp library now that https://github.com/apache/arrow/pull/45375 (also from @neilechao) has merged 🤔
   
   # What I have tried so far
   
   I tried to get Spark to write a Parquet file with correct annotations using 
the regen.py script from https://github.com/apache/parquet-testing/pull/76:
   
   ```python
   spark.sql("SELECT * FROM output").repartition(1).write.parquet('variant.parquet')
   ```
   
   This results in this file (a `.txt` extension was needed to upload it to GitHub): [part-00000-6df14c7c-3b09-4678-b311-0ac6199a7857-c000.snappy.parquet.txt](https://github.com/user-attachments/files/20016594/part-00000-6df14c7c-3b09-4678-b311-0ac6199a7857-c000.snappy.parquet.txt)
   
   This file's schema doesn't have the [Variant logical annotation (source)](https://github.com/apache/parquet-format/blob/3ce0760933b875bc8a11f5be0b883cd107b95b43/src/main/thrift/parquet.thrift#L406-L413), which we can see by dumping the schema:
   
   ```shell
   parquet-dump-schema variant.parquet/part-00000-855cbfbf-1e1b-4557-87bb-b6e83aa5fb9c-c000.snappy.parquet
   ```
   
   ```
   required group field_id=-1 spark_schema {
     optional binary field_id=-1 name (String);
     optional group field_id=-1 variant_col {     <-- this field should have the Variant logical type annotation
       required binary field_id=-1 value;
       required binary field_id=-1 metadata;
     }
     optional binary field_id=-1 json_col (String);
   }
   ```
   
   It DOES have some Spark-specific metadata, which I think is how Spark detects that the column contains a Variant:
   
   Key: `org.apache.spark.sql.parquet.row.metadata`
   Value:
   ```json
   {
     "type": "struct",
     "fields": [
       {"name": "name", "type": "string", "nullable": true, "metadata": {"__CHAR_VARCHAR_TYPE_STRING": "varchar(2000)"}},
       {"name": "variant_col", "type": "variant", "nullable": true, "metadata": {}},
       {"name": "json_col", "type": "string", "nullable": true, "metadata": {}}
     ]
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

