chenhao-db opened a new pull request, #46071:
URL: https://github.com/apache/spark/pull/46071

   ### What changes were proposed in this pull request?
   
   This PR adds support for the variant type in the JSON scan.
   
   As part of this PR, we introduce one new JSON option, set via 
`spark.read.format("json").option("singleVariantColumn", "colName")`. Setting 
this option specifies that each JSON document should be ingested into a single 
variant column named `colName`. When this option is used, the user must not 
specify a schema; the schema is inferred as `colName variant`.
   
   ### Example 1 (multiple variant fields)
   
   JSON files can be ingested into variant fields, e.g.
   ```
   spark.read.format("json").schema("i int, var variant, arr ARRAY<variant>").load("a.json").show(false)
   ```
   for a file with the following data:
   ```
   {"i": 1, "var": {"d": "+94875-04-12", "string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}, "arr": [{"a": 1}, {"b": 2}, {"c": 3, "d": [1, 2, 3]}]}
   {"i": 2, "var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}}
   {}
   {"i": 3}
   ```
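   For intuition, the per-field mapping in this example can be sketched in plain Python (a simulation of the described behavior, not Spark internals): a non-variant field such as `i` is parsed into its declared type, variant fields like `var` and `arr` keep the parsed JSON value as-is, and fields missing from a document become null.
   ```python
   import json

   def scan_with_variant_fields(lines, variant_fields):
       """Simulate a JSON scan where some fields are variant: variant
       fields keep the parsed JSON value; missing fields become None."""
       rows = []
       for line in lines:
           doc = json.loads(line)
           rows.append({"i": doc.get("i"),
                        **{f: doc.get(f) for f in variant_fields}})
       return rows

   rows = scan_with_variant_fields(
       ['{"i": 1, "var": {"a": 1}, "arr": [{"b": 2}]}', '{}', '{"i": 3}'],
       ["var", "arr"])
   # rows[1] == {"i": None, "var": None, "arr": None}
   ```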
   
   ### Example 2 (one variant field)
   
   Here's another example with a single variant field:
   ```
   spark.read.format("json").schema("var variant").load("a.json").show(false)
   ```
   for a file with the following data:
   ```
   {"var": {"d": "+94875-04-12", "string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}}
   {"var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}}
   {}
   ```
   
   ### Example 3 (singleVariantColumn option)
   Each JSON document can also be ingested into a single variant column, e.g.
   ```
   spark.read.format("json").option("singleVariantColumn", "var").load("a.json").show(false)
   ```
   for a file with the following data:
   ```
   {"i": 1, "var": {"d": "+94875-04-12", "string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}, "arr": [{"a": 1}, {"b": 2}, {"c": 3, "d": [1, 2, 3]}]}
   {"i": 2, "var": {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}}
   {}
   {"i": 3}
   ```
   
   ### Why are the changes needed?
   
   It allows Spark to ingest variant values directly from the JSON data source. 
Previously, the `parse_json` expression could only operate on a string column 
that already existed in a table.
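   The difference between the two workflows can be sketched in plain Python (with `json.loads` standing in for both the variant scan and the `parse_json` expression; this is an illustration of the data flow, not Spark code):
   ```python
   import json

   # Before: ingest each document into a string column, then parse it
   # in a second step (json.loads stands in for parse_json).
   def old_two_step(lines):
       string_rows = [{"value": line} for line in lines]  # text scan
       return [{"var": json.loads(r["value"])} for r in string_rows]

   # After: the JSON scan produces the variant column directly.
   def new_direct(lines):
       return [{"var": json.loads(line)} for line in lines]

   lines = ['{"a": 1}', '{"b": [1, 2]}']
   assert old_two_step(lines) == new_direct(lines)
   ```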
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   ### How was this patch tested?
   
   Unit tests that verify the result and error reporting in JSON scan.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

