[ https://issues.apache.org/jira/browse/SPARK-42879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jiri Humpolicek updated SPARK-42879:
------------------------------------
    Affects Version/s: 3.5.4

> Spark SQL reads unnecessary nested fields
> -----------------------------------------
>
>                 Key: SPARK-42879
>                 URL: https://issues.apache.org/jira/browse/SPARK-42879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.2, 4.0.0, 3.5.2, 3.5.4
>            Reporter: Jiri Humpolicek
>            Priority: Major
>
> When a query uses more than one field of a struct after explode, all fields
> of that struct are read.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>   "items": [
>     {"itemId": 1, "itemData1": "a", "itemData2": 11},
>     {"itemId": 2, "itemData1": "b", "itemData2": 22}
>   ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) Read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read
>   .select(explode('items).as('item))
>   .select($"item.itemId", $"item.itemData1")
>   .explain
> // ReadSchema: struct<items:array<struct<itemData1:string,itemData2:bigint,itemId:bigint>>>
> {code}
> The query uses only the *itemId* and *itemData1* fields of the struct in the
> array, but the read schema contains the *itemData2* field as well.
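The read schema in the report is taken from the `explain` output by eye. As a rough sketch, the same check can be done programmatically by collecting the `requiredSchema` of each `FileSourceScanExec` node in the physical plan. The helper name `readSchemas`, the local session, and the temporary Parquet path below are assumptions made for illustration and are not part of the report; the sketch also reads by path instead of `saveAsTable` and assumes the default (v1) Parquet file source.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.functions.{col, explode}

// Local session for illustration only; in spark-shell the existing `spark` can be reused.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SPARK-42879-check")
  .getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// Assumption: a temporary directory stands in for the "persisted" table of the report.
val path = Files.createTempDirectory("spark-42879").resolve("items").toString
val jsonStr =
  """{"items": [{"itemId": 1, "itemData1": "a", "itemData2": 11}, {"itemId": 2, "itemData1": "b", "itemData2": 22}]}"""
spark.read.json(Seq(jsonStr).toDS).write.mode("overwrite").parquet(path)

// Hypothetical helper: the schema each file scan asks the reader for,
// rendered in the same form as the ReadSchema line of `explain`.
def readSchemas(df: DataFrame): Seq[String] =
  df.queryExecution.sparkPlan.collect {
    case scan: FileSourceScanExec => scan.requiredSchema.catalogString
  }

val items = spark.read.parquet(path).select(explode(col("items")).as("item"))
val oneField = items.select(col("item.itemId"))
val twoFields = items.select(col("item.itemId"), col("item.itemData1"))

readSchemas(oneField).foreach(s => println(s"one field : $s"))
readSchemas(twoFields).foreach(s => println(s"two fields: $s"))

// True when the scan still requests itemData2 although the query never uses it.
val readsUnusedField = readSchemas(twoFields).exists(_.contains("itemData2"))
println(s"itemData2 read although unused: $readsUnusedField")
```

Going by the report, `readsUnusedField` would be expected to be `true` on the affected versions; the one-field query is included only as a point of comparison, and its result is not stated in the report.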