[ https://issues.apache.org/jira/browse/SPARK-48091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930987#comment-17930987 ]
Sakthi commented on SPARK-48091:
--------------------------------

Worth noting that the issue is fixed in the current main (master) branch:

Pyspark
{code:java}
>>> df.select(
...     F.transform(
...         'array2',
...         lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias"))
...     ).alias("new_array2")
... ).printSchema()
root
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true)

>>> df.select(
...     F.explode("array1").alias("exploded"),
...     F.transform(
...         'array2',
...         lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias"))
...     ).alias("new_array2")
... ).printSchema()
root
 |-- exploded: string (nullable = true)
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true)
{code}
Scala
{code:scala}
scala> import org.apache.spark.sql.functions._

scala> val df = Seq((1, Seq("a", "b"), Seq(2, 3, 4))).toDF("id", "array1", "array2")

scala> df.show(false)
+---+------+---------+
|id |array1|array2   |
+---+------+---------+
|1  |[a, b]|[2, 3, 4]|
+---+------+---------+

scala> var df2 = df.select(array(lit(1), lit(2), lit(3)).as("my_array"), array(lit(1), lit(2), lit(3)).as("my_array2"))

scala> df2.select(
     |   explode($"my_array").as("exploded"),
     |   transform($"my_array2", (x) => struct(x.as("data"))).as("my_struct")
     | ).printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- exploded: integer (nullable = false)
 |-- my_struct: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- data: integer (nullable = false)
{code}
Let me know if things still don't look okay.

> Using `explode` together with `transform` in the same select statement causes
> aliases in the transformed column to be ignored
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48091
>                 URL: https://issues.apache.org/jira/browse/SPARK-48091
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0, 3.5.0, 3.5.1
>         Environment: Scala 2.12.15, Python 3.10, 3.12, OSX 14.4 and
> Databricks DBR 13.3, 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1
>            Reporter: Ron Serruya
>            Priority: Minor
>              Labels: alias
>
> When using the `explode` function and the `transform` function in the same
> select statement, aliases used inside the transformed column are ignored.
> This behavior only occurs when using the PySpark and Scala APIs, not when
> using the SQL API.
>
> {code:java}
> from pyspark.sql import functions as F
>
> # Create the df
> df = spark.createDataFrame([
>     {"id": 1, "array1": ['a', 'b'], 'array2': [2, 3, 4]}
> ]){code}
> Good case, where all aliases are used:
> {code:java}
> df.select(
>     F.transform(
>         'array2',
>         lambda x: F.struct(x.alias("some_alias"),
>                            F.col("id").alias("second_alias"))
>     ).alias("new_array2")
> ).printSchema()
> root
>  |-- new_array2: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- some_alias: long (nullable = true)
>  |    |    |-- second_alias: long (nullable = true){code}
> Bad case: when using explode, the aliases inside the transformed column are
> ignored; `id` is kept instead of `second_alias`, and `x_17` is used instead
> of `some_alias`:
> {code:java}
> df.select(
>     F.explode("array1").alias("exploded"),
>     F.transform(
>         'array2',
>         lambda x: F.struct(x.alias("some_alias"),
>                            F.col("id").alias("second_alias"))
>     ).alias("new_array2")
> ).printSchema()
> root
>  |-- exploded: string (nullable = true)
>  |-- new_array2: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- x_17: long (nullable = true)
>  |    |    |-- id: long (nullable = true){code}
> The same happens in Scala:
> {code:scala}
> import org.apache.spark.sql.functions._
> var df2 = df.select(array(lit(1), lit(2), lit(3)).as("my_array"),
>   array(lit(1), lit(2), lit(3)).as("my_array2"))
> df2.select(
>   explode($"my_array").as("exploded"),
>   transform($"my_array2", (x) => struct(x.as("data"))).as("my_struct")
> ).printSchema
> {code}
> {noformat}
> root
>  |-- exploded: integer (nullable = false)
>  |-- my_struct: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- x_2: integer (nullable = false)
> {noformat}
>
> When using the SQL API instead, it works fine:
> {code:java}
> spark.sql(
>     """
>     select explode(array1) as exploded, transform(array2, x -> struct(x as
>     some_alias, id as second_alias)) as array2 from {df}
>     """, df=df
> ).printSchema()
> root
>  |-- exploded: string (nullable = true)
>  |-- array2: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- some_alias: long (nullable = true)
>  |    |    |-- second_alias: long (nullable = true){code}
>
> Workaround: for now, `F.named_struct` can be used instead of `F.struct`, as
> sketched below.
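For anyone still on the affected versions, here is a minimal sketch of the `F.named_struct` workaround mentioned above, assuming the same `df` and field names as in the reproduction. `F.named_struct` takes alternating literal field names and value columns, so the names are plain string literals rather than aliases, which is presumably why they survive here. Note that `named_struct` was only added to the Python functions API in Spark 3.5.0; on 3.4 the same expression can be written through `F.expr`.
{code:java}
from pyspark.sql import functions as F

# Same df as in the reproduction above.
df = spark.createDataFrame([
    {"id": 1, "array1": ['a', 'b'], 'array2': [2, 3, 4]}
])

# Workaround sketch: named_struct takes alternating (literal name, value)
# arguments, so the field names are string literals rather than aliases and
# are kept even when explode appears in the same select.
df.select(
    F.explode("array1").alias("exploded"),
    F.transform(
        "array2",
        lambda x: F.named_struct(
            F.lit("some_alias"), x,
            F.lit("second_alias"), F.col("id"),
        ),
    ).alias("new_array2"),
).printSchema()
{code}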