[ https://issues.apache.org/jira/browse/SPARK-48091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930987#comment-17930987 ]
Sakthi commented on SPARK-48091:
--------------------------------

Worth noting that the issue is fixed in the current main (master) branch:

Pyspark
{code:java}
>>> df.select(
...     F.transform(
...         'array2',
...         lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias"))
...     ).alias("new_array2")
... ).printSchema()
root
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true)

>>> df.select(
...     F.explode("array1").alias("exploded"),
...     F.transform(
...         'array2',
...         lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias"))
...     ).alias("new_array2")
... ).printSchema()
root
 |-- exploded: string (nullable = true)
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true)
{code}
Scala
{code:scala}
scala> import org.apache.spark.sql.functions._

scala> val df = Seq((1, Seq("a", "b"), Seq(2, 3, 4))).toDF("id", "array1", "array2")

scala> df.show(false)
+---+------+---------+
|id |array1|array2   |
+---+------+---------+
|1  |[a, b]|[2, 3, 4]|
+---+------+---------+

scala> var df2 = df.select(array(lit(1), lit(2), lit(3)).as("my_array"), array(lit(1), lit(2), lit(3)).as("my_array2"))

scala> df2.select(
     |   explode($"my_array").as("exploded"),
     |   transform($"my_array2", (x) => struct(x.as("data"))).as("my_struct")
     | ).printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- exploded: integer (nullable = false)
 |-- my_struct: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- data: integer (nullable = false)
{code}
Let me know if things still don't look okay.

> Using `explode` together with `transform` in the same select statement causes
> aliases in the transformed column to be ignored
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48091
>                 URL: https://issues.apache.org/jira/browse/SPARK-48091
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0, 3.5.0, 3.5.1
>         Environment: Scala 2.12.15, Python 3.10, 3.12, OSX 14.4 and
> Databricks DBR 13.3, 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1
>            Reporter: Ron Serruya
>            Priority: Minor
>              Labels: alias
>
> When using the `explode` function and the `transform` function in the same
> select statement, aliases used inside the transformed column are ignored.
> This behavior only occurs when using the PySpark and Scala APIs, not when
> using the SQL API.
>
> {code:java}
> from pyspark.sql import functions as F
>
> # Create the df
> df = spark.createDataFrame([
>     {"id": 1, "array1": ['a', 'b'], 'array2': [2, 3, 4]}
> ]){code}
> Good case, where all aliases are used:
> {code:java}
> df.select(
>     F.transform(
>         'array2',
>         lambda x: F.struct(x.alias("some_alias"),
>                            F.col("id").alias("second_alias"))
>     ).alias("new_array2")
> ).printSchema()
> root
>  |-- new_array2: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- some_alias: long (nullable = true)
>  |    |    |-- second_alias: long (nullable = true){code}
> Bad case: when using explode, the aliases inside the transformed column are
> ignored; `id` is kept instead of `second_alias`, and `x_17` is used instead
> of `some_alias`:
> {code:java}
> df.select(
>     F.explode("array1").alias("exploded"),
>     F.transform(
>         'array2',
>         lambda x: F.struct(x.alias("some_alias"),
>                            F.col("id").alias("second_alias"))
>     ).alias("new_array2")
> ).printSchema()
> root
>  |-- exploded: string (nullable = true)
>  |-- new_array2: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- x_17: long (nullable = true)
>  |    |    |-- id: long (nullable = true){code}
> The same happens in Scala:
> {code:scala}
> import org.apache.spark.sql.functions._
> var df2 = df.select(array(lit(1), lit(2), lit(3)).as("my_array"),
>   array(lit(1), lit(2), lit(3)).as("my_array2"))
> df2.select(
>   explode($"my_array").as("exploded"),
>   transform($"my_array2", (x) => struct(x.as("data"))).as("my_struct")
> ).printSchema
> {code}
> {noformat}
> root
>  |-- exploded: integer (nullable = false)
>  |-- my_struct: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- x_2: integer (nullable = false)
> {noformat}
>
> When using the SQL API instead, it works fine:
> {code:java}
> spark.sql(
>     """
>     select explode(array1) as exploded, transform(array2, x -> struct(x as
>     some_alias, id as second_alias)) as array2 from {df}
>     """, df=df
> ).printSchema()
> root
>  |-- exploded: string (nullable = true)
>  |-- array2: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- some_alias: long (nullable = true)
>  |    |    |-- second_alias: long (nullable = true){code}
>
> Workaround: for now, `F.named_struct` can be used instead of `F.struct`, as
> sketched below.
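For anyone still on the affected versions, here is a minimal sketch of the `F.named_struct` workaround mentioned above, assuming the same `df` and field names as in the reproduction. `F.named_struct` takes alternating literal field names and value columns, so the names are plain string literals rather than aliases, which is presumably why they survive here. Note that `named_struct` was only added to the Python functions API in Spark 3.5.0; on 3.4 the same expression can be written through `F.expr`.
{code:java}
from pyspark.sql import functions as F

# Same df as in the reproduction above.
df = spark.createDataFrame([
    {"id": 1, "array1": ['a', 'b'], 'array2': [2, 3, 4]}
])

# Workaround sketch: named_struct takes alternating (literal name, value)
# arguments, so the field names are string literals rather than aliases and
# are kept even when explode appears in the same select.
df.select(
    F.explode("array1").alias("exploded"),
    F.transform(
        "array2",
        lambda x: F.named_struct(
            F.lit("some_alias"), x,
            F.lit("second_alias"), F.col("id"),
        ),
    ).alias("new_array2"),
).printSchema()
{code}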