Hello everyone,
Consider this toy example:
import sparkSession.implicits._
import org.apache.spark.sql.functions.collect_set

case class Foo(x: String, y: String)
val df = sparkSession.createDataFrame(Seq(Foo(null, "1"), Foo("a", "2"), Foo("b", "3")))
df.select(collect_set($"x")).show
In Spark 2.0.0 I get the following results:
+--------------+
|collect_set(x)|
+--------------+
| [null, b, a]|
+--------------+
In 1.6.* the same collect_set produces:
+--------------+
|collect_set(x)|
+--------------+
| [b, a]|
+--------------+
Is there any way to get this aggregation to ignore nulls? I understand the
trivial approach would be to filter on x beforehand, but in my actual use case
I'm calling collect_set inside withColumn over a window specification, so I
can't just drop those rows; I want the rows with nulls to end up with empty
arrays, roughly as sketched below.
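To make that concrete, the windowed version looks roughly like this (the
partition column y here is just a stand-in for the real window spec):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_set

// collect the distinct x values within each partition of y
val w = Window.partitionBy($"y")
val windowed = df.withColumn("xs", collect_set($"x").over(w))
// with the 2.0.0 behaviour above, a partition whose only x is null
// presumably gets [null] in xs; I'd want an empty array there instead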
For now I'm using this hack of a workaround:
import org.apache.spark.sql.functions.udf

// drop nulls from the collected array after the fact
val removenulls = udf((l: scala.collection.mutable.WrappedArray[String]) =>
  l.filter(x => x != null))
df.select(removenulls(collect_set($"x"))).show
Any suggestions are appreciated.
Thanks,
Lee