Hi Yi,

I'm probably missing something, but can't we just wrap the udf with
`if (isnull(x)) null else udf(knownnotnull(x))`?
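
For context, a minimal sketch of why the primitive case goes wrong, plus a
user-level approximation of the wrapping idea above via the Column API (the
column name "x" and the `when`/`otherwise` form are my assumptions, not the
actual internal rewrite):

```scala
import org.apache.spark.sql.functions.{col, lit, udf, when}
import org.apache.spark.sql.types.IntegerType

// Why 3.0 returns 0 (plain Scala, no Spark needed to see this): when the
// lambda's parameter type is unknown, it is invoked through the erased
// Function1[AnyRef, AnyRef] interface, and Scala's unboxing maps a null
// Integer to the primitive default 0.
val f = (x: Int) => x
val g = f.asInstanceOf[AnyRef => AnyRef]
g(null)  // 0, not null

// A user-level version of the proposed null-guarding wrap:
val fUdf = udf((x: Int) => x, IntegerType)
val guarded = when(col("x").isNull, lit(null)).otherwise(fUdf(col("x")))
// selecting `guarded` would then yield null for null inputs, as in 2.4
```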

On Fri, Mar 13, 2020 at 6:22 PM wuyi <yi...@databricks.com> wrote:

> Hi all, I'd like to raise a discussion here about null handling for
> primitive types in the untyped Scala UDF [ udf(f: AnyRef, dataType: DataType) ].
>
> After we switched to Scala 2.12 in 3.0, the untyped Scala UDF is broken
> because we can no longer use reflection to get the parameter types of the
> Scala lambda.
> This leads to silent result changes. For example, with a UDF defined as
> `val f = udf((x: Int) => x, IntegerType)`, the query `select f($"x")`
> behaves differently between 2.4 and 3.0 when the input value of column x
> is null.
>
> Spark 2.4:  null
> Spark 3.0:  0
>
> Because of this, we deprecated the untyped Scala UDF in 3.0 and recommend
> that users use the typed ones. However, I recently identified several
> valid use cases, e.g., `val f = (r: Row) => Row(r.getAs[Int](0) * 2)`,
> where the schema cannot be detected by the typed Scala UDF [ udf[RT:
> TypeTag, A1: TypeTag](f: Function1[A1, RT]) ].
>
> There are 3 solutions:
> 1. find a way to get Scala lambda parameter types by reflection (I tried
> very hard but had no luck; the Java SAM type is too dynamic)
> 2. support case classes as input of the typed Scala UDF, so that at least
> people can still handle struct-type input columns with a UDF
> 3. add a new variant of the untyped Scala UDF that lets users specify the
> input types
>
> I'd like to see more feedback or ideas about how to move forward.
>
> Thanks,
> Yi Wu
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-- 
---
Takeshi Yamamuro
