Hi, Yi,

Probably I'm missing something, but can't we just wrap the udf with `if (isnull(x)) null else udf(knownnotnull(x))`?
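At the user level, the same null-short-circuit idea could be sketched roughly like this (a hypothetical sketch, not the internal rewrite: `knownnotnull` is a Catalyst-internal expression, so this approximates it with the public `when`/`otherwise` API; the column name "x" and the udf `f` are assumptions for illustration, and running it requires a SparkSession):

```scala
import org.apache.spark.sql.functions.{udf, when, col, lit}
import org.apache.spark.sql.types.IntegerType

// The deprecated untyped variant: a null input reaches the lambda as a
// primitive Int and silently becomes 0 on Spark 3.0.
val f = udf((x: Int) => x, IntegerType)

// User-level approximation of `if (isnull(x)) null else udf(knownnotnull(x))`:
// null inputs short-circuit to null and never reach the lambda.
val safeF = when(col("x").isNull, lit(null)).otherwise(f(col("x")))
```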
On Fri, Mar 13, 2020 at 6:22 PM wuyi <yi...@databricks.com> wrote:

> Hi all, I'd like to raise a discussion here about null-handling of
> primitive types in the untyped Scala UDF [ udf(f: AnyRef, dataType: DataType) ].
>
> After we switched to Scala 2.12 in 3.0, the untyped Scala UDF is broken
> because we can no longer use reflection to get the parameter types of the
> Scala lambda. This leads to silent result changes. For example, with a UDF
> defined as `val f = udf((x: Int) => x, IntegerType)`, the query
> `select f($"x")` behaves differently between 2.4 and 3.0 when the input
> value of column x is null:
>
> Spark 2.4: null
> Spark 3.0: 0
>
> Because of this, we deprecated the untyped Scala UDF in 3.0 and recommend
> that users use the typed ones. However, I recently identified several valid
> use cases, e.g., `val f = (r: Row) => Row(r.getAs[Int](0) * 2)`, where the
> schema cannot be detected in the typed Scala UDF
> [ udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]) ].
>
> There are 3 possible solutions:
> 1. Find a way to get Scala lambda parameter types by reflection (I tried
>    very hard but had no luck; the Java SAM type is too dynamic).
> 2. Support case classes as the input of the typed Scala UDF, so at least
>    people can still handle struct-type input columns with a UDF.
> 3. Add a new variant of the untyped Scala UDF where users can specify the
>    input types.
>
> I'd like to see more feedback or ideas about how to move forward.
>
> Thanks,
> Yi Wu

--
Takeshi Yamamuro