Hi all,
Running an expensive deterministic UDF that returns a complex type, and then
referencing that result multiple times, causes Spark to evaluate the UDF
multiple times per row. This has been reported and discussed before:
SPARK-18748 and SPARK-17728
import org.apache.spark.sql.functions.udf

// placeholder body standing in for some expensive computation
val f: Int => Array[Int] = i => Array(i, i + 1)
val udfF = udf(f)

df
  .select($"id", udfF($"id").as("array"))
  .select($"array"(0).as("array0"), $"array"(1).as("array1"))
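To make the double evaluation visible, here is a minimal self-contained
sketch; the local SparkSession and the LongAccumulator used as a call counter
are illustrative additions of mine (accumulator counts can be inflated by task
retries, but that does not matter for a demo):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// counts how often the UDF body actually runs
val calls = spark.sparkContext.longAccumulator("udf calls")
val f: Int => Array[Int] = { i => calls.add(1); Array(i, i + 1) }
val udfF = udf(f)

val df = Seq(1, 2, 3).toDF("id")
df
  .select($"id", udfF($"id").as("array"))
  .select($"array"(0).as("array0"), $"array"(1).as("array1"))
  .collect()

// on affected versions this prints 6, i.e. two evaluations per row
println(calls.value)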
A common approach to make Spark evaluate the UDF only once is to cache
the intermediate result right after projecting the UDF:
df
  .select($"id", udfF($"id").as("array"))
  .cache()
  .select($"array"(0).as("array0"), $"array"(1).as("array1"))
There are scenarios where this intermediate result is too big for the
cluster to cache. Caching solely to work around the double evaluation is
also bad design.
The best approach available today is to mark the UDF as non-deterministic.
Spark then plans the query so that the UDF is called only once per row,
which is exactly what you want:
val udfF = udf(f).asNondeterministic()
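Reusing the counter sketch from above (same assumed names), the difference
becomes observable:

calls.reset()
val udfFOnce = udf(f).asNondeterministic()

df
  .select($"id", udfFOnce($"id").as("array"))
  .select($"array"(0).as("array0"), $"array"(1).as("array1"))
  .collect()

// one evaluation per row
println(calls.value)  // 3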
*However, declaring a UDF non-deterministic when it clearly is
deterministic is counter-intuitive and makes your code harder to read.*
Spark should provide a better way to flag such a UDF. Calling it expensive
would be a better name here:
val udfF = udf(f).asExpensive()
I understand that deterministic is a notion that Expression provides, and
that there is no equivalent notion of expensive understood by the
optimizer. However, asExpensive() could simply set
ScalaUDF.udfDeterministic = deterministic && !expensive, which implements
the best available approach behind a better name.
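Until such an API exists, it can be emulated in user code. This is a
hypothetical shim (the RichUdf name is mine) that merely delegates to the
existing asNondeterministic():

import org.apache.spark.sql.expressions.UserDefinedFunction

// hypothetical extension method that gives asNondeterministic() the proposed name
implicit class RichUdf(val self: UserDefinedFunction) extends AnyVal {
  def asExpensive(): UserDefinedFunction = self.asNondeterministic()
}

val udfF = udf(f).asExpensive()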
What are your thoughts on asExpensive()?
Regards,
Enrico