[PR] [SPARK-43131][SQL][WIP] Add labels to identify UDFs [spark]

via GitHub Mon, 03 Feb 2025 09:35:11 -0800


amahussein opened a new pull request, #49775:
URL: https://github.com/apache/spark/pull/49775


   Signed-off-by: Ahmed Hussein <ahuss...@apache.org>
   
   This PR adds a label to UDF expressions so that they can be identified.
   
   Apache Spark's User-Defined Functions (UDFs) are critical for extending SQL 
functionality, but tracking their performance historically required manual 
effort. This PR introduces labels to UDF expressions for improved 
identification and performance analysis. Here's a structured breakdown:
   
   ### What changes were proposed in this pull request?
   
   - The proposed label format is `$name:UDF(${children.mkString(", ")})`. 
For example, a UDF named `plusOne` with input `5` renders as `plusOne:UDF(5)`.
   - According to the commit history, the explain string used to include a UDF 
label `s"UDF:$name"` at one point: 
https://github.com/apache/spark/commit/9f523d3192c71a728fd8a2a64f52bbc337f2f026 
It was later removed in: 
https://github.com/apache/spark/commit/fe3e34dda68fd54212df1dd01b8acb9a9bc6a0ad
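   
   As a minimal sketch of how the proposed format string renders, the snippet 
below applies the same interpolation to plain strings. `UdfLabelSketch` and 
`udfLabel` are hypothetical names used only for illustration; they are not part 
of Spark or of this patch, and the real change operates on expression children 
rather than on strings.
   
   ```scala
   object UdfLabelSketch {
     // Hypothetical helper (not a Spark API): builds the proposed label from a
     // UDF name and the string forms of its child expressions.
     def udfLabel(name: String, children: Seq[String]): String =
       s"$name:UDF(${children.mkString(", ")})"

     def main(args: Array[String]): Unit = {
       // Single argument, matching the example above.
       println(udfLabel("plusOne", Seq("5")))  // plusOne:UDF(5)
       // Nested calls compose because each child is rendered first.
       println(udfLabel("myUdf1", Seq(udfLabel("myUdf2", Seq("1")))))  // myUdf1:UDF(myUdf2:UDF(1))
     }
   }
   ```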
   
   
   ### Why are the changes needed?
   
   - Purpose of UDF Labeling
     - Performance Analysis: Labels allow automated extraction of UDF usage 
metrics from Spark event logs, enabling cluster performance evaluation and 
optimization opportunities.
     - Debugging/Observability: Identifies UDFs in query plans and logs (e.g., 
EXPLAIN output), helping diagnose non-deterministic behavior or bottlenecks.
     - Version/Proprietary Tracking: Distinguishes custom UDFs from built-in 
functions, especially when using version-specific or proprietary implementations
   
   - Use Cases Enabled
     - Event Log Analysis: Aggregate UDF execution times across jobs to 
identify slow functions.
     - Query Plan Optimization: Detect redundant UDF calls in Spark SQL plans.
     - Compliance Auditing: Track usage of deprecated or unauthorized UDFs
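   
   To illustrate how such labels could be picked up by tooling, here is a 
rough sketch that scans a plan or explain string for the proposed 
`name:UDF(` pattern and counts occurrences per UDF name. The object and method 
names are hypothetical, the regex assumes UDF names consist of word characters 
only, and real event-log parsing would need to read the plan text from the log 
first.
   
   ```scala
   object UdfLabelScan {
     // Assumption: labels follow the proposed `name:UDF(` shape and names
     // contain only letters, digits, and underscores.
     private val LabelPattern = """(\w+):UDF\(""".r

     // Returns UDF names in order of appearance, including nested calls.
     def udfNames(planText: String): List[String] =
       LabelPattern.findAllMatchIn(planText).map(_.group(1)).toList

     // Counts how often each UDF name appears, e.g. to spot repeated calls.
     def countByName(planText: String): Map[String, Int] =
       udfNames(planText).groupBy(identity).map { case (name, hits) => name -> hits.size }

     def main(args: Array[String]): Unit = {
       println(udfNames("myUdf1:UDF(myUdf2:UDF(1))"))  // List(myUdf1, myUdf2)
       println(countByName("plusOne:UDF(a) + plusOne:UDF(a)"))  // Map(plusOne -> 2)
     }
   }
   ```
   
   A label with a fixed marker such as `:UDF(` keeps this kind of scan simple, 
which is one reason the exact format matters for downstream tools.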
   
   ### WIP Considerations
   
   Community feedback is requested on the following:
   
   - Label format preferences:
     - Alternatives such as `UDF:$name(...)` vs. `$name:UDF(...)`; or
     - Whether to include the type of the UDF: `ScalaUDF`, `HiveUDF`, `PyUDF`, etc.
   - Whether to include additional metadata (e.g., determinism flags).
   - Impact on column names:
     - Ensure labels don’t conflict with existing column aliases.
     - Avoid exposing sensitive UDF logic in production logs.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the UDF expression strings appear in a different format:
   
   ```scala
       // Register two scalar UDFs under the names used in the query below.
       val udf1 = spark.udf.register("myUdf1", (n: Int) => n + 1)
       val udf2 = spark.udf.register("myUdf2", (n: Int) => n * 1)
       val df = spark.sql("SELECT myUdf1(myUdf2(1))")
       df.show()
   ```
   
   | myUdf1:UDF(myUdf2:UDF(1))|
   |--------|
   | 2 |
   
   
   Before this change, it showed up as:
   
   | myUdf1(myUdf2(1))|
   |--------|
   | 2 |
   
   
   ### How was this patch tested?
   
   Updated the unit tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

