amahussein opened a new pull request, #49775: URL: https://github.com/apache/spark/pull/49775
Signed-off-by: Ahmed Hussein <ahuss...@apache.org>

This PR adds a label to UDF expressions so that they can be identified. Apache Spark's User-Defined Functions (UDFs) are critical for extending SQL functionality, but tracking their performance has historically required manual effort. Labeling UDF expressions makes them easier to identify and to analyze for performance. Here's a structured breakdown:

### What changes were proposed in this pull request?

- The proposed label format: `$name:UDF(${children.mkString(", ")})`
  Example: `plusOne:UDF(5)` for a UDF named `plusOne` with input `5`
- According to the commit history, the explain string used to include a UDF label `s"UDF:$name"` at some point (https://github.com/apache/spark/commit/9f523d3192c71a728fd8a2a64f52bbc337f2f026); it was later removed in https://github.com/apache/spark/commit/fe3e34dda68fd54212df1dd01b8acb9a9bc6a0ad

### Why are the changes needed?

- Purpose of UDF Labeling
  - Performance Analysis: Labels allow automated extraction of UDF usage metrics from Spark event logs, which supports cluster performance evaluation and helps surface optimization opportunities.
  - Debugging/Observability: Identifies UDFs in query plans and logs (e.g., EXPLAIN output), helping diagnose non-deterministic behavior or bottlenecks.
  - Version/Proprietary Tracking: Distinguishes custom UDFs from built-in functions, especially when using version-specific or proprietary implementations.
- Use Cases Enabled
  - Event Log Analysis: Aggregate UDF execution times across jobs to identify slow functions.
  - Query Plan Optimization: Detect redundant UDF calls in Spark SQL plans.
  - Compliance Auditing: Track usage of deprecated or unauthorized UDFs.

### WIP Consideration

Get community feedback regarding:

- Label Format Preferences:
  - Alternatives like `UDF:$name(...)` vs. `name:UDF(...)`; or
  - Including the type of the UDF: `ScalaUDF`, `HiveUDF`, `PyUDF`, etc.
  - Including additional metadata (e.g., determinism flags).
- Impact on Column Names:
  - Ensure labels don’t conflict with existing column aliases.
  - Avoid exposing sensitive UDF logic in production logs.

### Does this PR introduce _any_ user-facing change?

Yes, the UDF expression strings appear in a different format:

```scala
// Register two simple UDFs under the names used in the query below
val udf1 = spark.udf.register("myUdf1", (n: Int) => n + 1)
val udf2 = spark.udf.register("myUdf2", (n: Int) => n * 1)
val df = spark.sql("SELECT myUdf1(myUdf2(1))")
df.show()
```

| myUdf1:UDF(myUdf2:UDF(1)) |
|--------|
| 2 |

Originally, it showed up as:

| myUdf1(myUdf2(1)) |
|--------|
| 2 |

### How was this patch tested?

Updated the unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.
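As a rough illustration of the proposed label format `$name:UDF(...)`, and of how such labels might be pulled out of plan or log text for the event-log analysis use case, here is a minimal Scala sketch. `UdfLabelSketch`, `udfLabel`, and `udfNames` are hypothetical names used only for illustration and are not part of Spark or of this PR. The regex assumes UDF names are plain identifiers.

```scala
// Illustrative sketch only: these helpers are hypothetical and not Spark APIs.
object UdfLabelSketch {

  // Builds a label in the proposed format: $name:UDF(${children.mkString(", ")})
  def udfLabel(name: String, children: Seq[String]): String =
    s"$name:UDF(${children.mkString(", ")})"

  // Matches the "<name>:UDF(" prefix of a label.
  // Assumption: UDF names are simple identifiers (letters, digits, underscores).
  private val LabelPattern = """([A-Za-z_][A-Za-z0-9_]*):UDF\(""".r

  // Returns the names of all labeled UDFs found in the given text, in order of appearance.
  def udfNames(text: String): List[String] =
    LabelPattern.findAllMatchIn(text).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    val inner = udfLabel("myUdf2", Seq("1"))
    val outer = udfLabel("myUdf1", Seq(inner))
    println(outer)                          // prints myUdf1:UDF(myUdf2:UDF(1))
    println(udfNames(outer).mkString(", ")) // prints myUdf1, myUdf2
  }
}
```

Because the label carries a fixed `:UDF(` marker, a simple pattern match is enough to tell UDF calls apart from built-in functions in this sketch. A different label format, such as the `UDF:$name(...)` alternative under discussion, would need a correspondingly different pattern.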