Hi all, I’d like to initiate a discussion about adding UDF metrics.

*Motivation*

User-defined functions (UDFs) are essential for custom logic in Flink jobs but often act as black boxes, making debugging and performance tuning difficult. When issues like high latency or frequent exceptions occur, it's hard to pinpoint the root cause inside UDFs.

Flink currently lacks built-in metrics for key UDF aspects such as per-record processing time or exception count. This limits observability and complicates:

- Debugging production issues
- Performance tuning and resource allocation
- Supplying reliable signals to autoscaling systems

Introducing standard, opt-in UDF metrics will improve platform observability and overall health.

Here’s the proposal document: Link <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>

Your feedback and ideas are welcome to refine this feature.

Thanks,
Weiqing