I just have some questions:

1. The current metrics hierarchy shows that the UDF metric group belongs to the TaskMetricGroup. I think it would be better for the UDF metric group to belong to the OperatorMetricGroup instead, because a UDF might be used by multiple operators (see the sketch after these questions).
2. What are the naming conventions for UDF metrics? Could you provide an example? Does the metric name contain the UDF name?
3. Why is the UDFExceptionCount metric introduced? If a UDF throws an exception, the job fails immediately, so why do we need to track this value?
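Regarding question 1: if I understand the current behavior correctly, metrics that a UDF registers itself through FunctionContext already end up under the enclosing operator's metric group, which is why anchoring the proposed UDF metric group under the OperatorMetricGroup seems more consistent to me. A minimal sketch of that existing pattern (the UDF class and metric name below are made up for illustration):

import org.apache.flink.metrics.Counter;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

// Hypothetical UDF for illustration; "myUdfInvocations" is a made-up metric name.
public class UpperCaseUdf extends ScalarFunction {

    private transient Counter invocations;

    @Override
    public void open(FunctionContext context) throws Exception {
        // getMetricGroup() exposes the enclosing operator's metric group,
        // so the counter is reported under the operator scope, e.g.
        // <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>.myUdfInvocations
        invocations = context.getMetricGroup().counter("myUdfInvocations");
    }

    public String eval(String input) {
        invocations.inc();
        return input == null ? null : input.toUpperCase();
    }
}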
Best,
Shengkai

Weiqing Yang <yangweiqing...@gmail.com> wrote on Wednesday, July 9, 2025 at 12:59:

> Hi all,
>
> I’d like to initiate a discussion about adding UDF metrics.
>
> *Motivation*
>
> User-defined functions (UDFs) are essential for custom logic in Flink jobs
> but often act as black boxes, making debugging and performance tuning
> difficult. When issues like high latency or frequent exceptions occur, it's
> hard to pinpoint the root cause inside UDFs.
>
> Flink currently lacks built-in metrics for key UDF aspects such as
> per-record processing time or exception count. This limits observability
> and complicates:
>
> - Debugging production issues
> - Performance tuning and resource allocation
> - Supplying reliable signals to autoscaling systems
>
> Introducing standard, opt-in UDF metrics will improve platform
> observability and overall health.
>
> Here’s the proposal document: Link
> <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
>
> Your feedback and ideas are welcome to refine this feature.
>
> Thanks,
> Weiqing