Hi Shengkai,

Thanks for reviewing the proposal and for the thoughtful feedback.
1. Metric hierarchy - Makes sense. I’ve updated the proposal to scope the
UDF metric group under OperatorMetricGroup.
2. Naming convention - UDF metrics will follow this pattern:
<operator_name>.<udf_name>.<metric_name>. For example, a UDF named
enrichUser in a MapOperator would have
mapOperator.enrichUser.UDFprocessingTime. I've clarified this in the doc.
3. UDFExceptionCount purpose - This metric signals to autosizer/
auto-remediation systems (and users) that failures come from user code
(the UDF), not the platform. For example, in a Flink SQL job, if
UDFExceptionCount > 0 in the past minute, that flags user-side errors and
helps avoid unnecessary retries or scaling. It also captures “soft” errors
from try/catch blocks in UDFs that don’t fail the job but degrade data
quality.

Let me know if you have any further questions!

Best,
Weiqing

On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:

> I just have some questions:
>
> 1. The current metrics hierarchy shows that the UDF metric group belongs
> to the TaskMetricGroup. I think it would be better for the UDF metric
> group to belong to the OperatorMetricGroup instead, because a UDF might
> be used by multiple operators.
> 2. What are the naming conventions for UDF metrics? Could you provide an
> example? Does the metric name contain the UDF name?
> 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
> exception, the job fails immediately. Why do we need to track this value?
>
> Best,
> Shengkai
>
>
> Weiqing Yang <yangweiqing...@gmail.com> wrote on Wed, Jul 9, 2025 at 12:59:
>
> > Hi all,
> >
> > I’d like to initiate a discussion about adding UDF metrics.
> >
> > *Motivation*
> >
> > User-defined functions (UDFs) are essential for custom logic in Flink
> > jobs but often act as black boxes, making debugging and performance
> > tuning difficult. When issues like high latency or frequent exceptions
> > occur, it's hard to pinpoint the root cause inside UDFs.
> >
> > Flink currently lacks built-in metrics for key UDF aspects such as
> > per-record processing time or exception count. This limits
> > observability and complicates:
> >
> > - Debugging production issues
> > - Performance tuning and resource allocation
> > - Supplying reliable signals to autoscaling systems
> >
> > Introducing standard, opt-in UDF metrics will improve platform
> > observability and overall health.
> >
> > Here’s the proposal document: Link
> > <
> > https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1
> > >
> >
> > Your feedback and ideas are welcome to refine this feature.
> >
> > Thanks,
> > Weiqing
> >
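P.S. To make the proposed scoping and the UDFExceptionCount semantics
concrete, here is a minimal, self-contained Python sketch. The classes are
illustrative stand-ins (not Flink's actual MetricGroup API); the group and
metric names follow the convention discussed above.

```python
# Illustrative sketch only: a toy metric-group tree showing (a) the UDF
# group nested under the operator group rather than the task group, and
# (b) UDFExceptionCount counting "soft" errors caught inside the UDF.

class MetricGroup:
    """Stand-in for a hierarchical metric group (not Flink's API)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.counters = {}

    def add_group(self, name):
        # Child group nested under this one, e.g. the per-UDF group.
        return MetricGroup(name, parent=self)

    def counter(self, name):
        # Register a counter starting at zero; return its name as a handle.
        self.counters[name] = 0
        return name

    def inc(self, name):
        self.counters[name] += 1

    def scope(self):
        # Full dotted scope, e.g. "mapOperator.enrichUser".
        parts = []
        group = self
        while group is not None:
            parts.append(group.name)
            group = group.parent
        return ".".join(reversed(parts))


# UDF group scoped under the operator group, per the proposal.
operator_group = MetricGroup("mapOperator")
udf_group = operator_group.add_group("enrichUser")
exception_count = udf_group.counter("UDFExceptionCount")

# A "soft" error caught inside the UDF increments the counter without
# failing the job.
try:
    raise ValueError("bad user record")
except ValueError:
    udf_group.inc(exception_count)

# Fully scoped metric id: <operator_name>.<udf_name>.<metric_name>
print(udf_group.scope() + "." + exception_count)
# -> mapOperator.enrichUser.UDFExceptionCount
print(udf_group.counters[exception_count])
# -> 1
```

An external autosizer would watch the counter value (here 1) rather than
the job status, distinguishing user-code failures from platform failures.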