I just have some questions:

1. The current metrics hierarchy shows that the UDF metric group belongs to
the TaskMetricGroup. I think it would be better for the UDF metric group to
belong to the OperatorMetricGroup instead: a UDF might be used by multiple
operators within the same task, and attaching the group to the operator
would keep each operator's UDF metrics separate.
2. What are the naming conventions for UDF metrics? Could you provide an
example? Does the metric name contain the UDF name? (For context, I've put
a sketch after these questions of how UDFs can register metrics today.)
3. Why is the UDFExceptionCount metric being introduced? If a UDF throws an
exception, the job fails immediately, so why do we need to track this value?
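
For reference on (2), here is a minimal sketch of how a UDF can already
register its own metrics through FunctionContext today. The class name and
the metric name are hypothetical; this is only meant to anchor the naming
question, not to describe the proposal's API.

import org.apache.flink.metrics.Counter;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

// Hypothetical UDF that counts exceptions thrown by eval() before
// rethrowing them.
public class UpperCaseUdf extends ScalarFunction {

    private transient Counter exceptionCounter;

    @Override
    public void open(FunctionContext context) throws Exception {
        // FunctionContext exposes the metric group of the operator that
        // runs this UDF; user code registers custom metrics against it.
        exceptionCounter = context.getMetricGroup().counter("udfExceptions");
    }

    public String eval(String input) {
        try {
            return input == null ? null : input.toUpperCase();
        } catch (RuntimeException e) {
            exceptionCounter.inc();
            throw e;
        }
    }
}

If I remember the default scope formats correctly, such a counter ends up
under the operator scope (metrics.scope.operator:
<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>),
with no UDF name anywhere in the identifier, which is why I'm asking
whether the proposed metric names would embed it.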

Best
Shengkai


Weiqing Yang <yangweiqing...@gmail.com> wrote on Wed, Jul 9, 2025, at 12:59:

> Hi all,
>
> I’d like to initiate a discussion about adding UDF metrics.
>
> *Motivation*
>
> User-defined functions (UDFs) are essential for custom logic in Flink jobs
> but often act as black boxes, making debugging and performance tuning
> difficult. When issues like high latency or frequent exceptions occur, it's
> hard to pinpoint the root cause inside UDFs.
>
> Flink currently lacks built-in metrics for key UDF aspects such as
> per-record processing time or exception count. This limits observability
> and complicates:
>
>    - Debugging production issues
>    - Performance tuning and resource allocation
>    - Supplying reliable signals to autoscaling systems
>
> Introducing standard, opt-in UDF metrics will improve platform
> observability and overall health.
> Here’s the proposal document:
> https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1
>
> Your feedback and ideas to refine this feature are welcome.
>
>
> Thanks,
> Weiqing
>
