Hi Shengkai,

Thanks for reviewing the proposal and for the thoughtful feedback.

1. Metric hierarchy - Makes sense. I've updated the proposal to scope the
   UDF metric group under OperatorMetricGroup.

2. Naming convention - UDF metrics will follow the pattern
   <operator_name>.<udf_name>.<metric_name>. For example, a UDF named
   enrichUser in a MapOperator would expose
   mapOperator.enrichUser.UDFprocessingTime. I've clarified this in the doc.

3. UDFExceptionCount purpose - This metric signals to
   autosizer/auto-remediation systems (and users) that failures come from
   user code (the UDF), not the platform. For example, in a Flink SQL job,
   if UDFExceptionCount > 0 in the past minute, it flags user-side errors
   and helps avoid unnecessary retries or scaling. It also captures "soft"
   errors caught by try/catch inside UDFs that don't fail the job but
   degrade data quality.
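
To make points 2 and 3 concrete, here is a minimal self-contained sketch
(not the actual Flink API from the proposal; the class, field, and method
names below are illustrative assumptions) showing the proposed metric name
pattern and how a "soft" error caught inside a UDF would still bump an
exception counter:

```java
// Illustrative sketch only: names like UdfMetricsSketch, metricName, and
// udfExceptionCount are hypothetical, not part of Flink or the proposal doc.
public class UdfMetricsSketch {

    // Builds the proposed scope: <operator_name>.<udf_name>.<metric_name>
    static String metricName(String operator, String udf, String metric) {
        return operator + "." + udf + "." + metric;
    }

    // Stand-in for the proposed UDFExceptionCount counter.
    static long udfExceptionCount = 0;

    // A UDF that swallows errors: the job keeps running, but the counter
    // records that user code failed on this record ("soft" error).
    static String enrichUser(String userId) {
        try {
            if (userId == null) {
                throw new IllegalArgumentException("missing user id");
            }
            return userId.toUpperCase();
        } catch (Exception e) {
            udfExceptionCount++; // surfaced to autosizers instead of being invisible
            return "UNKNOWN";
        }
    }

    public static void main(String[] args) {
        System.out.println(
            metricName("mapOperator", "enrichUser", "UDFprocessingTime"));
    }
}
```

With this shape, an autosizer can distinguish "the UDF is throwing" from
"the platform is unhealthy" without the job ever failing.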

Let me know if you have any further questions!

Best,
Weiqing

On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:

> I just have some questions:
>
> 1. The current metrics hierarchy shows that the UDF metric group belongs to
> the TaskMetricGroup. I think it would be better for the UDF metric group to
> belong to the OperatorMetricGroup instead, because a UDF might be used by
> multiple operators.
> 2. What are the naming conventions for UDF metrics? Could you provide an
> example? Does the metric name contain the UDF name?
> 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
> exception, the job fails immediately. Why do we need to track this value?
>
> Best
> Shengkai
>
>
> Weiqing Yang <yangweiqing...@gmail.com> 于2025年7月9日周三 12:59写道:
>
> > Hi all,
> >
> > I’d like to initiate a discussion about adding UDF metrics.
> >
> > *Motivation*
> >
> > User-defined functions (UDFs) are essential for custom logic in Flink
> jobs
> > but often act as black boxes, making debugging and performance tuning
> > difficult. When issues like high latency or frequent exceptions occur,
> it's
> > hard to pinpoint the root cause inside UDFs.
> >
> > Flink currently lacks built-in metrics for key UDF aspects such as
> > per-record processing time or exception count. This limits observability
> > and complicates:
> >
> >    - Debugging production issues
> >    - Performance tuning and resource allocation
> >    - Supplying reliable signals to autoscaling systems
> >
> > Introducing standard, opt-in UDF metrics will improve platform
> > observability and overall health.
> > Here’s the proposal document: Link
> > <
> >
> https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1
> > >
> >
> > Your feedback and ideas are welcome to refine this feature.
> >
> >
> > Thanks,
> > Weiqing
> >
>
