Hi Weiqing,

From your doc, the entry point for UDF calls in the codegen is
ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen, which
could be instrumented with metrics. This works well for synchronous calls,
but what about ASYNC_SCALAR and the soon-to-be-merged ASYNC_TABLE (
https://github.com/apache/flink/pull/26567)? Timing metrics would only
account for the time it takes to call invokeAsync, not for the time until
the future completes (with a result or an error).

There are appropriate places to handle the async callbacks, but they are
outside that codegen path. Will you be able to support those as well?
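
To make the concern concrete, here is a minimal sketch in plain Java. It uses
no Flink classes, and the names AsyncUdfTiming, Timings, timed and slowLookup
are hypothetical stand-ins, not anything from the proposal or the Flink
codebase. It only shows that a timer wrapped around the async call itself
records close to nothing, while a callback attached to the returned future
records the real latency and any error:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class AsyncUdfTiming {

    /** Hypothetical holder for the measurements (nanoseconds, -1 = not set). */
    static final class Timings {
        final AtomicLong callNanos = new AtomicLong(-1);
        final AtomicLong completionNanos = new AtomicLong(-1);
        final AtomicLong errors = new AtomicLong();
    }

    /** Times both the async call itself and the completion of its future. */
    static <T> CompletableFuture<T> timed(
            Supplier<CompletableFuture<T>> asyncCall, Timings timings) {
        long start = System.nanoTime();
        CompletableFuture<T> future = asyncCall.get();
        // This is all a timer around the call site can observe.
        timings.callNanos.set(System.nanoTime() - start);
        // The real latency is only known once the future completes.
        return future.whenComplete((result, error) -> {
            timings.completionNanos.set(System.nanoTime() - start);
            if (error != null) {
                timings.errors.incrementAndGet();
            }
        });
    }

    /** Stand-in for an async UDF: completes roughly 200 ms after the call. */
    static CompletableFuture<String> slowLookup(String key) {
        return CompletableFuture.supplyAsync(
                () -> "value-" + key,
                CompletableFuture.delayedExecutor(200, TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) {
        Timings timings = new Timings();
        String result = timed(() -> slowLookup("a"), timings).join();
        System.out.println("result = " + result);
        System.out.println("call took (ms): "
                + TimeUnit.NANOSECONDS.toMillis(timings.callNanos.get()));
        System.out.println("completion took (ms): "
                + TimeUnit.NANOSECONDS.toMillis(timings.completionNanos.get()));
    }
}
```

Under these assumptions the second number is the one a latency metric should
report, and the callback is also the only place where a failed future can be
counted without the exception surfacing at the call site.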

Thanks,
Alan

On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:

> I just have some questions:
>
> 1. The current metrics hierarchy shows that the UDF metric group belongs to
> the TaskMetricGroup. I think it would be better for the UDF metric group to
> belong to the OperatorMetricGroup instead, because a UDF might be used by
> multiple operators.
> 2. What are the naming conventions for UDF metrics? Could you provide an
> example? Does the metric name contain the UDF name?
> 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
> exception, the job fails immediately. Why do we need to track this value?
>
> Best
> Shengkai
>
>
> On Wed, Jul 9, 2025 at 12:59, Weiqing Yang <yangweiqing...@gmail.com> wrote:
>
> > Hi all,
> >
> > I’d like to initiate a discussion about adding UDF metrics.
> >
> > *Motivation*
> >
> > User-defined functions (UDFs) are essential for custom logic in Flink jobs
> > but often act as black boxes, making debugging and performance tuning
> > difficult. When issues like high latency or frequent exceptions occur, it's
> > hard to pinpoint the root cause inside UDFs.
> >
> > Flink currently lacks built-in metrics for key UDF aspects such as
> > per-record processing time or exception count. This limits observability
> > and complicates:
> >
> >    - Debugging production issues
> >    - Performance tuning and resource allocation
> >    - Supplying reliable signals to autoscaling systems
> >
> > Introducing standard, opt-in UDF metrics will improve platform
> > observability and overall health.
> > Here’s the proposal document:
> > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
> >
> > Your feedback and ideas are welcome to refine this feature.
> >
> >
> > Thanks,
> > Weiqing
> >
>
