Re: Re: [DISCUSS] Add UDF Metrics

Zakelly Lan Thu, 21 Aug 2025 00:24:52 -0700

Hi Weiqing,

Sorry for the late reply. I have one question:


I'm wondering whether the UDF processing time is measured for every
individual UDF invocation, with the average then reported, or if sampling
is used instead? I'm concerned about the potential overhead if we measure
every single invocation. We've encountered similar performance issues when
implementing state latency tracking [1].
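
To make the sampling alternative concrete, here is a minimal sketch. It is
not taken from the FLIP or from Flink's code base: the class name
SampledUdfTimer, its methods, and the fixed "every N-th call" policy are all
hypothetical, and a real implementation would report into Flink's metric
system rather than keep plain counters. The idea is simply that only one in
N invocations pays for the two System.nanoTime() calls:

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not a Flink API): times only every N-th invocation
 * of a wrapped call, so most invocations skip the clock reads entirely.
 * Not thread-safe; assumes one instance per operator thread.
 */
final class SampledUdfTimer {
    private final int sampleInterval;
    private long invocations;
    private long sampledCount;
    private long sampledNanosTotal;

    SampledUdfTimer(int sampleInterval) {
        if (sampleInterval <= 0) {
            throw new IllegalArgumentException("sampleInterval must be positive");
        }
        this.sampleInterval = sampleInterval;
    }

    /** Runs the call; measures its duration only on sampled invocations. */
    <T> T invoke(Supplier<T> udfCall) {
        boolean sampled = (invocations++ % sampleInterval) == 0;
        if (!sampled) {
            return udfCall.get();
        }
        long start = System.nanoTime();
        try {
            return udfCall.get();
        } finally {
            // Recorded even if the call throws, so failing calls are timed too.
            sampledNanosTotal += System.nanoTime() - start;
            sampledCount++;
        }
    }

    long getInvocations() {
        return invocations;
    }

    long getSampledCount() {
        return sampledCount;
    }

    /** Average duration over the sampled invocations only, in nanoseconds. */
    double getAverageSampledNanos() {
        return sampledCount == 0 ? 0.0 : (double) sampledNanosTotal / sampledCount;
    }
}
```

With a sample interval of 100, for example, 1000 invocations would produce
10 timed samples, and the reported average would be an estimate based on
those samples rather than on every call.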


[1] https://issues.apache.org/jira/browse/FLINK-16444

Best,
Zakelly

On Fri, Aug 15, 2025 at 5:04 AM Weiqing Yang <[email protected]>
wrote:

> Cool - I’ll proceed to start the VOTE.
> Thanks!
>
> Weiqing
>
> On Thu, Aug 14, 2025 at 12:53 AM Shengkai Fang <[email protected]> wrote:
>
> > I don't have any more comments.
> >
> > Best,
> > Shengkai
> >
> > On Thu, Aug 14, 2025 at 14:47, Weiqing Yang <[email protected]> wrote:
> >
> > > Thanks, Shengkai. I’ve updated the proposal doc with the recommended
> > > configuration name. Please let me know if you have any additional
> > > feedback.
> > >
> > > Best,
> > > Weiqing
> > >
> > > On Wed, Aug 13, 2025 at 6:58 PM Shengkai Fang <[email protected]> wrote:
> > >
> > > > Sorry for the late response. I prefer to use
> > > > `table.exec.udf-metric-enabled` as the option name.
> > > >
> > > > Best,
> > > > Shengkai
> > > >
> > > > On Wed, Aug 13, 2025 at 23:54, Weiqing Yang <[email protected]> wrote:
> > > >
> > > > > Hi Shengkai, Alan, Xuyang, and all,
> > > > >
> > > > > Since there have been no further objections, I’ll proceed to start the
> > > > > VOTE on this proposal shortly.
> > > > >
> > > > > Thanks,
> > > > > Weiqing
> > > > >
> > > > > On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <[email protected]> wrote:
> > > > >
> > > > > > Hi Shengkai, Alan and Xuyang,
> > > > > >
> > > > > > Just checking in - do you have any concerns or feedback?
> > > > > >
> > > > > > If there are no further objections from anyone, I’ll mark the FLIP as
> > > > > > ready for voting.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Weiqing
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <[email protected]> wrote:
> > > > > >
> > > > > >> Hi Xuyang,
> > > > > >>
> > > > > >> Thank you for reviewing the proposal!
> > > > > >>
> > > > > >> I’m planning to use: *udf.metrics.process-time* and
> > > > > >> *udf.metrics.exception-count*. These follow the naming convention used
> > > > > >> in Flink (e.g., RocksDB native metrics
> > > > > >> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
> > > > > >> I’ve added these names to the proposal doc.
> > > > > >>
> > > > > >> Alternatively, I also considered: *metrics.udf.process-time.enabled* and
> > > > > >> *metrics.udf.exception-count.enabled*.
> > > > > >>
> > > > > >> Happy to hear any feedback on which style might be more appropriate.
> > > > > >>
> > > > > >>
> > > > > >> Best,
> > > > > >> Weiqing
> > > > > >>
> > > > > >> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <[email protected]> wrote:
> > > > > >>
> > > > > >>> Hi, Weiqing.
> > > > > >>>
> > > > > >>> Thanks for driving this improvement. I have just one question. I noticed
> > > > > >>> that a new configuration option is introduced in this FLIP, and I wonder
> > > > > >>> what its name is. Could you please include the full name of this
> > > > > >>> option (similar to the other names in MetricOptions)?
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>>
> > > > > >>>     Best！
> > > > > >>>     Xuyang
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> At 2025-07-13 12:03:59, "Weiqing Yang" <[email protected]> wrote:
> > > > > >>> >Hi Alan,
> > > > > >>> >
> > > > > >>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
> > > > > >>> >work.
> > > > > >>> >
> > > > > >>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE.
> > > > > >>> >For async UDFs, the plan is to instrument both the invokeAsync() call and
> > > > > >>> >the async callback handler to measure the full end-to-end latency until the
> > > > > >>> >result or error is returned from the future.
> > > > > >>> >
> > > > > >>> >Let me know if you have any further questions or suggestions.
> > > > > >>> >
> > > > > >>> >Best,
> > > > > >>> >Weiqing
> > > > > >>> >
> > > > > >>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
> > > > > >>> ><[email protected]> wrote:
> > > > > >>> >
> > > > > >>> >> Hi Weiqing,
> > > > > >>> >>
> > > > > >>> >> From your doc, the entrypoint for UDF calls in the codegen is
> > > > > >>> >> ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen, which
> > > > > >>> >> could be instrumented with metrics.  This works well for synchronous calls,
> > > > > >>> >> but what about ASYNC_SCALAR and the soon-to-be-merged ASYNC_TABLE
> > > > > >>> >> (https://github.com/apache/flink/pull/26567)?  Timing metrics would only
> > > > > >>> >> account for what it takes to call invokeAsync, not for the result to
> > > > > >>> >> complete (with a result or error from the future object).
> > > > > >>> >>
> > > > > >>> >> There are appropriate places which can handle the async callbacks, but they
> > > > > >>> >> are in other locations.  Will you be able to support those as well?
> > > > > >>> >>
> > > > > >>> >> Thanks,
> > > > > >>> >> Alan
> > > > > >>> >>
> > > > > >>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <[email protected]> wrote:
> > > > > >>> >>
> > > > > >>> >> > I just have some questions:
> > > > > >>> >> >
> > > > > >>> >> > 1. The current metrics hierarchy shows that the UDF metric group belongs to
> > > > > >>> >> > the TaskMetricGroup. I think it would be better for the UDF metric group to
> > > > > >>> >> > belong to the OperatorMetricGroup instead, because a UDF might be used by
> > > > > >>> >> > multiple operators.
> > > > > >>> >> > 2. What are the naming conventions for UDF metrics? Could you provide an
> > > > > >>> >> > example? Does the metric name contain the UDF name?
> > > > > >>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
> > > > > >>> >> > exception, the job fails immediately. Why do we need to track this value?
> > > > > >>> >> >
> > > > > >>> >> > Best
> > > > > >>> >> > Shengkai
> > > > > >>> >> >
> > > > > >>> >> >
> > > > > >>> >> > On Wed, Jul 9, 2025 at 12:59, Weiqing Yang <[email protected]> wrote:
> > > > > >>> >> >
> > > > > >>> >> > > Hi all,
> > > > > >>> >> > >
> > > > > >>> >> > > I’d like to initiate a discussion about adding UDF metrics.
> > > > > >>> >> > >
> > > > > >>> >> > > *Motivation*
> > > > > >>> >> > >
> > > > > >>> >> > > User-defined functions (UDFs) are essential for custom logic in Flink jobs
> > > > > >>> >> > > but often act as black boxes, making debugging and performance tuning
> > > > > >>> >> > > difficult. When issues like high latency or frequent exceptions occur, it's
> > > > > >>> >> > > hard to pinpoint the root cause inside UDFs.
> > > > > >>> >> > >
> > > > > >>> >> > > Flink currently lacks built-in metrics for key UDF aspects such as
> > > > > >>> >> > > per-record processing time or exception count. This limits observability
> > > > > >>> >> > > and complicates:
> > > > > >>> >> > >
> > > > > >>> >> > >    - Debugging production issues
> > > > > >>> >> > >    - Performance tuning and resource allocation
> > > > > >>> >> > >    - Supplying reliable signals to autoscaling systems
> > > > > >>> >> > >
> > > > > >>> >> > > Introducing standard, opt-in UDF metrics will improve platform
> > > > > >>> >> > > observability and overall health.
> > > > > >>> >> > > Here’s the proposal document: Link
> > > > > >>> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
> > > > > >>> >> > >
> > > > > >>> >> > > Your feedback and ideas are welcome to refine this feature.
> > > > > >>> >> > >
> > > > > >>> >> > >
> > > > > >>> >> > > Thanks,
> > > > > >>> >> > > Weiqing
> > > > > >>> >> > >
> > > > > >>> >> >
> > > > > >>> >>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
>
