Hi Zakelly,

Thanks for the feedback and sorry for the late response - I am now picking
it back up.

You raised a great point about the performance overhead, referencing
FLINK-16444 <https://issues.apache.org/jira/browse/FLINK-16444>. I've
updated the FLIP to adopt the same counter-based sampling approach used by
Flink's state latency tracking (FLINK-21736
<https://issues.apache.org/jira/browse/FLINK-21736>). Specifically:

  1. New config: table.exec.udf-metric.sample-interval (default: 100 [1]) -
only every Nth invocation is measured
  2. Fast path: non-sampled invocations cost only a single integer
increment, so the overhead is negligible
  3. Sampled path: System.nanoTime() is taken around the UDF call, and the
measured latency is stored in a DescriptiveStatisticsHistogram backed by a
bounded 128-entry circular buffer [2]
  4. Metric type change: udfProcessingTime is now a Histogram (reporting
p50/p75/p95/p99/mean/min/max) instead of the original Gauge
  5. Exception counting: not sampled, since exceptions are rare events and
counting each one has negligible cost
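To make the fast/sampled split concrete, here is a minimal, hypothetical Java sketch of the sampling wrapper. The names (SamplingTimer, measure) are illustrative only, and the plain long[] ring buffer stands in for Flink's DescriptiveStatisticsHistogram so the snippet is self-contained:

```java
// Hypothetical sketch of the counter-based sampling path; names are
// illustrative, not from the FLIP. The long[] ring buffer stands in for
// Flink's DescriptiveStatisticsHistogram.
class SamplingTimer {
    private final int sampleInterval; // table.exec.udf-metric.sample-interval
    private int counter;              // fast path: one int increment per call
    private final long[] history;     // bounded circular buffer (128 entries in the FLIP)
    private int historyPos;
    private int historyCount;

    SamplingTimer(int sampleInterval, int historySize) {
        this.sampleInterval = sampleInterval;
        this.history = new long[historySize];
    }

    /** Wraps one UDF invocation; only every Nth call pays the nanoTime cost. */
    <T> T measure(java.util.function.Supplier<T> udfCall) {
        if (++counter < sampleInterval) {
            return udfCall.get(); // fast path: no timing at all
        }
        counter = 0;
        long start = System.nanoTime(); // sampled path
        T result = udfCall.get();
        record(System.nanoTime() - start);
        return result;
    }

    private void record(long nanos) {
        history[historyPos] = nanos;
        historyPos = (historyPos + 1) % history.length;
        historyCount = Math.min(historyCount + 1, history.length);
    }

    /** Number of recorded samples (capped at the buffer size). */
    int sampleCount() { return historyCount; }
}
```

With the default interval of 100, a run of 1000 invocations records only 10 latency samples; the other 990 calls touch nothing but the counter.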

Combined with the existing feature gate (table.exec.udf-metric-enabled
defaulting to false), users have two layers of protection: the feature is
off by default, and when enabled, sampling keeps overhead minimal.
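For reference, enabling the feature would then look roughly like this in the Flink configuration (option names as proposed in the FLIP; the interval value shown is just its default):

```yaml
table.exec.udf-metric-enabled: true          # feature gate, default: false
table.exec.udf-metric.sample-interval: 100   # measure every 100th invocation
```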
The updated FLIP is here: link
<https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>

Would this address your concern? If so, it would be great to have your vote
on the vote thread [3].

[1] 100: state.latency-track.sample-interval default value

[2] 128: state.latency-track.history-size default value, which is the
circular buffer size backing the DescriptiveStatisticsHistogram
[3] https://lists.apache.org/thread/d0sv36839p5h03t3okv89pco2jy6vbg3

Thanks,
Weiqing

On Thu, Aug 21, 2025 at 12:24 AM Zakelly Lan <[email protected]> wrote:

> Hi Weiqing,
>
> Sorry for the late reply. And I have one question:
>
> I'm wondering whether the UDF processing time is measured for every
> individual UDF invocation, with the average then reported, or if sampling
> is used instead? I'm concerned about the potential overhead if we measure
> every single invocation. We've encountered similar performance issues when
> implementing state latency tracking [1].
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-16444
>
> Best,
> Zakelly
>
> On Fri, Aug 15, 2025 at 5:04 AM Weiqing Yang <[email protected]>
> wrote:
>
> > Cool - I’ll proceed to start the VOTE.
> > Thanks!
> >
> > Weiqing
> >
> > On Thu, Aug 14, 2025 at 12:53 AM Shengkai Fang <[email protected]>
> wrote:
> >
> > > I don't have any more comments.
> > >
> > > Best,
> > > Shengkai
> > >
> > > Weiqing Yang <[email protected]> wrote on Thu, Aug 14, 2025 at 14:47:
> > >
> > > > Thanks, Shengkai. I’ve updated the proposal doc with the recommended
> > > > configuration name. Please let me know if you have any additional
> > > feedback.
> > > >
> > > > Best,
> > > > Weiqing
> > > >
> > > > On Wed, Aug 13, 2025 at 6:58 PM Shengkai Fang <[email protected]>
> > wrote:
> > > >
> > > > > Sorry for the late response. I prefer to use
> > > > > `table.exec.udf-metric-enabled` as the option name.
> > > > >
> > > > > Best,
> > > > > Shengkai
> > > > >
> > > > > Weiqing Yang <[email protected]> wrote on Wed, Aug 13, 2025 at 23:54:
> > > > >
> > > > > > Hi Shengkai, Alan, Xuyang, and all,
> > > > > >
> > > > > > Since there have been no further objections, I’ll proceed to
> start
> > > the
> > > > > VOTE
> > > > > > on this proposal shortly.
> > > > > >
> > > > > > Thanks,
> > > > > > Weiqing
> > > > > >
> > > > > > On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <
> > > > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Shengkai, Alan and Xuyang,
> > > > > > >
> > > > > > > Just checking in - do you have any concerns or feedback?
> > > > > > >
> > > > > > > If there are no further objections from anyone, I’ll mark the
> > FLIP
> > > as
> > > > > > > ready for voting.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Weiqing
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <
> > > > [email protected]
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Xuyang,
> > > > > > >>
> > > > > > >> Thank you for reviewing the proposal!
> > > > > > >>
> > > > > > >> I’m planning to use: *udf.metrics.process-time* and
> > > > > > >> *udf.metrics.exception-count*. These follow the naming
> > convention
> > > > used
> > > > > > >> in Flink (e.g., RocksDB native metrics
> > > > > > >> <
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics
> > > > > > >).
> > > > > > >> I’ve added these names to the proposal doc.
> > > > > > >>
> > > > > > >> Alternatively, I also considered:
> > > *metrics.udf.process-time.enabled*
> > > > > and
> > > > > > >> *metrics.udf.exception-count.enabled. *
> > > > > > >>
> > > > > > >> Happy to hear any feedback on which style might be more
> > > appropriate.
> > > > > > >>
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Weiqing
> > > > > > >>
> > > > > > >> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <[email protected]>
> > > wrote:
> > > > > > >>
> > > > > > >>> Hi, Weiqing.
> > > > > > >>>
> > > > > > >>> Thanks for driving to improve this. I just have one
> question. I
> > > > > notice
> > > > > > a
> > > > > > >>> new configuration is introduced in this flip. I just wonder
> > what
> > > > the
> > > > > > >>> configuration name is. Could you please include the full name
> > of
> > > > this
> > > > > > >>> configuration? (just similar to the other names in
> > > MetricOptions?)
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>>
> > > > > > >>>     Best!
> > > > > > >>>     Xuyang
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> On 2025-07-13 12:03:59, "Weiqing Yang" <
> [email protected]
> > >
> > > > wrote:
> > > > > > >>> >Hi Alan,
> > > > > > >>> >
> > > > > > >>> >Thanks for reviewing the proposal and for highlighting the
> > > > > ASYNC_TABLE
> > > > > > >>> work.
> > > > > > >>> >
> > > > > > >>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR
> and
> > > > > > >>> ASYNC_TABLE.
> > > > > > >>> >For async UDFs, the plan is to instrument both the
> > invokeAsync()
> > > > > call
> > > > > > >>> and
> > > > > > >>> >the async callback handler to measure the full end-to-end
> > > latency
> > > > > > until
> > > > > > >>> the
> > > > > > >>> >result or error is returned from the future.
> > > > > > >>> >
> > > > > > >>> >Let me know if you have any further questions or
> suggestions.
> > > > > > >>> >
> > > > > > >>> >Best,
> > > > > > >>> >Weiqing
> > > > > > >>> >
> > > > > > >>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
> > > > > > >>> ><[email protected]> wrote:
> > > > > > >>> >
> > > > > > >>> >> Hi Weiqing,
> > > > > > >>> >>
> > > > > > >>> >> From your doc, the entrypoint for UDF calls in the codegen
> > is
> > > > > > >>> >> ExprCodeGenerator which should invoke
> > > > BridgingSqlFunctionCallGen,
> > > > > > >>> which
> > > > > > >>> >> could be instrumented with metrics.  This works well for
> > > > > synchronous
> > > > > > >>> calls,
> > > > > > >>> >> but what about ASYNC_SCALAR and the soon to be merged
> > > > ASYNC_TABLE
> > > > > (
> > > > > > >>> >> https://github.com/apache/flink/pull/26567)?  Timing
> > metrics
> > > > > would
> > > > > > >>> only
> > > > > > >>> >> account for what it takes to call invokeAsync, not for the
> > > > result
> > > > > to
> > > > > > >>> >> complete (with a result or error from the future object).
> > > > > > >>> >>
> > > > > > >>> >> There are appropriate places which can handle the async
> > > > callbacks,
> > > > > > >>> but they
> > > > > > >>> >> are in other locations.  Will you be able to support those
> > as
> > > > > well?
> > > > > > >>> >>
> > > > > > >>> >> Thanks,
> > > > > > >>> >> Alan
> > > > > > >>> >>
> > > > > > >>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <
> > > [email protected]
> > > > >
> > > > > > >>> wrote:
> > > > > > >>> >>
> > > > > > >>> >> > I just have some questions:
> > > > > > >>> >> >
> > > > > > >>> >> > 1. The current metrics hierarchy shows that the UDF
> metric
> > > > group
> > > > > > >>> belongs
> > > > > > >>> >> to
> > > > > > >>> >> > the TaskMetricGroup. I think it would be better for the
> > UDF
> > > > > metric
> > > > > > >>> group
> > > > > > >>> >> to
> > > > > > >>> >> > belong to the OperatorMetricGroup instead, because a UDF
> > > might
> > > > > be
> > > > > > >>> used by
> > > > > > >>> >> > multiple operators.
> > > > > > >>> >> > 2. What are the naming conventions for UDF metrics?
> Could
> > > you
> > > > > > >>> provide an
> > > > > > >>> >> > example? Does the metric name contain the UDF name?
> > > > > > >>> >> > 3. Why is the UDFExceptionCount metric introduced? If a
> > UDF
> > > > > throws
> > > > > > >>> an
> > > > > > >>> >> > exception, the job fails immediately. Why do we need to
> > > track
> > > > > this
> > > > > > >>> value?
> > > > > > >>> >> >
> > > > > > >>> >> > Best
> > > > > > >>> >> > Shengkai
> > > > > > >>> >> >
> > > > > > >>> >> >
> > > > > > >>> >> > Weiqing Yang <[email protected]> wrote on Wed, Jul 9,
> > > 2025 at 12:59:
> > > > > > >>> >> >
> > > > > > >>> >> > > Hi all,
> > > > > > >>> >> > >
> > > > > > >>> >> > > I’d like to initiate a discussion about adding UDF
> > > metrics.
> > > > > > >>> >> > >
> > > > > > >>> >> > > *Motivation*
> > > > > > >>> >> > >
> > > > > > >>> >> > > User-defined functions (UDFs) are essential for custom
> > > logic
> > > > > in
> > > > > > >>> Flink
> > > > > > >>> >> > jobs
> > > > > > >>> >> > > but often act as black boxes, making debugging and
> > > > performance
> > > > > > >>> tuning
> > > > > > >>> >> > > difficult. When issues like high latency or frequent
> > > > > exceptions
> > > > > > >>> occur,
> > > > > > >>> >> > it's
> > > > > > >>> >> > > hard to pinpoint the root cause inside UDFs.
> > > > > > >>> >> > >
> > > > > > >>> >> > > Flink currently lacks built-in metrics for key UDF
> > aspects
> > > > > such
> > > > > > as
> > > > > > >>> >> > > per-record processing time or exception count. This
> > limits
> > > > > > >>> >> observability
> > > > > > >>> >> > > and complicates:
> > > > > > >>> >> > >
> > > > > > >>> >> > >    - Debugging production issues
> > > > > > >>> >> > >    - Performance tuning and resource allocation
> > > > > > >>> >> > >    - Supplying reliable signals to autoscaling systems
> > > > > > >>> >> > >
> > > > > > >>> >> > > Introducing standard, opt-in UDF metrics will improve
> > > > platform
> > > > > > >>> >> > > observability and overall health.
> > > > > > >>> >> > > Here’s the proposal document: Link
> > > > > > >>> >> > > <
> > > > > > >>> >> > >
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1
> > > > > > >>> >> > > >
> > > > > > >>> >> > >
> > > > > > >>> >> > > Your feedback and ideas are welcome to refine this
> > > feature.
> > > > > > >>> >> > >
> > > > > > >>> >> > >
> > > > > > >>> >> > > Thanks,
> > > > > > >>> >> > > Weiqing
> > > > > > >>> >> > >
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
