Hi Zakelly,
Thanks for the feedback and sorry for the late response - I am now picking
it back up. You raised a great point about the performance overhead,
referencing FLINK-16444 <https://issues.apache.org/jira/browse/FLINK-16444>.
I've updated the FLIP to adopt the same counter-based sampling approach
used by Flink's state latency tracking (FLINK-21736
<https://issues.apache.org/jira/browse/FLINK-21736>). Specifically:

1. New config: table.exec.udf-metric.sample-interval (default: 100 [1]) -
   only every Nth invocation is measured
2. Fast path: non-sampled invocations are a single integer increment -
   negligible overhead
3. Sampled path: System.nanoTime() around the UDF call, stored in a
   DescriptiveStatisticsHistogram with a bounded 128-entry circular
   buffer [2]
4. Metric type change: udfProcessingTime is now a Histogram (reports
   p50/p75/p95/p99/mean/min/max) instead of the original Gauge
5. Exception counting: not sampled, since exceptions are rare events and
   counting each one has negligible cost

Combined with the existing feature gate (table.exec.udf-metric-enabled
defaulting to false), users have two layers of protection: the feature is
off by default, and when enabled, sampling keeps overhead minimal.

The updated FLIP is here: link
<https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>

Would this address your concern? If so, it would be great to have your
vote on the vote thread [3].

[1] 100: state.latency-track.sample-interval default value
[2] 128: state.latency-track.history-size default value (line 55), which
is the circular buffer size for the DescriptiveStatisticsHistogram
[3] https://lists.apache.org/thread/d0sv36839p5h03t3okv89pco2jy6vbg3

Thanks,
Weiqing

On Thu, Aug 21, 2025 at 12:24 AM Zakelly Lan <[email protected]> wrote:

> Hi Weiqing,
>
> Sorry for the late reply.
> And I have one question:
>
> I'm wondering whether the UDF processing time is measured for every
> individual UDF invocation, with the average then reported, or if
> sampling is used instead? I'm concerned about the potential overhead if
> we measure every single invocation. We've encountered similar
> performance issues when implementing state latency tracking [1].
>
> [1] https://issues.apache.org/jira/browse/FLINK-16444
>
> Best,
> Zakelly
>
> On Fri, Aug 15, 2025 at 5:04 AM Weiqing Yang <[email protected]> wrote:
>
> > Cool - I’ll proceed to start the VOTE.
> > Thanks!
> >
> > Weiqing
> >
> > On Thu, Aug 14, 2025 at 12:53 AM Shengkai Fang <[email protected]> wrote:
> >
> > > I don't have any more comments.
> > >
> > > Best,
> > > Shengkai
> > >
> > > Weiqing Yang <[email protected]> wrote on Thu, Aug 14, 2025 at 14:47:
> > >
> > > > Thanks, Shengkai. I’ve updated the proposal doc with the
> > > > recommended configuration name. Please let me know if you have any
> > > > additional feedback.
> > > >
> > > > Best,
> > > > Weiqing
> > > >
> > > > On Wed, Aug 13, 2025 at 6:58 PM Shengkai Fang <[email protected]> wrote:
> > > >
> > > > > Sorry for the late response. I prefer to use
> > > > > `table.exec.udf-metric-enabled` as the option name.
> > > > >
> > > > > Best,
> > > > > Shengkai
> > > > >
> > > > > Weiqing Yang <[email protected]> wrote on Wed, Aug 13, 2025 at 23:54:
> > > > >
> > > > > > Hi Shengkai, Alan, Xuyang, and all,
> > > > > >
> > > > > > Since there have been no further objections, I’ll proceed to
> > > > > > start the VOTE on this proposal shortly.
> > > > > >
> > > > > > Thanks,
> > > > > > Weiqing
> > > > > >
> > > > > > On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Shengkai, Alan and Xuyang,
> > > > > > >
> > > > > > > Just checking in - do you have any concerns or feedback?
> > > > > > >
> > > > > > > If there are no further objections from anyone, I’ll mark
> > > > > > > the FLIP as ready for voting.
> > > > > > >
> > > > > > > Best,
> > > > > > > Weiqing
> > > > > > >
> > > > > > > On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <[email protected]> wrote:
> > > > > > >
> > > > > > >> Hi Xuyang,
> > > > > > >>
> > > > > > >> Thank you for reviewing the proposal!
> > > > > > >>
> > > > > > >> I’m planning to use: *udf.metrics.process-time* and
> > > > > > >> *udf.metrics.exception-count*. These follow the naming
> > > > > > >> convention used in Flink (e.g., RocksDB native metrics
> > > > > > >> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
> > > > > > >> I’ve added these names to the proposal doc.
> > > > > > >>
> > > > > > >> Alternatively, I also considered:
> > > > > > >> *metrics.udf.process-time.enabled* and
> > > > > > >> *metrics.udf.exception-count.enabled*.
> > > > > > >>
> > > > > > >> Happy to hear any feedback on which style might be more
> > > > > > >> appropriate.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Weiqing
> > > > > > >>
> > > > > > >> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <[email protected]> wrote:
> > > > > > >>
> > > > > > >>> Hi, Weiqing.
> > > > > > >>>
> > > > > > >>> Thanks for driving to improve this. I just have one
> > > > > > >>> question. I notice a new configuration is introduced in
> > > > > > >>> this flip. I just wonder what the configuration name is.
> > > > > > >>> Could you please include the full name of this
> > > > > > >>> configuration? (just similar to the other names in
> > > > > > >>> MetricOptions?)
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>>
> > > > > > >>> Best!
> > > > > > >>> Xuyang
> > > > > > >>>
> > > > > > >>> On 2025-07-13 12:03:59, "Weiqing Yang" <[email protected]> wrote:
> > > > > > >>> >Hi Alan,
> > > > > > >>> >
> > > > > > >>> >Thanks for reviewing the proposal and for highlighting the
> > > > > > >>> >ASYNC_TABLE work.
> > > > > > >>> >
> > > > > > >>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR
> > > > > > >>> >and ASYNC_TABLE. For async UDFs, the plan is to instrument
> > > > > > >>> >both the invokeAsync() call and the async callback handler
> > > > > > >>> >to measure the full end-to-end latency until the result or
> > > > > > >>> >error is returned from the future.
> > > > > > >>> >
> > > > > > >>> >Let me know if you have any further questions or
> > > > > > >>> >suggestions.
> > > > > > >>> >
> > > > > > >>> >Best,
> > > > > > >>> >Weiqing
> > > > > > >>> >
> > > > > > >>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
> > > > > > >>> ><[email protected]> wrote:
> > > > > > >>> >
> > > > > > >>> >> Hi Weiqing,
> > > > > > >>> >>
> > > > > > >>> >> From your doc, the entrypoint for UDF calls in the
> > > > > > >>> >> codegen is ExprCodeGenerator which should invoke
> > > > > > >>> >> BridgingSqlFunctionCallGen, which could be instrumented
> > > > > > >>> >> with metrics. This works well for synchronous calls, but
> > > > > > >>> >> what about ASYNC_SCALAR and the soon to be merged
> > > > > > >>> >> ASYNC_TABLE (https://github.com/apache/flink/pull/26567)?
> > > > > > >>> >> Timing metrics would only account for what it takes to
> > > > > > >>> >> call invokeAsync, not for the result to complete (with a
> > > > > > >>> >> result or error from the future object).
> > > > > > >>> >>
> > > > > > >>> >> There are appropriate places which can handle the async
> > > > > > >>> >> callbacks, but they are in other locations. Will you be
> > > > > > >>> >> able to support those as well?
> > > > > > >>> >>
> > > > > > >>> >> Thanks,
> > > > > > >>> >> Alan
> > > > > > >>> >>
> > > > > > >>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang
> > > > > > >>> >> <[email protected]> wrote:
> > > > > > >>> >>
> > > > > > >>> >> > I just have some questions:
> > > > > > >>> >> >
> > > > > > >>> >> > 1. The current metrics hierarchy shows that the UDF
> > > > > > >>> >> > metric group belongs to the TaskMetricGroup. I think
> > > > > > >>> >> > it would be better for the UDF metric group to belong
> > > > > > >>> >> > to the OperatorMetricGroup instead, because a UDF
> > > > > > >>> >> > might be used by multiple operators.
> > > > > > >>> >> > 2. What are the naming conventions for UDF metrics?
> > > > > > >>> >> > Could you provide an example? Does the metric name
> > > > > > >>> >> > contain the UDF name?
> > > > > > >>> >> > 3. Why is the UDFExceptionCount metric introduced? If
> > > > > > >>> >> > a UDF throws an exception, the job fails immediately.
> > > > > > >>> >> > Why do we need to track this value?
> > > > > > >>> >> >
> > > > > > >>> >> > Best
> > > > > > >>> >> > Shengkai
> > > > > > >>> >> >
> > > > > > >>> >> > Weiqing Yang <[email protected]> wrote on Wed,
> > > > > > >>> >> > Jul 9, 2025 at 12:59:
> > > > > > >>> >> >
> > > > > > >>> >> > > Hi all,
> > > > > > >>> >> > >
> > > > > > >>> >> > > I’d like to initiate a discussion about adding UDF
> > > > > > >>> >> > > metrics.
> > > > > > >>> >> > >
> > > > > > >>> >> > > *Motivation*
> > > > > > >>> >> > >
> > > > > > >>> >> > > User-defined functions (UDFs) are essential for
> > > > > > >>> >> > > custom logic in Flink jobs but often act as black
> > > > > > >>> >> > > boxes, making debugging and performance tuning
> > > > > > >>> >> > > difficult. When issues like high latency or frequent
> > > > > > >>> >> > > exceptions occur, it's hard to pinpoint the root
> > > > > > >>> >> > > cause inside UDFs.
> > > > > > >>> >> > >
> > > > > > >>> >> > > Flink currently lacks built-in metrics for key UDF
> > > > > > >>> >> > > aspects such as per-record processing time or
> > > > > > >>> >> > > exception count. This limits observability and
> > > > > > >>> >> > > complicates:
> > > > > > >>> >> > >
> > > > > > >>> >> > > - Debugging production issues
> > > > > > >>> >> > > - Performance tuning and resource allocation
> > > > > > >>> >> > > - Supplying reliable signals to autoscaling systems
> > > > > > >>> >> > >
> > > > > > >>> >> > > Introducing standard, opt-in UDF metrics will
> > > > > > >>> >> > > improve platform observability and overall health.
> > > > > > >>> >> > > Here’s the proposal document: Link
> > > > > > >>> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
> > > > > > >>> >> > >
> > > > > > >>> >> > > Your feedback and ideas are welcome to refine this
> > > > > > >>> >> > > feature.
> > > > > > >>> >> > >
> > > > > > >>> >> > > Thanks,
> > > > > > >>> >> > > Weiqing
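P.S. For readers following the design discussion: the fast path / sampled
path split from points 1-3 above, plus the end-to-end timing of async UDFs
raised in the quoted thread, can be sketched roughly as below. All names
(UdfCallSampler, call, callAsync) are illustrative only, not the FLIP's
actual implementation, and a plain long[] ring buffer stands in for Flink's
DescriptiveStatisticsHistogram.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of counter-based sampling for UDF timing metrics.
 * Only every Nth invocation is timed; all other calls pay for a single
 * long increment.
 */
class UdfCallSampler {
    private final int sampleInterval; // e.g. table.exec.udf-metric.sample-interval = 100
    private final long[] ring;        // bounded circular buffer of sampled nanos, e.g. 128
    private int ringPos;
    private int samples;
    private long invocations;         // fast-path counter, bumped on every call

    UdfCallSampler(int sampleInterval, int historySize) {
        this.sampleInterval = sampleInterval;
        this.ring = new long[historySize];
    }

    /** Synchronous UDFs: only every Nth call pays for System.nanoTime(). */
    <T> T call(Supplier<T> udf) {
        invocations++;                    // fast path: a single increment
        if (invocations % sampleInterval != 0) {
            return udf.get();             // non-sampled: no timing at all
        }
        long start = System.nanoTime();   // sampled path
        try {
            return udf.get();
        } finally {
            recordSample(System.nanoTime() - start);
        }
    }

    /** Async UDFs: a sampled timing runs until the future completes. */
    <T> CompletableFuture<T> callAsync(Supplier<CompletableFuture<T>> udf) {
        invocations++;
        if (invocations % sampleInterval != 0) {
            return udf.get();
        }
        long start = System.nanoTime();
        return udf.get().whenComplete((result, error) ->
                recordSample(System.nanoTime() - start)); // end-to-end latency
    }

    private void recordSample(long nanos) {
        ring[ringPos] = nanos;            // overwrites the oldest sample
        ringPos = (ringPos + 1) % ring.length;
        if (samples < ring.length) {
            samples++;
        }
    }

    long invocationCount() { return invocations; }
    int sampleCount()      { return samples; }
}
```

With an interval of 100, 250 synchronous invocations produce exactly two
timed samples (at invocations 100 and 200); the other 248 calls only
increment a counter, which is what keeps the enabled-but-sampled overhead
minimal.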
