Hi, Weiqing. Thanks for driving this improvement. I have just one question: I notice a new configuration option is introduced in this FLIP. Could you please include its full name in the proposal (similar to the other option names in MetricOptions)?
--
Best!
Xuyang

On 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
>Hi Alan,
>
>Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE work.
>
>Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE.
>For async UDFs, the plan is to instrument both the invokeAsync() call and
>the async callback handler to measure the full end-to-end latency until the
>result or error is returned from the future.
>
>Let me know if you have any further questions or suggestions.
>
>Best,
>Weiqing
>
>On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
><asheinb...@confluent.io.invalid> wrote:
>
>> Hi Weiqing,
>>
>> From your doc, the entry point for UDF calls in the codegen is
>> ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen, which
>> could be instrumented with metrics. This works well for synchronous calls,
>> but what about ASYNC_SCALAR and the soon-to-be-merged ASYNC_TABLE
>> (https://github.com/apache/flink/pull/26567)? Timing metrics would only
>> account for the time it takes to call invokeAsync, not for the result to
>> complete (with a result or error from the future object).
>>
>> There are appropriate places which can handle the async callbacks, but they
>> are in other locations. Will you be able to support those as well?
>>
>> Thanks,
>> Alan
>>
>> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:
>>
>> > I just have some questions:
>> >
>> > 1. The current metrics hierarchy shows that the UDF metric group belongs
>> > to the TaskMetricGroup. I think it would be better for the UDF metric
>> > group to belong to the OperatorMetricGroup instead, because a UDF might
>> > be used by multiple operators.
>> > 2. What are the naming conventions for UDF metrics? Could you provide an
>> > example? Does the metric name contain the UDF name?
>> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
>> > exception, the job fails immediately. Why do we need to track this value?
>> >
>> > Best
>> > Shengkai
>> >
>> >
>> > On Wed, Jul 9, 2025 at 12:59, Weiqing Yang <yangweiqing...@gmail.com> wrote:
>> >
>> > > Hi all,
>> > >
>> > > I’d like to initiate a discussion about adding UDF metrics.
>> > >
>> > > *Motivation*
>> > >
>> > > User-defined functions (UDFs) are essential for custom logic in Flink
>> > > jobs but often act as black boxes, making debugging and performance
>> > > tuning difficult. When issues like high latency or frequent exceptions
>> > > occur, it's hard to pinpoint the root cause inside UDFs.
>> > >
>> > > Flink currently lacks built-in metrics for key UDF aspects such as
>> > > per-record processing time or exception count. This limits observability
>> > > and complicates:
>> > >
>> > > - Debugging production issues
>> > > - Performance tuning and resource allocation
>> > > - Supplying reliable signals to autoscaling systems
>> > >
>> > > Introducing standard, opt-in UDF metrics will improve platform
>> > > observability and overall health.
>> > >
>> > > Here’s the proposal document:
>> > > https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1
>> > >
>> > > Your feedback and ideas are welcome to refine this feature.
>> > >
>> > > Thanks,
>> > > Weiqing
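
[Editor's note] Weiqing's reply above describes timing async UDFs from the invokeAsync() call until the future completes with a result or an error. A minimal sketch of that idea in plain `java.util.concurrent`, with no Flink dependencies: all names here (`timedInvoke`, `lastLatencyNanos`, `failureCount`) are hypothetical stand-ins for illustration, not Flink APIs or the FLIP's actual design.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: measure end-to-end latency of an async UDF call,
// covering both the invokeAsync() call and the future's completion.
public class AsyncUdfTiming {
    // Stand-ins for real metric objects (e.g. a histogram and a counter).
    static final AtomicLong lastLatencyNanos = new AtomicLong();
    static final AtomicLong failureCount = new AtomicLong();

    // Stand-in for the generated call site around an ASYNC_SCALAR UDF:
    // takes the future returned by invokeAsync() and attaches timing.
    static CompletableFuture<String> timedInvoke(CompletableFuture<String> udfResult) {
        long start = System.nanoTime();
        // whenComplete fires on success or failure, so the recorded latency
        // spans from invokeAsync() until the future resolves either way.
        return udfResult.whenComplete((result, error) -> {
            lastLatencyNanos.set(System.nanoTime() - start);
            if (error != null) {
                failureCount.incrementAndGet();
            }
        });
    }

    public static void main(String[] args) {
        // Simulate an async UDF that takes ~20 ms to produce its result.
        CompletableFuture<String> udf = CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(20);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            return "ok";
        });
        String out = timedInvoke(udf).join();
        System.out.println(out + " took " + lastLatencyNanos.get() / 1_000_000 + " ms");
    }
}
```

The point of the sketch is that timing only the call that kicks off the work (Alan's concern) would stop the clock before the future resolves; hooking the completion callback captures the full end-to-end latency, including failures.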