Hi, Weiqing. Thanks for driving this improvement. I have just one question: I notice a new configuration option is introduced in this FLIP. Could you please include its full name in the proposal (similar to the other option names in MetricOptions)?
--
Best!
Xuyang

On 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
>Hi Alan,
>
>Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE work.
>
>Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE.
>For async UDFs, the plan is to instrument both the invokeAsync() call and
>the async callback handler to measure the full end-to-end latency until the
>result or error is returned from the future.
>
>Let me know if you have any further questions or suggestions.
>
>Best,
>Weiqing
>
>On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
><asheinb...@confluent.io.invalid> wrote:
>
>> Hi Weiqing,
>>
>> From your doc, the entry point for UDF calls in the codegen is
>> ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen, which
>> could be instrumented with metrics. This works well for synchronous calls,
>> but what about ASYNC_SCALAR and the soon-to-be-merged ASYNC_TABLE
>> (https://github.com/apache/flink/pull/26567)? Timing metrics would only
>> account for the time it takes to call invokeAsync, not for the result to
>> complete (with a result or error from the future object).
>>
>> There are appropriate places which can handle the async callbacks, but they
>> are in other locations. Will you be able to support those as well?
>>
>> Thanks,
>> Alan
>>
>> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:
>>
>> > I just have some questions:
>> >
>> > 1. The current metrics hierarchy shows that the UDF metric group belongs
>> > to the TaskMetricGroup. I think it would be better for the UDF metric
>> > group to belong to the OperatorMetricGroup instead, because a UDF might
>> > be used by multiple operators.
>> > 2. What are the naming conventions for UDF metrics? Could you provide an
>> > example? Does the metric name contain the UDF name?
>> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
>> > exception, the job fails immediately. Why do we need to track this value?
>> >
>> > Best
>> > Shengkai
>> >
>> >
>> > On Wed, Jul 9, 2025 at 12:59, Weiqing Yang <yangweiqing...@gmail.com> wrote:
>> >
>> > > Hi all,
>> > >
>> > > I’d like to initiate a discussion about adding UDF metrics.
>> > >
>> > > *Motivation*
>> > >
>> > > User-defined functions (UDFs) are essential for custom logic in Flink
>> > > jobs but often act as black boxes, making debugging and performance
>> > > tuning difficult. When issues like high latency or frequent exceptions
>> > > occur, it's hard to pinpoint the root cause inside UDFs.
>> > >
>> > > Flink currently lacks built-in metrics for key UDF aspects such as
>> > > per-record processing time or exception count. This limits observability
>> > > and complicates:
>> > >
>> > > - Debugging production issues
>> > > - Performance tuning and resource allocation
>> > > - Supplying reliable signals to autoscaling systems
>> > >
>> > > Introducing standard, opt-in UDF metrics will improve platform
>> > > observability and overall health.
>> > >
>> > > Here’s the proposal document:
>> > > https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1
>> > >
>> > > Your feedback and ideas are welcome to refine this feature.
>> > >
>> > > Thanks,
>> > > Weiqing
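
[Editor's note] Weiqing's reply above describes timing async UDFs from the invokeAsync() call until the future completes with a result or an error. A minimal sketch of that idea in plain `java.util.concurrent`, with no Flink dependencies: all names here (`timedInvoke`, `lastLatencyNanos`, `failureCount`) are hypothetical stand-ins for illustration, not Flink APIs or the FLIP's actual design.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: measure end-to-end latency of an async UDF call,
// covering both the invokeAsync() call and the future's completion.
public class AsyncUdfTiming {
    // Stand-ins for real metric objects (e.g. a histogram and a counter).
    static final AtomicLong lastLatencyNanos = new AtomicLong();
    static final AtomicLong failureCount = new AtomicLong();

    // Stand-in for the generated call site around an ASYNC_SCALAR UDF:
    // takes the future returned by invokeAsync() and attaches timing.
    static CompletableFuture<String> timedInvoke(CompletableFuture<String> udfResult) {
        long start = System.nanoTime();
        // whenComplete fires on success or failure, so the recorded latency
        // spans from invokeAsync() until the future resolves either way.
        return udfResult.whenComplete((result, error) -> {
            lastLatencyNanos.set(System.nanoTime() - start);
            if (error != null) {
                failureCount.incrementAndGet();
            }
        });
    }

    public static void main(String[] args) {
        // Simulate an async UDF that takes ~20 ms to produce its result.
        CompletableFuture<String> udf = CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(20);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            return "ok";
        });
        String out = timedInvoke(udf).join();
        System.out.println(out + " took " + lastLatencyNanos.get() / 1_000_000 + " ms");
    }
}
```

The point of the sketch is that timing only the call that kicks off the work (Alan's concern) would stop the clock before the future resolves; hooking the completion callback captures the full end-to-end latency, including failures.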