Hi all,

I’d like to start a discussion about adding built-in UDF metrics to Flink.

*Motivation*

User-defined functions (UDFs) are essential for custom logic in Flink jobs
but often act as black boxes, making debugging and performance tuning
difficult. When issues like high latency or frequent exceptions occur, it's
hard to pinpoint the root cause inside UDFs.

Flink currently lacks built-in metrics for key UDF aspects such as
per-record processing time or exception count. This limits observability
and complicates:

   - Debugging production issues
   - Performance tuning and resource allocation
   - Supplying reliable signals to autoscaling systems
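
To make the two example metrics concrete, here is a rough sketch in plain
Java. It is for illustration only and is not taken from the proposal
document: MeteredUdf and its getters are hypothetical names, it wraps a
java.util.function.Function rather than a real Flink function, and it keeps
plain counters instead of registering anything with Flink's metric system.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical wrapper (not a Flink API): times each call and counts exceptions. */
final class MeteredUdf<I, O> implements Function<I, O> {
    private final Function<I, O> udf;
    private final AtomicLong invocations = new AtomicLong();
    private final AtomicLong exceptions = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    MeteredUdf(Function<I, O> udf) {
        this.udf = udf;
    }

    @Override
    public O apply(I record) {
        long start = System.nanoTime();
        try {
            return udf.apply(record);
        } catch (RuntimeException e) {
            // Count the failure, then rethrow so job behavior is unchanged.
            exceptions.incrementAndGet();
            throw e;
        } finally {
            // Runs for both successful and failed calls.
            totalNanos.addAndGet(System.nanoTime() - start);
            invocations.incrementAndGet();
        }
    }

    long getInvocationCount() {
        return invocations.get();
    }

    long getExceptionCount() {
        return exceptions.get();
    }

    /** Mean per-record processing time in nanoseconds; 0 if nothing was processed. */
    double getMeanProcessingNanos() {
        long n = invocations.get();
        return n == 0 ? 0.0 : (double) totalNanos.get() / n;
    }
}
```

Wrapping every call like this adds two System.nanoTime() reads per record,
which is one reason such metrics would presumably need to be opt-in.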

Introducing standard, opt-in UDF metrics would make UDF behavior observable
and give operators and autoscalers a reliable view of job health.

The proposal document is here:
<https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>

Feedback and ideas to refine this feature are very welcome.


Thanks,
Weiqing
