[ https://issues.apache.org/jira/browse/FLINK-32957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833184#comment-17833184 ]
Piotr Nowojski commented on FLINK-32957:
----------------------------------------

{{mailboxLatencyMs}} shows basically the same thing AFAIK. It is the sampled time that things wait in the mailbox queue before being executed, and timers are fired via the mailbox.

> Add current timer trigger lag to metrics
> ----------------------------------------
>
>                 Key: FLINK-32957
>                 URL: https://issues.apache.org/jira/browse/FLINK-32957
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics, Runtime / State Backends
>            Reporter: Rui Xia
>            Priority: Minor
>
> Timer trigger lag denotes the gap between the actual trigger timestamp and
> the expected trigger timestamp (the timestamp registered with `TimeService`).
> This metric can help users find out whether there is a backlog of timers.
> A backlog of timers may affect downstream data processing. Users customize
> the trigger logic, which may interact with downstream data processing. For
> example, the trigger logic can inject records into downstream operators, and
> a backlog of timers blocks that record injection.
> On the other hand, a backlog of timers makes jobs unstable. Timers are used
> by window operators, which rely on a timer to remove the window state of a
> triggered window. A backlog of timers blocks this data removal, and the state
> size may grow unexpectedly large. A large state size affects the
> performance of the state backend. In a cloud-native environment, a k8s pod is
> prone to reaching its local disk limit due to large state files (RocksDB SST).
> Currently, it is hard for users to observe a backlog of timers. As far as I
> know, a heap dump is the only way to learn about it. Thus, users
> cannot notice a backlog of timers in time. FLINK-32954
> (https://issues.apache.org/jira/browse/FLINK-32954) exposes the number of heap
> timers, but is not suitable for RocksDB timers due to the performance loss.
> Compared with FLINK-32954, timer trigger lag is much more lightweight for
> RocksDB timers.
> * Reason 1: Timer trigger lag does not affect timer registering.
> * Reason 2: The effect on timer triggering is limited. Timer registering is
> a hot code path, while timer triggering is much colder. In general, the
> trigger interval is tens of seconds, so the timer trigger code path is
> invoked only once every tens of seconds. Thus, adding the timer trigger lag
> calculation has little performance overhead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
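
For illustration only, the lag defined in the quoted description (actual trigger timestamp minus the registered timestamp) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Flink code: the class `TimerTriggerLagSketch` and its methods are hypothetical names, and it is not how {{mailboxLatencyMs}} is computed. It only shows why the bookkeeping can live entirely on the timer-firing path and leave timer registration untouched.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Flink): tracks the most recent timer trigger lag. */
public class TimerTriggerLagSketch {
    // Last observed lag in milliseconds; a metrics gauge could expose this value.
    private final AtomicLong lastLagMs = new AtomicLong(0L);

    /** Lag = actual trigger time minus registered (expected) time, floored at zero. */
    public static long lagMs(long registeredTimestampMs, long actualTriggerTimestampMs) {
        return Math.max(0L, actualTriggerTimestampMs - registeredTimestampMs);
    }

    /** Assumed to be called only when a timer fires, so registration pays no cost. */
    public void onTimerFired(long registeredTimestampMs, long nowMs) {
        lastLagMs.set(lagMs(registeredTimestampMs, nowMs));
    }

    /** Returns the lag recorded by the most recent call to onTimerFired. */
    public long currentLagMs() {
        return lastLagMs.get();
    }

    public static void main(String[] args) {
        TimerTriggerLagSketch lag = new TimerTriggerLagSketch();
        // A timer registered for t=10000 ms that actually fires at t=10250 ms.
        lag.onTimerFired(10_000L, 10_250L);
        System.out.println("timer trigger lag (ms): " + lag.currentLagMs());
    }
}
```

The design choice here mirrors Reason 1 and Reason 2 in the description: one subtraction and one atomic write per fired timer, and nothing on the registration path.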