[ https://issues.apache.org/jira/browse/FLINK-36679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896756#comment-17896756 ]
Mingliang Liu commented on FLINK-36679:
---------------------------------------

We have an in-progress patch in our internal Flink light fork, contributed by my colleague Julian Jaffe. I can port it as a draft PR against the `master` branch for discussion.

> Add a metric to track checkpoint _metadata size
> -----------------------------------------------
>
>                 Key: FLINK-36679
>                 URL: https://issues.apache.org/jira/browse/FLINK-36679
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.1, 1.20.0, 1.19.1
>            Reporter: Mingliang Liu
>            Priority: Major
>
> Currently we expose multiple metrics for the checkpoint size. One particularly
> interesting data point is the `_metadata` file size, which can also be added
> as a metric. The `_metadata` file stores multiple types of data, including
> operator states, coordinator states, and properties. Its size should be kept
> within a reasonable range; otherwise the job may take too long to restore
> from checkpoints and/or fail to start when the size exceeds the RPC frame limit.
> However, we have seen the `_metadata` file bloat multiple times, causing the
> job to be slow and/or fail to start. The community reported similar problems
> in FLINK-32658. Tracking the metadata size can be helpful for operations.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)