[ https://issues.apache.org/jira/browse/FLINK-36679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingliang Liu updated FLINK-36679:
----------------------------------
    Description: 
Currently we expose multiple metrics for the checkpoint size. One particularly interesting data point is the {{_metadata}} file size, which can also be added as a metric. The {{_metadata}} file stores multiple types of data, including operator states, coordinator states and properties. Its size should stay within a reasonable range; otherwise the job may take too long to restore from checkpoints and/or fail to start when the size exceeds the RPC frame limit.

However, we have repeatedly seen the {{_metadata}} file bloat to between 100MB and 3GB, causing jobs to slow down and/or fail to start. The community has reported similar problems on the user mailing list (e.g. [[1]|https://lists.apache.org/thread/dttjs2v412xd7slrrx94837ch8wjfo11], [[2]|https://lists.apache.org/thread/yj66dnbs7xrmbspdltq3yfptccm25llt]) and in FLINK-32658.

Tracking the metadata size can be helpful for operations.

  was:
Currently we expose multiple metrics for the checkpoint size. One particularly interesting data point is the {{_metadata}} file size, which can also be added as a metric. The {{_metadata}} file stores multiple types of data, including operator states, coordinator states and properties. Its size should stay within a reasonable range; otherwise the job may take too long to restore from checkpoints and/or fail to start when the size exceeds the RPC frame limit.

However, we have repeatedly seen the {{_metadata}} file bloat to between 100MB and 3GB, causing jobs to slow down and/or fail to start. The community has reported similar problems in FLINK-32658.

Tracking the metadata size can be helpful for operations.

> Add a metric to track checkpoint _metadata size
> -----------------------------------------------
>
>                 Key: FLINK-36679
>                 URL: https://issues.apache.org/jira/browse/FLINK-36679
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.1, 1.20.0, 1.19.1
>            Reporter: Mingliang Liu
>            Priority: Major
>              Labels: pull-request-available
>
> Currently we expose multiple metrics for the checkpoint size. One particularly
> interesting data point is the {{_metadata}} file size, which can also be
> added as a metric. The {{_metadata}} file stores multiple types of data,
> including operator states, coordinator states and properties. Its size should
> stay within a reasonable range; otherwise the job may take too long to restore
> from checkpoints and/or fail to start when the size exceeds the RPC frame limit.
> However, we have repeatedly seen the {{_metadata}} file bloat to between 100MB
> and 3GB, causing jobs to slow down and/or fail to start. The community has
> reported similar problems on the user mailing list (e.g.
> [[1]|https://lists.apache.org/thread/dttjs2v412xd7slrrx94837ch8wjfo11],
> [[2]|https://lists.apache.org/thread/yj66dnbs7xrmbspdltq3yfptccm25llt]) and in
> FLINK-32658.
> Tracking the metadata size can be helpful for operations.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
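
Until such a metric exists, the size of a {{_metadata}} file can be checked by hand. The following is a minimal sketch, not the change proposed in this issue: it assumes the checkpoint directory is on a local filesystem (real deployments often use remote storage), and the class name {{MetadataSizeCheck}}, its methods and the 1 KiB threshold in {{main}} are hypothetical names and values chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: reports the size of a checkpoint's _metadata file. */
public class MetadataSizeCheck {

    /** File name of the checkpoint metadata file, as named in the issue. */
    public static final String METADATA_FILE_NAME = "_metadata";

    /**
     * Returns the size in bytes of the _metadata file inside the given
     * checkpoint directory, or -1 if no such regular file exists.
     * Assumption: the directory is reachable through the local filesystem.
     */
    public static long metadataSizeBytes(Path checkpointDir) throws IOException {
        Path metadata = checkpointDir.resolve(METADATA_FILE_NAME);
        return Files.isRegularFile(metadata) ? Files.size(metadata) : -1L;
    }

    /** True if the measured size is strictly larger than the given threshold. */
    public static boolean exceeds(long sizeBytes, long thresholdBytes) {
        return sizeBytes > thresholdBytes;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with a fake 2 KiB _metadata file.
        Path dir = Files.createTempDirectory("chk-demo");
        Files.write(dir.resolve(METADATA_FILE_NAME), new byte[2048]);

        long size = metadataSizeBytes(dir);
        long thresholdBytes = 1024L; // hypothetical alerting threshold
        System.out.println("_metadata size: " + size + " bytes, exceeds threshold: "
                + exceeds(size, thresholdBytes));
    }
}
```

In practice the threshold would be chosen relative to the RPC frame limit configured for the cluster; the value above is only a placeholder.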