Mingliang Liu created FLINK-36679:
-------------------------------------

             Summary: Add a metric to track checkpoint _metadata size
                 Key: FLINK-36679
                 URL: https://issues.apache.org/jira/browse/FLINK-36679
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 1.19.1, 1.20.0, 1.18.1
            Reporter: Mingliang Liu

Currently we expose multiple metrics for checkpoint size. One particularly interesting data point is the size of the `_metadata` file, which could also be exposed as a metric.

The `_metadata` file stores several kinds of data, including operator states, coordinator states, and properties. Its size should stay within a reasonable range; otherwise the job may take too long to restore from a checkpoint and/or fail to start when the file size exceeds the RPC frame limit. However, we have seen the `_metadata` file bloat multiple times, causing jobs to be slow and/or fail to start. The community reported similar problems in FLINK-32658. Tracking the metadata size would be helpful for operations.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)