Mingliang Liu created FLINK-36679:
-------------------------------------

             Summary: Add a metric to track checkpoint _metadata size
                 Key: FLINK-36679
                 URL: https://issues.apache.org/jira/browse/FLINK-36679
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 1.19.1, 1.20.0, 1.18.1
            Reporter: Mingliang Liu


Currently we expose multiple metrics for the checkpoint size. One particularly 
interesting data point is the `_metadata` file size, which could also be exposed 
as a metric. The `_metadata` file stores multiple types of data, including 
operator states, coordinator states and properties. Its size should stay within 
a reasonable range; otherwise the job may take too long to restore from 
checkpoints and/or fail to start when the size exceeds the RPC frame limit.

However, we have repeatedly seen the `_metadata` file bloat, causing jobs to be 
slow and/or fail to start. The community reported similar problems in FLINK-32658.

Tracking the `_metadata` size as a metric would help operators notice this 
growth before it affects job restores.
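
Until such a metric exists, the size can be checked by hand. The following is a minimal sketch, not a proposed implementation: it assumes the checkpoint directory is on a locally accessible filesystem, and the class name `MetadataSizeCheck`, its methods and the 10 MiB threshold are all hypothetical and chosen for illustration only. The threshold should be set to match the RPC frame size configured for the cluster.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of Flink: inspects the size of a checkpoint's _metadata file. */
public class MetadataSizeCheck {

    // Assumption: illustrative threshold only; align it with the configured RPC frame size.
    public static final long EXAMPLE_LIMIT_BYTES = 10L * 1024 * 1024;

    /** Returns the size in bytes of the _metadata file inside the given checkpoint directory. */
    public static long metadataSizeBytes(Path checkpointDir) throws IOException {
        return Files.size(checkpointDir.resolve("_metadata"));
    }

    /** Returns true when the given size is strictly larger than the limit. */
    public static boolean exceedsLimit(long sizeBytes, long limitBytes) {
        return sizeBytes > limitBytes;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: MetadataSizeCheck <checkpoint-dir>");
            return;
        }
        long size = metadataSizeBytes(Path.of(args[0]));
        System.out.println("_metadata size in bytes: " + size);
        if (exceedsLimit(size, EXAMPLE_LIMIT_BYTES)) {
            System.out.println("WARNING: _metadata is larger than the example limit");
        }
    }
}
```

Running it against a retained checkpoint directory (for example a `chk-<n>` directory) prints the file size and a warning if the example limit is exceeded. A built-in metric would remove the need for this kind of out-of-band check.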



--
This message was sent by Atlassian Jira
(v8.20.10#820010)