[ 
https://issues.apache.org/jira/browse/FLINK-36679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu updated FLINK-36679:
----------------------------------
    Description: 
Currently we expose multiple metrics for the checkpoint size. One particularly 
interesting data point is the {{_metadata}} file size, which could also be 
exposed as a metric. The {{_metadata}} file stores multiple types of data, 
including operator states, coordinator states and properties. Its size should 
stay within a reasonable range; otherwise the job may take too long to restore 
from checkpoints and/or fail to start when the size exceeds the RPC frame limit.

However, we have repeatedly seen the {{_metadata}} file bloat to between 
100 MB and 3 GB, causing jobs to start slowly and/or fail to start. The 
community has reported similar problems on the user mailing list (e.g. 
[[1]|https://lists.apache.org/thread/dttjs2v412xd7slrrx94837ch8wjfo11], 
[[2]|https://lists.apache.org/thread/yj66dnbs7xrmbspdltq3yfptccm25llt]) and 
in FLINK-32658.

Tracking the metadata size can be helpful for operations.

  was:
Currently we expose multiple metrics for the checkpoint size. One particularly 
interesting data point is the {{_metadata}} file size, which could also be 
exposed as a metric. The {{_metadata}} file stores multiple types of data, 
including operator states, coordinator states and properties. Its size should 
stay within a reasonable range; otherwise the job may take too long to restore 
from checkpoints and/or fail to start when the size exceeds the RPC frame limit.

However, we have repeatedly seen the {{_metadata}} file bloat to between 
100 MB and 3 GB, causing jobs to start slowly and/or fail to start. The 
community has reported similar problems in FLINK-32658.

Tracking the metadata size can be helpful for operations.


> Add a metric to track checkpoint _metadata size
> -----------------------------------------------
>
>                 Key: FLINK-36679
>                 URL: https://issues.apache.org/jira/browse/FLINK-36679
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.1, 1.20.0, 1.19.1
>            Reporter: Mingliang Liu
>            Priority: Major
>              Labels: pull-request-available
>
> Currently we expose multiple metrics for the checkpoint size. One 
> particularly interesting data point is the {{_metadata}} file size, which 
> could also be exposed as a metric. The {{_metadata}} file stores multiple 
> types of data, including operator states, coordinator states and properties. 
> Its size should stay within a reasonable range; otherwise the job may take 
> too long to restore from checkpoints and/or fail to start when the size 
> exceeds the RPC frame limit.
> However, we have repeatedly seen the {{_metadata}} file bloat to between 
> 100 MB and 3 GB, causing jobs to start slowly and/or fail to start. The 
> community has reported similar problems on the user mailing list (e.g. 
> [[1]|https://lists.apache.org/thread/dttjs2v412xd7slrrx94837ch8wjfo11], 
> [[2]|https://lists.apache.org/thread/yj66dnbs7xrmbspdltq3yfptccm25llt]) and 
> in FLINK-32658.
> Tracking the metadata size can be helpful for operations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
