[ https://issues.apache.org/jira/browse/FLINK-33856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803201#comment-17803201 ]
Piotr Nowojski commented on FLINK-33856:
----------------------------------------

Hi,

I second that implementing this as metrics doesn't sound right.

[~hejufang001], I wouldn't make this a subtask of FLIP-384, but a follow-up if needed.

There are two things worth noting/discussing:
* Please check the discussion on the dev mailing list for FLIP-384 about the current limitations. Namely, we currently create a trace with only a single span for the whole checkpoint, and it is very sparsely populated with metrics. There were discussions/plans about creating child spans per subtask/task, to mimic the existing `CheckpointingMetrics` structure. This FLIP probably requires that change.
* Once we have per-subtask spans, or aggregated metrics as in [the recovery spans from FLIP-386|https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans], we might not need some of the metrics that you are proposing here. For example, `writeRate` should be easily computed from the async duration and the checkpointed state size.

Anyway, I think a FLIP will be required here.

> Add metrics to monitor the interaction performance between task and external
> storage system in the process of checkpoint making
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33856
>                 URL: https://issues.apache.org/jira/browse/FLINK-33856
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>            Reporter: Jufang He
>            Assignee: Jufang He
>            Priority: Major
>              Labels: pull-request-available
>
> When Flink takes a checkpoint, the performance of the interaction with the
> external file system has a large impact on the overall checkpoint duration.
> Adding performance indicators where the task interacts with the external
> file storage system would therefore make the bottleneck easy to observe.
> These indicators include: the file write rate, the latency of writing the
> file, and the latency of closing the file.
> Adding these metrics on the Flink side has the following advantages: it
> makes it convenient to compare the E2E time of different tasks, and there is
> no need to distinguish the type of external storage system, since the
> metrics can be unified in FsCheckpointStreamFactory.
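
To illustrate the point in the comment that `writeRate` can be derived from the async duration and the checkpointed state size, here is a minimal sketch. It is not Flink code: the class `WriteRateSketch`, the method `bytesPerSecond`, and the choice of bytes and milliseconds as units are assumptions made for the example, as is returning 0.0 when no duration was measured.

```java
// Hypothetical helper, not part of Flink: derives a write rate from two
// values that a per-subtask checkpoint span could already carry.
final class WriteRateSketch {

    private WriteRateSketch() {}

    /**
     * Returns the average write rate in bytes per second.
     *
     * @param checkpointedSizeBytes size of the state written (assumed unit: bytes)
     * @param asyncDurationMillis   duration of the async phase (assumed unit: ms)
     */
    static double bytesPerSecond(long checkpointedSizeBytes, long asyncDurationMillis) {
        if (asyncDurationMillis <= 0) {
            // No measurable duration: report 0.0 instead of dividing by zero.
            return 0.0;
        }
        // Scale milliseconds to seconds; use floating point to avoid truncation.
        return checkpointedSizeBytes * 1000.0 / asyncDurationMillis;
    }

    public static void main(String[] args) {
        // 1,000,000 bytes written during a 500 ms async phase.
        System.out.println(bytesPerSecond(1_000_000L, 500L));
    }
}
```

If both inputs are reported per subtask, a derived value like this could be computed by whatever consumes the spans, without a dedicated metric on the task side.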