[ https://issues.apache.org/jira/browse/FLINK-29545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614666#comment-17614666 ]
xiaogang zhou edited comment on FLINK-29545 at 10/9/22 9:11 AM:
----------------------------------------------------------------
1. Yes, I have debugged this task many times; every time, the consumer stops when the first checkpoint is triggered.

2. I don't think the processing thread is blocked at logCheckpointProcessingDelay. I mention it because some subtasks succeed and display a checkpoint duration, while others only show n/a (see the attached picture). I also found that the normal subtasks reach SubtaskCheckpointCoordinatorImpl#checkpointState at the source task of the DAG, but the 'n/a' subtasks only reach StreamTask#triggerCheckpointAsync; I am not sure why it was not run by the mailbox executor. And I have 500 TaskManagers, so it is hard to judge which one's thread stack I should dump.


> kafka consuming stop when trigger first checkpoint
> --------------------------------------------------
>
>                 Key: FLINK-29545
>                 URL: https://issues.apache.org/jira/browse/FLINK-29545
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Network
>    Affects Versions: 1.13.3
>            Reporter: xiaogang zhou
>            Priority: Critical
>         Attachments: backpressure 100 busy 0.png, task acknowledge na.png, task dag.png
>
>
> The task DAG is shown in the attached file. When the task starts consuming from the earliest offset, it stops consuming as soon as the first checkpoint triggers.
>
> Is this normal? The sink shows 0% busy and the second operator shows 100% backpressure.
>
> Checking the checkpoint summary, some of the subtasks show n/a.
> I tried to debug this issue and found that after triggerCheckpointAsync was called, it took a long time for triggerCheckpointAsyncInMailbox to run.
>
> This looks like it has something to do with logCheckpointProcessingDelay. Has there been any fix for this issue?
>
> Can anybody help me with this issue?
>
> Thanks
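For illustration, the following is a minimal, self-contained Java sketch (not Flink source code) of the mailbox hand-off described above. It assumes only that a single "mailbox" thread both processes records and runs enqueued checkpoint actions; the class name MailboxCheckpointSketch and the simulated blocking work are hypothetical, while StreamTask#triggerCheckpointAsync, triggerCheckpointAsyncInMailbox, and SubtaskCheckpointCoordinatorImpl#checkpointState are the real method names mentioned in the comment.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Toy model (not Flink code) of the single-threaded mailbox a StreamTask uses.
 * The analogue of triggerCheckpointAsync() only enqueues the checkpoint action;
 * the actual work (the analogue of triggerCheckpointAsyncInMailbox and
 * checkpointState) runs later on the same thread that also processes records.
 * If that thread is occupied, e.g. blocked by back pressure, the checkpoint
 * action never executes and the subtask would show "n/a" in the summary.
 */
public class MailboxCheckpointSketch {

    // Single thread stands in for the task's mailbox thread.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    /** Analogue of StreamTask#triggerCheckpointAsync: enqueue only, return immediately. */
    CompletableFuture<Boolean> triggerCheckpointAsync(long checkpointId) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        mailbox.execute(() -> {
            // Runs only once the mailbox thread is free again.
            System.out.println("performing checkpoint " + checkpointId);
            result.complete(true);
        });
        return result;
    }

    public static void main(String[] args) throws Exception {
        MailboxCheckpointSketch task = new MailboxCheckpointSketch();

        // Simulate record processing that monopolises the mailbox thread
        // (in the real job this would be a back-pressured operator).
        task.mailbox.execute(() -> {
            try {
                System.out.println("processing records / blocked on back pressure ...");
                Thread.sleep(5_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The trigger returns at once, but the checkpoint itself is delayed
        // until the blocking work above finishes.
        CompletableFuture<Boolean> done = task.triggerCheckpointAsync(1);
        System.out.println("triggerCheckpointAsync returned, checkpoint done? " + done.isDone());

        done.get(10, TimeUnit.SECONDS);
        task.mailbox.shutdown();
    }
}

Under this (assumed) model, a thread dump of the affected TaskManager would show the mailbox thread inside the record-processing work rather than the checkpoint action, which matches the observation that the 'n/a' subtasks never reach checkpointState.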