[ 
https://issues.apache.org/jira/browse/FLINK-29545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614666#comment-17614666
 ] 

xiaogang zhou edited comment on FLINK-29545 at 10/9/22 9:11 AM:
----------------------------------------------------------------

1. Yes. I have debugged this job many times, and every time the consumer stops 
exactly when a checkpoint is triggered. 

 

2. I don't think the processor is blocked at logCheckpointProcessingDelay. I 
mentioned it because some subtasks succeed and display a checkpoint duration, 
while others only show n/a (see the attached picture). I also found that, at 
the source task in the DAG, the normal subtasks reach 

SubtaskCheckpointCoordinatorImpl#checkpointState

 

but the 'n/a' subtasks only call 

StreamTask#triggerCheckpointAsync

I am not sure why the checkpoint was never run by the mailbox executor.
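
To illustrate the symptom above in general terms, here is a minimal toy model of a single-threaded mailbox. It is not Flink's implementation and not a diagnosis of this issue; the class name MailboxSketch and the method checkpointMailRan are hypothetical. It only shows that a mail enqueued from another thread cannot run while the mailbox thread is still busy inside an earlier action, which would look like "triggerCheckpointAsync was called but nothing ran afterwards".

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy single-threaded mailbox. Hypothetical sketch, NOT Flink's actual code. */
final class MailboxSketch {

    /**
     * Enqueues a blocking action, then a "checkpoint" mail, and reports whether
     * the checkpoint mail ran within waitMillis.
     */
    static boolean checkpointMailRan(long blockMillis, long waitMillis) {
        BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
        AtomicBoolean checkpointRan = new AtomicBoolean(false);
        CountDownLatch blockingStarted = new CountDownLatch(1);

        // The only thread that is allowed to execute mails, one at a time.
        Thread mailboxThread = new Thread(() -> {
            try {
                while (true) {
                    mailbox.take().run();
                }
            } catch (InterruptedException e) {
                // Interrupted by the caller: stop the loop.
            }
        }, "toy-mailbox-thread");
        mailboxThread.setDaemon(true);
        mailboxThread.start();

        // First mail: stands in for any action that keeps the thread busy.
        mailbox.add(() -> {
            blockingStarted.countDown();
            try {
                Thread.sleep(blockMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        boolean ran;
        try {
            blockingStarted.await();
            // Second mail: stands in for the checkpoint triggered from another thread.
            mailbox.add(() -> checkpointRan.set(true));
            Thread.sleep(waitMillis);
            ran = checkpointRan.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            mailboxThread.interrupt();
        }
        return ran;
    }

    public static void main(String[] args) {
        System.out.println("blocked mailbox:   " + checkpointMailRan(2000, 200));
        System.out.println("unblocked mailbox: " + checkpointMailRan(50, 1000));
    }
}
```

If the real task thread is stuck in a similar way, a thread dump of an affected TaskManager should show what it is waiting on.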

 

Also, I have 500 TaskManagers, so it is hard to decide which one's thread stack 
I should dump.
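
One way to narrow down the candidates, sketched under assumptions: take the subtask indices that show n/a in the checkpoint details and map them to the TaskManagers that host them, then dump only those. The class DumpTargets, the method taskManagersToDump, and the sample IDs are hypothetical; the two inputs would have to be collected by hand from the web UI or from whatever tooling is available.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: which TaskManagers host the unacknowledged subtasks? */
final class DumpTargets {

    /**
     * @param subtaskToTaskManager subtask index -> TaskManager ID (collected by the user)
     * @param unackedSubtasks      subtask indices that show n/a for the checkpoint
     * @return the distinct TaskManager IDs worth a thread dump, sorted
     */
    static SortedSet<String> taskManagersToDump(
            Map<Integer, String> subtaskToTaskManager,
            Collection<Integer> unackedSubtasks) {
        SortedSet<String> targets = new TreeSet<>();
        for (Integer subtask : unackedSubtasks) {
            String taskManager = subtaskToTaskManager.get(subtask);
            if (taskManager != null) { // skip subtasks with unknown placement
                targets.add(taskManager);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Sample data only; real IDs come from the running job.
        Map<Integer, String> placement = new HashMap<>();
        placement.put(0, "tm-a");
        placement.put(1, "tm-b");
        placement.put(2, "tm-a");
        placement.put(3, "tm-c");
        System.out.println(taskManagersToDump(placement, Arrays.asList(2, 3, 7)));
    }
}
```

With many subtasks per TaskManager, the result is usually much smaller than the full list of 500, and a single dump from any one of the returned hosts may already be enough.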

 



> kafka consuming stop when trigger first checkpoint
> --------------------------------------------------
>
>                 Key: FLINK-29545
>                 URL: https://issues.apache.org/jira/browse/FLINK-29545
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Network
>    Affects Versions: 1.13.3
>            Reporter: xiaogang zhou
>            Priority: Critical
>         Attachments: backpressure 100 busy 0.png, task acknowledge na.png, 
> task dag.png
>
>
> The task DAG is shown in the attached file. When the job starts consuming 
> from the earliest offset, it stops when the first checkpoint is triggered.
>  
> Is this normal? The sink is 0% busy and the second operator shows 100% backpressure.
>  
> Checking the checkpoint summary, we can see that some of the subtasks show n/a.
> I tried to debug this issue and found that, in triggerCheckpointAsync, 
> triggerCheckpointAsyncInMailbox took a long time to be called.
>  
>  
> It looks like this has something to do with 
> logCheckpointProcessingDelay. Is there a fix for this issue?
>  
>  
> Can anybody help me with this issue?
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
