Hi Bekir,

First, the e2e time for a subtask is $ack_time_received_in_JM - $trigger_time_in_JM. A checkpoint includes several steps on the task side: 1) receiving the first barrier; 2) barrier alignment (for exactly-once); 3) the sync part of the operator snapshot; 4) the async part of the operator snapshot.

The images you shared yesterday showed that the sync part took too long. Now both the sync part and the async part take a long time, and the e2e time is much longer than sync_time + async_time.

1. Check whether your job has a backpressure problem (backpressure can make the barriers flow too slowly to the downstream tasks). If it does, you should solve that first.
2. If there is no backpressure problem, check the `Alignment Duration` to see whether barrier alignment took too long.
3. For the sync part, you can check the disk performance (if there is no metric for it, you can look at the `sar` log on your machine).
4. For the async part, you can check the network performance (or any client-side network flow control).
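To make the gap concrete, here is a minimal Python sketch (not Flink code; the class and field names are hypothetical, and only the numbers for subtask 14 come from your message) that subtracts sync + async from the e2e time. The remainder is roughly the time spent before the snapshot starts, i.e. waiting for the first barrier and aligning barriers, which is why items 1 and 2 are worth checking first.

```python
from dataclasses import dataclass


@dataclass
class SubtaskCheckpointStats:
    """Per-subtask durations in minutes, copied by hand from the checkpoint view.

    Hypothetical container for illustration only; not part of any Flink API.
    """
    subtask: int
    sync_min: float
    async_min: float
    end_to_end_min: float

    def unaccounted_min(self) -> float:
        # e2e = ack_time_received_in_JM - trigger_time_in_JM, so whatever is
        # not sync or async was spent elsewhere (barrier travel, alignment, ...).
        return self.end_to_end_min - (self.sync_min + self.async_min)


# Numbers reported for subtask 14: sync 4 min, async 3 min, e2e 53 min.
stats = SubtaskCheckpointStats(subtask=14, sync_min=4, async_min=3, end_to_end_min=53)
print(f"subtask {stats.subtask}: {stats.unaccounted_min()} min outside sync/async")
```

For subtask 14 this leaves 46 of the 53 minutes outside the snapshot itself.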
Hope this helps.

Best,
Congxian

Bekir Oguz <bekir.o...@persgroep.net> wrote on Thu, Jul 18, 2019 at 6:05 PM:

> Hi Congxian,
> Starting from this morning we have more issues with checkpointing in
> production. What we see is that the sync and async durations for some
> subtasks are very long, but what is strange is that the total of the sync
> and async durations is much less than the total end-to-end duration.
> Please check the following snapshot:
>
>
> For example, for subtask 14: sync duration is 4 mins, async duration is 3
> mins, end-to-end duration is 53 mins!!!
> We have a very long timeout value (1 hour) for checkpointing, but still
> many checkpoints are failing; some subtasks cannot finish checkpointing in
> 1 hour.
>
> We really appreciate your help here, this is a critical production problem
> for us at the moment.
>
> Regards,
> Bekir
>
>
> On 17 Jul 2019, at 17:46, Bekir Oguz <bekir.o...@persgroep.net> wrote:
>
> > And I also extracted events fr