Hi Stefan,

Here is my checkpointing configuration:
Checkpointing Mode: Exactly Once
Interval: 20m 0s
Timeout: 10m 0s
Minimum Pause Between Checkpoints: 0 ms
Maximum Concurrent Checkpoints: 1
Persist Checkpoints Externally: Enabled (delete on cancellation)

Best Regards,
Tony Wei

2018-03-05 21:30 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:

> Hi,
>
> quick question: what is your exact checkpointing configuration? In
> particular, what are your values for the maximum number of concurrent
> checkpoints and the minimum pause between two checkpoints?
>
> Best,
> Stefan
>
> > On 2018-03-05 at 06:34, Tony Wei <tony19920...@gmail.com> wrote:
> >
> > Hi all,
> >
> > Last weekend, my Flink job's checkpoints started failing because of
> > timeouts. I have no idea what happened, but I collected some information
> > about my cluster and job. I hope someone can give me advice or hints
> > about the problem I encountered.
> >
> > My cluster runs flink-release-1.4.0. It has 10 TMs, each with 4 cores.
> > These machines are EC2 spot instances. The job's parallelism is set to
> > 32, using RocksDB as the state backend and S3 (Presto) as the checkpoint
> > file system. The state size is nearly 15 GB and still grows day by day.
> > Normally, it takes 1.5 minutes to finish the whole checkpoint process.
> > The timeout is configured as 10 minutes.
> >
> > <chk_snapshot.png>
> >
> > As the picture shows, not every subtask's checkpoint failed due to
> > timeout, but every machine did, at some point during last weekend, fail
> > for all of its subtasks. Some machines recovered by themselves and some
> > recovered only after I restarted them.
> >
> > I recorded logs, stack traces and snapshots of machine status (CPU, IO,
> > Network, etc.) for both a good and a bad machine. If more information is
> > needed, please let me know. Thanks in advance.
> >
> > Best Regards,
> > Tony Wei.
> > <bad_tm_log.log><bad_tm_pic.png><bad_tm_stack.log><good_tm_log.log><good_tm_pic.png><good_tm_stack.log>
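P.S. In case it helps to see it in code form: the settings above correspond roughly to the following `CheckpointConfig` calls in the Flink 1.4 DataStream API. This is only a sketch of how such a configuration is typically expressed; the actual job setup may differ (the `env` creation and the surrounding job code are assumed here):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Interval: 20m 0s, Mode: Exactly Once
        env.enableCheckpointing(20 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        // Timeout: 10m 0s
        config.setCheckpointTimeout(10 * 60 * 1000L);
        // Minimum Pause Between Checkpoints: 0 ms
        config.setMinPauseBetweenCheckpoints(0L);
        // Maximum Concurrent Checkpoints: 1
        config.setMaxConcurrentCheckpoints(1);
        // Persist Checkpoints Externally: Enabled (delete on cancellation)
        config.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
    }
}
```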