Hi, quick question: what is your exact checkpointing configuration? In particular, what is your value for the maximum parallel checkpoints and the minimum time interval to wait between two checkpoints?
Best, Stefan > Am 05.03.2018 um 06:34 schrieb Tony Wei <tony19920...@gmail.com>: > > Hi all, > > Last weekend, my flink job's checkpoint start failing because of timeout. I > have no idea what happened, but I collect some informations about my cluster > and job. Hope someone can give me advices or hints about the problem that I > encountered. > > My cluster version is flink-release-1.4.0. Cluster has 10 TMs, each has 4 > cores. These machines are ec2 spot instances. The job's parallelism is set as > 32, using rocksdb as state backend and s3 presto as checkpoint file system. > The state's size is nearly 15gb and still grows day-by-day. Normally, It > takes 1.5 mins to finish the whole checkpoint process. The timeout > configuration is set as 10 mins. > > <chk_snapshot.png> > > As the picture shows, not each subtask of checkpoint broke caused by timeout, > but each machine has ever broken for all its subtasks during last weekend. > Some machines recovered by themselves and some machines recovered after I > restarted them. > > I record logs, stack trace and snapshot for machine's status (CPU, IO, > Network, etc.) for both good and bad machine. If there is a need for more > informations, please let me know. Thanks in advance. > > Best Regards, > Tony Wei. > <bad_tm_log.log><bad_tm_pic.png><bad_tm_stack.log><good_tm_log.log><good_tm_pic.png><good_tm_stack.log>