Hi Sihua,

Thanks for your suggestion. "Incremental checkpoint" is what I will try out next, and I know it will give better performance. However, it might not solve this issue completely because, as I said, the average end-to-end latency of checkpointing is currently less than 1.5 minutes, which is far below my timeout configuration. I believe "incremental checkpoint" will reduce the latency and make this issue occur less often, but I can't promise it won't happen again if my state grows larger in the future. Am I right?
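For reference, a minimal sketch of enabling incremental checkpointing with the RocksDB state backend (assuming Flink 1.4 with the flink-statebackend-rocksdb dependency on the classpath; the s3 URI and class name here are only placeholders, not the job's real values):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class IncrementalCheckpointSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // The second constructor argument turns on incremental checkpoints, so each
            // checkpoint uploads only the RocksDB sst files created since the last one
            // instead of iterating over and writing the full state.
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true);
            env.setStateBackend(backend);

            // ... build the pipeline and call env.execute() as before
        }
    }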
Best Regards,
Tony Wei

2018-03-06 10:55 GMT+08:00 周思华 <summerle...@163.com>:

> Hi Tony,
>
> Sorry to jump in. One thing I want to point out is that, from the log you provided, it looks like you are using "full checkpoint". This means that the state data that needs to be checkpointed and transferred to s3 will grow over time, and even the first checkpoint is slower than an incremental checkpoint (because it needs to iterate over all the records in rocksdb using the RocksDBMergeIterator). Maybe you can try out "incremental checkpoint"; it could give you better performance.
>
> Best Regards,
> Sihua Zhou
>
> Sent from NetEase Mail Master
>
> On 03/6/2018 10:34, Tony Wei <tony19920...@gmail.com> wrote:
>
> Hi Stefan,
>
> I see. That explains why the load on the machines grew. However, I think it is not the root cause that led to these consecutive checkpoint timeouts. As I said in my first mail, the checkpointing process usually took 1.5 mins to upload states, and this operator and the kafka consumer are the only two operators in my pipeline that have state. In the best case, I should never encounter a timeout problem caused only by lots of pending checkpointing threads that have already timed out. Am I right?
>
> Since the logs and stack traces were taken nearly 3 hours after the first checkpoint timeout, I'm afraid we can't actually find the root cause of the first checkpoint timeout from them. Because we are preparing to take this pipeline to production, I was wondering if you could help me find out where the root cause lies: bad machines, s3, the flink-s3-presto package, or the flink checkpointing thread. It would be great if we could find it from the information I provided, and a hypothesis based on your experience is welcome as well. The most important thing is that I have to decide whether I need to change my persistence filesystem or use another s3 filesystem package, because the last thing I want to see is the checkpoint timeout happening very often.
>
> Thank you very much for all your advice.
>
> Best Regards,
> Tony Wei
>
> 2018-03-06 1:07 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
>
>> Hi,
>>
>> thanks for all the info. I had a look into the problem and opened https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your stack trace, you can see many checkpointing threads are running on your TM for checkpoints that have already timed out, and I think this cascades and slows down everything. It seems the implementation of features like checkpoint timeouts and not failing tasks on checkpointing problems overlooked that we also need to properly communicate the checkpoint cancellation to all tasks, which was not needed before.
>>
>> Best,
>> Stefan
>>
>>
>> On 05.03.2018 at 14:42, Tony Wei <tony19920...@gmail.com> wrote:
>>
>> Hi Stefan,
>>
>> Here is my checkpointing configuration.
>>
>> Checkpointing Mode: Exactly Once
>> Interval: 20m 0s
>> Timeout: 10m 0s
>> Minimum Pause Between Checkpoints: 0ms
>> Maximum Concurrent Checkpoints: 1
>> Persist Checkpoints Externally: Enabled (delete on cancellation)
>>
>> Best Regards,
>> Tony Wei
>>
>> 2018-03-05 21:30 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
>>
>>> Hi,
>>>
>>> quick question: what is your exact checkpointing configuration? In particular, what are your values for the maximum parallel checkpoints and the minimum time interval to wait between two checkpoints?
>>>
>>> Best,
>>> Stefan
>>>
>>> > On 05.03.2018 at 06:34, Tony Wei <tony19920...@gmail.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Last weekend, my flink job's checkpoints started failing because of timeouts. I have no idea what happened, but I collected some information about my cluster and job. I hope someone can give me advice or hints about the problem I encountered.
>>> >
>>> > My cluster version is flink-release-1.4.0. The cluster has 10 TMs, each with 4 cores. These machines are ec2 spot instances. The job's parallelism is set to 32, using rocksdb as the state backend and s3 presto as the checkpoint file system.
>>> > The state's size is nearly 15gb and still grows day by day. Normally, it takes 1.5 mins to finish the whole checkpoint process. The timeout configuration is set to 10 mins.
>>> >
>>> > <chk_snapshot.png>
>>> >
>>> > As the picture shows, not every subtask's checkpoint broke because of the timeout, but every machine had all of its subtasks break at some point during last weekend. Some machines recovered by themselves and some recovered after I restarted them.
>>> >
>>> > I recorded logs, stack traces and snapshots of the machines' status (CPU, IO, Network, etc.) for both a good and a bad machine. If there is a need for more information, please let me know. Thanks in advance.
>>> >
>>> > Best Regards,
>>> > Tony Wei.
>>> > <bad_tm_log.log><bad_tm_pic.png><bad_tm_stack.log><good_tm_log.log><good_tm_pic.png><good_tm_stack.log>
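The checkpointing configuration quoted above corresponds roughly to the following job setup (a sketch assuming Flink 1.4 and the flink-statebackend-rocksdb dependency; the class name and s3 URI are placeholders, not the values used in this job):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpointing Mode: Exactly Once, Interval: 20m 0s
            env.enableCheckpointing(20 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig cfg = env.getCheckpointConfig();
            cfg.setCheckpointTimeout(10 * 60 * 1000L);   // Timeout: 10m 0s
            cfg.setMinPauseBetweenCheckpoints(0L);       // Minimum Pause Between Checkpoints: 0ms
            cfg.setMaxConcurrentCheckpoints(1);          // Maximum Concurrent Checkpoints: 1
            // Persist Checkpoints Externally: Enabled (delete on cancellation)
            cfg.enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);

            // RocksDB state backend writing full (non-incremental) checkpoints to s3,
            // as described in the original report; pass true instead to go incremental.
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", false));

            // ... sources, stateful operators, sinks, then env.execute(...)
        }
    }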