Hi Sihua,

Thanks for your suggestion. "Incremental checkpoint" is what I will try out next, and I know it will give better performance. However, it might not solve this issue completely because, as I said, the average end-to-end latency of checkpointing is currently less than 1.5 minutes, which is far below my timeout configuration. I believe "incremental checkpoint" will reduce the latency and make this issue occur less often, but I can't promise it won't happen again if my state grows larger in the future. Am I right?
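For reference, a minimal sketch of enabling incremental checkpointing with the RocksDB state backend (assuming Flink 1.4 with the flink-statebackend-rocksdb dependency on the classpath; the s3 URI and class name here are only placeholders, not the job's real values):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class IncrementalCheckpointSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // The second constructor argument turns on incremental checkpoints, so each
            // checkpoint uploads only the RocksDB sst files created since the last one
            // instead of iterating over and writing the full state.
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true);
            env.setStateBackend(backend);

            // ... build the pipeline and call env.execute() as before
        }
    }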
Best Regards,
Tony Wei

2018-03-06 10:55 GMT+08:00 周思华 <summerle...@163.com>:

> Hi Tony,
>
> Sorry to jump in. One thing I want to point out is that, from the log you provided, it looks like you are using "full checkpoint". This means that the state data that needs to be checkpointed and transferred to s3 will grow over time, and even the first checkpoint is slower than an incremental checkpoint (because it needs to iterate over all the records in rocksdb using the RocksDBMergeIterator). Maybe you can try out "incremental checkpoint"; it could give you better performance.
>
> Best Regards,
> Sihua Zhou
>
> Sent from NetEase Mail Master
>
> On 03/6/2018 10:34, Tony Wei <tony19920...@gmail.com> wrote:
>
> Hi Stefan,
>
> I see. That explains why the load on the machines grew. However, I think it is not the root cause that led to these consecutive checkpoint timeouts. As I said in my first mail, the checkpointing process usually took 1.5 mins to upload states, and this operator and the kafka consumer are the only two operators in my pipeline that have state. In the best case, I should never encounter a timeout problem caused only by lots of pending checkpointing threads that have already timed out. Am I right?
>
> Since the logs and stack traces were taken nearly 3 hours after the first checkpoint timeout, I'm afraid we can't actually find the root cause of the first checkpoint timeout from them. Because we are preparing to take this pipeline to production, I was wondering if you could help me find out where the root cause lies: bad machines, s3, the flink-s3-presto package, or the flink checkpointing thread. It would be great if we could find it from the information I provided, and a hypothesis based on your experience is welcome as well. The most important thing is that I have to decide whether I need to change my persistence filesystem or use another s3 filesystem package, because the last thing I want to see is the checkpoint timeout happening very often.
>
> Thank you very much for all your advice.
>
> Best Regards,
> Tony Wei
>
> 2018-03-06 1:07 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
>
>> Hi,
>>
>> thanks for all the info. I had a look into the problem and opened https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your stack trace, you can see many checkpointing threads are running on your TM for checkpoints that have already timed out, and I think this cascades and slows down everything. It seems the implementation of features like checkpoint timeouts and not failing tasks on checkpointing problems overlooked that we also need to properly communicate the checkpoint cancellation to all tasks, which was not needed before.
>>
>> Best,
>> Stefan
>>
>>
>> On 05.03.2018 at 14:42, Tony Wei <tony19920...@gmail.com> wrote:
>>
>> Hi Stefan,
>>
>> Here is my checkpointing configuration.
>>
>> Checkpointing Mode: Exactly Once
>> Interval: 20m 0s
>> Timeout: 10m 0s
>> Minimum Pause Between Checkpoints: 0ms
>> Maximum Concurrent Checkpoints: 1
>> Persist Checkpoints Externally: Enabled (delete on cancellation)
>>
>> Best Regards,
>> Tony Wei
>>
>> 2018-03-05 21:30 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
>>
>>> Hi,
>>>
>>> quick question: what is your exact checkpointing configuration? In particular, what are your values for the maximum parallel checkpoints and the minimum time interval to wait between two checkpoints?
>>>
>>> Best,
>>> Stefan
>>>
>>> > On 05.03.2018 at 06:34, Tony Wei <tony19920...@gmail.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Last weekend, my flink job's checkpoints started failing because of timeouts. I have no idea what happened, but I collected some information about my cluster and job. I hope someone can give me advice or hints about the problem I encountered.
>>> >
>>> > My cluster version is flink-release-1.4.0. The cluster has 10 TMs, each with 4 cores. These machines are ec2 spot instances. The job's parallelism is set to 32, using rocksdb as the state backend and s3 presto as the checkpoint file system.
>>> > The state's size is nearly 15gb and still grows day by day. Normally, it takes 1.5 mins to finish the whole checkpoint process. The timeout configuration is set to 10 mins.
>>> >
>>> > <chk_snapshot.png>
>>> >
>>> > As the picture shows, not every subtask's checkpoint broke because of the timeout, but every machine had all of its subtasks break at some point during last weekend. Some machines recovered by themselves and some recovered after I restarted them.
>>> >
>>> > I recorded logs, stack traces and snapshots of the machines' status (CPU, IO, Network, etc.) for both a good and a bad machine. If there is a need for more information, please let me know. Thanks in advance.
>>> >
>>> > Best Regards,
>>> > Tony Wei.
>>> > <bad_tm_log.log><bad_tm_pic.png><bad_tm_stack.log><good_tm_log.log><good_tm_pic.png><good_tm_stack.log>
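The checkpointing configuration quoted above corresponds roughly to the following job setup (a sketch assuming Flink 1.4 and the flink-statebackend-rocksdb dependency; the class name and s3 URI are placeholders, not the values used in this job):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpointing Mode: Exactly Once, Interval: 20m 0s
            env.enableCheckpointing(20 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig cfg = env.getCheckpointConfig();
            cfg.setCheckpointTimeout(10 * 60 * 1000L);   // Timeout: 10m 0s
            cfg.setMinPauseBetweenCheckpoints(0L);       // Minimum Pause Between Checkpoints: 0ms
            cfg.setMaxConcurrentCheckpoints(1);          // Maximum Concurrent Checkpoints: 1
            // Persist Checkpoints Externally: Enabled (delete on cancellation)
            cfg.enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);

            // RocksDB state backend writing full (non-incremental) checkpoints to s3,
            // as described in the original report; pass true instead to go incremental.
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", false));

            // ... sources, stateful operators, sinks, then env.execute(...)
        }
    }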