Hi Tony,


About your question: an average end-to-end checkpoint latency of less than
1.5 mins doesn't mean that checkpoints won't time out. Indeed, it is determined
by the maximum end-to-end latency (the slowest task), since a checkpoint is only
truly complete after every task's checkpoint has completed.


About the problem: after a second look at the information you provided, we can
see from the checkpoint detail picture that there is one task which took 4m20s
to transfer its snapshot (about 482 MB) to S3, and there are 4 other tasks that
hadn't completed their checkpoints yet. And comparing bad_tm_pic.png with
good_tm_pic.png, we can see that the network throughput on the "bad tm" is far
lower than on the "good tm" (~15 MB vs ~50 MB). So I guess the network is the
problem; sometimes it fails to send 500 MB of data to S3 within 10 minutes.
(Maybe you need to check whether the network environment is stable.)


About the solution: I think incremental checkpoints can help you a lot, since
each checkpoint only sends the data that is new since the previous one. But you
are right: if the incremental state size grows beyond 500 MB, it might cause
the timeout problem again (because of the poor network performance).


Best Regards,
Sihua Zhou


Sent from NetEase Mail Master


On 03/6/2018 13:02, Tony Wei <tony19920...@gmail.com> wrote:
Hi Sihua,


Thanks for your suggestion. "Incremental checkpoint" is what I will try out
next, and I know it will give better performance. However, it might not solve
this issue completely, because, as I said, the average end-to-end latency of
checkpointing is currently less than 1.5 mins, which is far below my timeout
configuration. I believe "incremental checkpoint" will reduce the latency and
make this issue occur more rarely, but I can't promise it won't happen again
if my state grows bigger in the future. Am I right?


Best Regards,
Tony Wei 


2018-03-06 10:55 GMT+08:00 周思华 <summerle...@163.com>:

Hi Tony,


Sorry for jumping in. One thing I want to point out is that, from the log you
provided, it looks like you are using "full checkpoints". This means that the
state data that needs to be checkpointed and transferred to S3 will grow over
time, and even the first checkpoint is slower than an incremental checkpoint
(because it needs to iterate over all the records in RocksDB using the
RocksDBMergeIterator). Maybe you can try out "incremental checkpoints"; they
could give you much better performance.


Best Regards,
Sihua Zhou


Sent from NetEase Mail Master


On 03/6/2018 10:34, Tony Wei <tony19920...@gmail.com> wrote:
Hi Stefan,


I see. That explains why the load on the machines grew. However, I think it is
not the root cause of these consecutive checkpoint timeouts. As I said in my
first mail, the checkpointing process usually took 1.5 mins to upload states,
and this operator and the Kafka consumer are the only two operators in my
pipeline that have state. In the best case, I should never encounter a timeout
problem that is caused only by lots of pending checkpointing threads that have
already timed out. Am I right?


Since these logs and stack traces were taken nearly 3 hours after the first
checkpoint timeout, I'm afraid we can't actually find the root cause of that
first timeout. Because we are preparing to take this pipeline to production, I
was wondering if you could help me find out where the root cause lies: bad
machines, S3, the flink-s3-presto package, or Flink's checkpointing threads.
It would be great if we could find it from the information I provided, and a
hypothesis based on your experience is welcome as well. The most important
thing is that I have to decide whether I need to change my persistence
filesystem or use another S3 filesystem package, because the last thing I want
to see is checkpoint timeouts happening very often.


Thank you very much for all your advice.


Best Regards,
Tony Wei


2018-03-06 1:07 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:

Hi,


thanks for all the info. I had a look into the problem and opened
https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your stack
trace, you can see that many checkpointing threads are still running on your
TM for checkpoints that have already timed out, and I think this cascades and
slows everything down. It seems the implementation of features like checkpoint
timeouts and not failing tasks on checkpointing problems overlooked that we
also need to properly communicate checkpoint cancellation to all tasks, which
was not needed before.


Best,
Stefan




On 05.03.2018 at 14:42, Tony Wei <tony19920...@gmail.com> wrote:


Hi Stefan,


Here is my checkpointing configuration.



| Checkpointing Mode | Exactly Once |
| Interval | 20m 0s |
| Timeout | 10m 0s |
| Minimum Pause Between Checkpoints | 0ms |
| Maximum Concurrent Checkpoints | 1 |
| Persist Checkpoints Externally | Enabled (delete on cancellation) |
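
(For reference, that table corresponds roughly to the following checkpoint
configuration calls; this is only a sketch of the setup, not the actual job
code.)

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Interval: 20m, Checkpointing Mode: Exactly Once
        env.enableCheckpointing(20 * 60 * 1000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        config.setCheckpointTimeout(10 * 60 * 1000);  // Timeout: 10m
        config.setMinPauseBetweenCheckpoints(0);      // Minimum Pause: 0ms
        config.setMaxConcurrentCheckpoints(1);        // Maximum Concurrent: 1
        // Persist Checkpoints Externally: Enabled (delete on cancellation)
        config.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
    }
}
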
Best Regards,

Tony Wei


2018-03-05 21:30 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com>:
Hi,

quick question: what is your exact checkpointing configuration? In particular,
what are your values for the maximum number of concurrent checkpoints and the
minimum pause to wait between two checkpoints?

Best,
Stefan

> On 05.03.2018 at 06:34, Tony Wei <tony19920...@gmail.com> wrote:
>
> Hi all,
>
> Last weekend, my Flink job's checkpoints started failing because of timeouts.
> I have no idea what happened, but I collected some information about my
> cluster and job. I hope someone can give me advice or hints about the problem
> I encountered.
>
> My cluster version is flink-release-1.4.0. The cluster has 10 TMs, each with
> 4 cores. These machines are EC2 spot instances. The job's parallelism is set
> to 32, using RocksDB as the state backend and the S3 presto filesystem for
> checkpoints. The state size is nearly 15 GB and still grows day by day.
> Normally, it takes 1.5 mins to finish the whole checkpoint process. The
> timeout configuration is set to 10 mins.
>
> <chk_snapshot.png>
>
> As the picture shows, not every subtask of the checkpoint failed due to
> timeout, but every machine has at some point had all of its subtasks fail
> during last weekend. Some machines recovered by themselves and some recovered
> after I restarted them.
>
> I recorded logs, stack traces and snapshots of machine status (CPU, IO,
> Network, etc.) for both a good and a bad machine. If you need more
> information, please let me know. Thanks in advance.
>
> Best Regards,
> Tony Wei.
> <bad_tm_log.log><bad_tm_pic.png><bad_tm_stack.log><good_tm_log.log><good_tm_pic.png><good_tm_stack.log>

