Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-05-04 Thread Stefan Richter
Cool, that is good news! Thanks for sharing this information with us. Best, Stefan > On 04.05.2018 at 12:27, Tony Wei wrote: > > have switched to local SSDs and enabled the incremental checkpoint mechanism as well. Our job has run healthily for more than two weeks.

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-05-04 Thread Tony Wei
heck, when the state is huge this is CPU costly. Let me try to explain the full checkpoint a bit more; it contains two parts. Part 1. Take a snapshot of RocksDB. (This can map to the "Checkpoint D

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-09 Thread Tony Wei
"). So part 2 could be CPU costly and network costly; if the CPU load is too high, then sending data will slow down, because they are in a single loop. If CPU is the reason, this phenomenon will disappear if you use incremental chec

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-09 Thread Stefan Richter
is the reason, this phenomenon will disappear if you use incremental checkpoints, because they almost only send data to S3. All in all, for now trying out the incremental checkpoint is the best thing to do, I think. Best Regards, Sihua Zhou (Sent from NetEase Mail Master)

Re: Fwd: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
14:45, Tony Wei wrote: > Sent to the wrong mailing list. Forwarding it to the correct one. > -- Forwarded message -- > From: Tony Wei > Date: 2018-03-06 14:43 GMT+08:00 > Subject: Re: checkpoint stuck with rocksdb statebackend and s3 filesystem > To: 周

Fwd: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
Sent to the wrong mailing list. Forwarding it to the correct one. -- Forwarded message -- From: Tony Wei Date: 2018-03-06 14:43 GMT+08:00 Subject: Re: checkpoint stuck with rocksdb statebackend and s3 filesystem To: 周思华, Stefan Richter Cc: "user-subscr...@flink.apache.org"

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread 周思华
Hi Tony, About your question: an average end-to-end checkpoint latency of less than 1.5 mins doesn't mean that the checkpoint won't time out. Indeed, it is determined by the max end-to-end latency (the slowest one); a checkpoint is truly completed only after all tasks' checkpoints have completed.
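
To make the point above concrete, here is a minimal sketch (not Flink code) of why a low average says little about timeouts. The function name `checkpoint_outcome` and the per-task latencies are hypothetical and chosen only for illustration; the 600 s timeout corresponds to the 10-minute timeout reported in this thread.

```python
from typing import Dict, Sequence, Union


def checkpoint_outcome(
    task_latencies_s: Sequence[float], timeout_s: float
) -> Dict[str, Union[float, bool]]:
    """Summarize one checkpoint from per-task end-to-end latencies (seconds).

    The checkpoint is only complete once every task has acknowledged,
    so the slowest task decides whether the timeout is exceeded.
    """
    if not task_latencies_s:
        raise ValueError("at least one task latency is required")
    average = sum(task_latencies_s) / len(task_latencies_s)
    slowest = max(task_latencies_s)
    return {
        "average_s": average,
        "slowest_s": slowest,
        "timed_out": slowest > timeout_s,
    }


# Hypothetical numbers: 29 tasks finish in 60 s, one straggler needs 11 minutes.
latencies = [60.0] * 29 + [660.0]
result = checkpoint_outcome(latencies, timeout_s=600.0)
print(result)
```

With these made-up numbers the average stays below 1.5 minutes, yet the checkpoint still exceeds the timeout because of the single slow task.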

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
Hi Sihua, Thanks for your suggestion. "Incremental checkpoint" is what I will try out next, and I know it will give better performance. However, it might not solve this issue completely, because as I said, the average end-to-end latency of checkpointing is less than 1.5 mins currently, and it is

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread 周思华
Hi Tony, Sorry for jumping in. One thing I want to point out is that from the log you provided it looks like you are using "full checkpoint"; this means that the state data that needs to be checkpointed and transferred to S3 will grow over time, and even for the first checkpoint its performance is sl

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
Hi Stefan, I see. That explains why the load on the machines went up. However, I think it is not the root cause that led to these consecutive checkpoint timeouts. As I said in my first mail, the checkpointing process usually took 1.5 mins to upload states, and this operator and the Kafka consumer are o

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Stefan Richter
Hi, thanks for all the info. I had a look into the problem and opened https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your stack trace, you can see many checkpointing threads are running on your TM for checkpoints that have

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Tony Wei
Hi Stefan, Here is my checkpointing configuration.
Checkpointing Mode: Exactly Once
Interval: 20m 0s
Timeout: 10m 0s
Minimum Pause Between Checkpoints: 0ms
Maximum Concurrent Checkpoints: 1
Persist Checkpoints Externally: Enabled (delete on cancellation)
Best Regards, Tony Wei 2018-03-05 21:30 GMT+08:

Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

2018-03-05 Thread Stefan Richter
Hi, quick question: what is your exact checkpointing configuration? In particular, what is your value for the maximum parallel checkpoints and the minimum time interval to wait between two checkpoints? Best, Stefan > On 05.03.2018 at 06:34, Tony Wei wrote: > > Hi all, > > Last weekend, my