Hi Bekir,
I'll first comb through all the information here and try to find out the reason with you; I may need you to share some more information :)
Best,
Congxian

Bekir Oguz <bekir.o...@persgroep.net> wrote on Thu, Aug 1, 2019 at 5:00 PM:

> Hi Fabian,
> Thanks for sharing this with us, but we're already on version 1.8.1.
>
> What I don't understand is which mechanism in Flink adds 15 minutes to the
> checkpoint duration occasionally. Can you maybe give us some hints on where
> to look? Is there a default timeout of 15 minutes defined somewhere in
> Flink? I couldn't find one.
>
> In our pipeline, most of the checkpoints complete in less than a minute,
> and some of them complete in 15 minutes plus (less than a minute).
> There's definitely something which adds 15 minutes. This is happening in
> one or more subtasks during checkpointing.
>
> Please see the screenshot below:
>
> Regards,
> Bekir
>
> On 23 Jul 2019, at 16:37, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Bekir,
>>
>> Another user reported checkpointing issues with Flink 1.8.0 [1].
>> These seem to be resolved with Flink 1.8.1.
>>
>> Hope this helps,
>> Fabian
>>
>> [1]
>> https://lists.apache.org/thread.html/991fe3b09fd6a052ff52e5f7d9cdd9418545e68b02e23493097d9bc4@%3Cuser.flink.apache.org%3E
>>
>> On Wed, 17 Jul 2019 at 09:16, Congxian Qiu <qcx978132...@gmail.com> wrote:
>>
>>> Hi Bekir,
>>>
>>> First of all, I think there is something wrong: the state sizes are almost
>>> the same, but the durations differ so much.
>>>
>>> A checkpoint with the RocksDB state backend dumps the sst files, then
>>> copies the needed sst files (if you enable incremental checkpoints, the
>>> sst files already on the remote storage are not uploaded again), then
>>> completes the checkpoint. Can you check the network bandwidth usage
>>> during the checkpoint?
>>>
>>> Best,
>>> Congxian
>>>
>>> Bekir Oguz <bekir.o...@persgroep.net> wrote on Tue, Jul 16, 2019 at 10:45 PM:
>>>
>>>> Hi all,
>>>> We have a Flink job with user state, checkpointing to the RocksDB
>>>> backend, which is externally stored in AWS S3.
>>>> After we migrated our cluster from 1.6 to 1.8, we occasionally see that
>>>> some slots do not acknowledge the checkpoints quickly enough. As an
>>>> example: all slots acknowledge within 30-50 seconds except one slot,
>>>> which acknowledges in 15 minutes. Checkpoint sizes are similar to each
>>>> other, around 200-400 MB.
>>>>
>>>> We did not experience this weird behaviour in Flink 1.6. We have a
>>>> 5-minute checkpoint interval, and this happens sometimes once an hour,
>>>> sometimes more often, but not on every checkpoint request. Please see
>>>> the screenshot below.
>>>>
>>>> Another point: for the faulty slots, the duration is consistently 15
>>>> minutes and some seconds; we couldn't find out where this 15-minute
>>>> response time comes from. And each time it is a different task manager,
>>>> not always the same one.
>>>>
>>>> Are you aware of any other users having similar issues with the new
>>>> version, and is there a suggested bug fix or solution?
>>>>
>>>> Thanks in advance,
>>>> Bekir Oguz
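For reference, the settings discussed in the thread (the 5-minute checkpoint interval, the checkpoint timeout, and incremental RocksDB checkpoints to S3) are configured on the job roughly as in the minimal sketch below. It assumes the Flink 1.8 DataStream API; the class name, the S3 bucket path, the 20-minute timeout value, and the trivial source/sink are placeholders standing in for the real job. As far as I can tell, the default checkpoint timeout is 10 minutes rather than 15, so a consistent "15 minutes and some seconds" would not line up with the built-in timeout unless it had been raised explicitly.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 5-minute checkpoint interval, as described in the thread.
        env.enableCheckpointing(5 * 60 * 1000L);

        // Checkpoints that exceed this timeout are aborted as expired.
        // The default is 10 minutes; 20 minutes here is only illustrative.
        env.getCheckpointConfig().setCheckpointTimeout(20 * 60 * 1000L);

        // RocksDB state backend writing checkpoints to S3. The second
        // constructor argument enables incremental checkpoints, so sst
        // files already present on S3 are not uploaded again.
        // The bucket path is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("s3://<bucket>/checkpoints", true));

        // Trivial pipeline standing in for the real job with user state.
        env.fromElements("a", "b", "c").print();
        env.execute("checkpoint-config-sketch");
    }
}

Incremental checkpoints can also be enabled cluster-wide with state.backend.incremental: true in flink-conf.yaml instead of in the job code.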