Hi Bekir,

Another user reported checkpointing issues with Flink 1.8.0 [1]. These seem to be resolved with Flink 1.8.1.
Hope this helps,
Fabian

[1] https://lists.apache.org/thread.html/991fe3b09fd6a052ff52e5f7d9cdd9418545e68b02e23493097d9bc4@%3Cuser.flink.apache.org%3E

On Wed, 17 Jul 2019 at 09:16, Congxian Qiu <qcx978132...@gmail.com> wrote:

> Hi Bekir
>
> First of all, I think something is wrong here: the state sizes are almost
> the same, yet the durations differ so much.
>
> A checkpoint with the RocksDB state backend dumps the SST files, then
> copies the needed SST files to the remote storage (with incremental
> checkpoints enabled, SST files that are already on the remote storage are
> not uploaded again), and then completes the checkpoint. Can you check the
> network bandwidth usage during the checkpoint?
>
> Best,
> Congxian
>
>
> On Tue, 16 Jul 2019 at 22:45, Bekir Oguz <bekir.o...@persgroep.net> wrote:
>
>> Hi all,
>> We have a Flink job with user state, checkpointing to the RocksDB backend,
>> which is externally stored in AWS S3.
>> After we migrated our cluster from 1.6 to 1.8, we occasionally see that
>> some slots do not acknowledge the checkpoints quickly enough. As an
>> example: all slots acknowledge within 30-50 seconds except one slot,
>> which acknowledges in 15 minutes. Checkpoint sizes are similar to each
>> other, around 200-400 MB.
>>
>> We did not experience this strange behaviour in Flink 1.6. We have a
>> 5-minute checkpoint interval, and this happens sometimes once an hour,
>> sometimes more often, but not on every checkpoint request. Please see the
>> screenshot below.
>>
>> Another point: for the faulty slots, the duration is consistently 15
>> minutes and some seconds; we could not find out where this 15-minute
>> response time comes from. And each time it is a different task manager,
>> not always the same one.
>>
>> Are you aware of any other users having similar issues with the new
>> version, and is there a suggested bug fix or workaround?
>>
>>
>>
>>
>> Thanks in advance,
>> Bekir Oguz
>>
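For reference, a minimal sketch of the kind of setup described in this thread: a job with incremental RocksDB checkpoints written to S3 and a 5-minute checkpoint interval. The S3 path, class name, and job name are placeholders, not values from the original thread.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 5 minutes, matching the interval mentioned above.
        env.enableCheckpointing(5 * 60 * 1000);

        // RocksDB state backend with incremental checkpoints enabled (second argument):
        // only SST files not yet present on the remote storage are uploaded.
        // The S3 bucket/path below is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink-checkpoints", true));

        // ... job topology with user state goes here ...

        env.execute("checkpointing-example");
    }
}
```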