Hi Bekir

First of all, I think something is wrong here: the state sizes are almost
the same, but the durations differ so much.

A checkpoint with the RocksDBStateBackend first dumps the SST files, then
copies the needed SST files to remote storage (with incremental checkpoints
enabled, SST files that are already on the remote store are not uploaded
again), and then completes the checkpoint. Can you check the network
bandwidth usage during the checkpoint?
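
For reference, a minimal sketch of how the RocksDB backend with incremental
checkpoints is typically configured (the S3 bucket path, class name, and the
5-minute interval below are placeholders, not your actual settings):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class IncrementalCheckpointExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // checkpoint every 5 minutes
            env.enableCheckpointing(5 * 60 * 1000);

            // the second constructor argument enables incremental checkpoints:
            // SST files already present on the remote store (S3) are not
            // uploaded again, only newly created SST files are copied
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("s3://<your-bucket>/flink-checkpoints", true);
            env.setStateBackend(backend);

            // ... build and execute the rest of the job here ...
        }
    }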

Best,
Congxian


Bekir Oguz <bekir.o...@persgroep.net> wrote on Tue, Jul 16, 2019 at 10:45 PM:

> Hi all,
> We have a flink job with user state, checkpointing to RocksDBBackend which
> is externally stored in AWS S3.
> After migrating our cluster from 1.6 to 1.8, we occasionally see that some
> slots do not acknowledge the checkpoints quickly enough. As an example: all
> slots acknowledge within 30-50 seconds, except one slot that acknowledges in
> 15 minutes. Checkpoint sizes are similar to each other, around 200-400 MB.
>
> We did not experience this weird behaviour in Flink 1.6. We have a 5-minute
> checkpoint interval, and this happens sometimes once an hour, sometimes more
> often, but not in all checkpoint requests. Please see the screenshot below.
>
> Another point: for the faulty slots, the duration is consistently 15 minutes
> and some seconds; we could not figure out where this 15-minute response time
> comes from. And each time it is a different task manager, not always the same
> one.
>
> Are you aware of any other users having similar issues with the new version,
> and is there a suggested bug fix or solution?
>
> Thanks in advance,
> Bekir Oguz
>
