Hi Bekir,
I'll first comb through all the information here and try to find out the reason with you; I may need you to share some more information :)
Best,
Congxian

Bekir Oguz <bekir.o...@persgroep.net> wrote on Thu, Aug 1, 2019 at 5:00 PM:

> Hi Fabian,
> Thanks for sharing this with us, but we're already on version 1.8.1.
>
> What I don't understand is which mechanism in Flink adds 15 minutes to the
> checkpoint duration occasionally. Can you maybe give us some hints on where
> to look? Is there a default timeout of 15 minutes defined somewhere in
> Flink? I couldn't find one.
>
> In our pipeline, most of the checkpoints complete in less than a minute,
> and some of them complete in 15 minutes plus (less than a minute).
> There's definitely something which adds 15 minutes. This is happening in
> one or more subtasks during checkpointing.
>
> Please see the screenshot below:
>
> Regards,
> Bekir
>
> On 23 Jul 2019, at 16:37, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Bekir,
>>
>> Another user reported checkpointing issues with Flink 1.8.0 [1].
>> These seem to be resolved with Flink 1.8.1.
>>
>> Hope this helps,
>> Fabian
>>
>> [1]
>> https://lists.apache.org/thread.html/991fe3b09fd6a052ff52e5f7d9cdd9418545e68b02e23493097d9bc4@%3Cuser.flink.apache.org%3E
>>
>> On Wed, 17 Jul 2019 at 09:16, Congxian Qiu <qcx978132...@gmail.com> wrote:
>>
>>> Hi Bekir,
>>>
>>> First of all, I think there is something wrong: the state sizes are almost
>>> the same, but the durations differ so much.
>>>
>>> A checkpoint with the RocksDB state backend dumps the sst files, then
>>> copies the needed sst files (if you enable incremental checkpoints, the
>>> sst files already on the remote storage are not uploaded again), then
>>> completes the checkpoint. Can you check the network bandwidth usage
>>> during the checkpoint?
>>>
>>> Best,
>>> Congxian
>>>
>>> Bekir Oguz <bekir.o...@persgroep.net> wrote on Tue, Jul 16, 2019 at 10:45 PM:
>>>
>>>> Hi all,
>>>> We have a Flink job with user state, checkpointing to the RocksDB
>>>> backend, which is externally stored in AWS S3.
>>>> After we migrated our cluster from 1.6 to 1.8, we occasionally see that
>>>> some slots do not acknowledge the checkpoints quickly enough. As an
>>>> example: all slots acknowledge within 30-50 seconds except one slot,
>>>> which acknowledges in 15 minutes. Checkpoint sizes are similar to each
>>>> other, around 200-400 MB.
>>>>
>>>> We did not experience this weird behaviour in Flink 1.6. We have a
>>>> 5-minute checkpoint interval, and this happens sometimes once an hour,
>>>> sometimes more often, but not on every checkpoint request. Please see
>>>> the screenshot below.
>>>>
>>>> Another point: for the faulty slots, the duration is consistently 15
>>>> minutes and some seconds; we couldn't find out where this 15-minute
>>>> response time comes from. And each time it is a different task manager,
>>>> not always the same one.
>>>>
>>>> Are you aware of any other users having similar issues with the new
>>>> version, and is there a suggested bug fix or solution?
>>>>
>>>> Thanks in advance,
>>>> Bekir Oguz
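For reference, the settings discussed in the thread (the 5-minute checkpoint interval, the checkpoint timeout, and incremental RocksDB checkpoints to S3) are configured on the job roughly as in the minimal sketch below. It assumes the Flink 1.8 DataStream API; the class name, the S3 bucket path, the 20-minute timeout value, and the trivial source/sink are placeholders standing in for the real job. As far as I can tell, the default checkpoint timeout is 10 minutes rather than 15, so a consistent "15 minutes and some seconds" would not line up with the built-in timeout unless it had been raised explicitly.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 5-minute checkpoint interval, as described in the thread.
        env.enableCheckpointing(5 * 60 * 1000L);

        // Checkpoints that exceed this timeout are aborted as expired.
        // The default is 10 minutes; 20 minutes here is only illustrative.
        env.getCheckpointConfig().setCheckpointTimeout(20 * 60 * 1000L);

        // RocksDB state backend writing checkpoints to S3. The second
        // constructor argument enables incremental checkpoints, so sst
        // files already present on S3 are not uploaded again.
        // The bucket path is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("s3://<bucket>/checkpoints", true));

        // Trivial pipeline standing in for the real job with user state.
        env.fromElements("a", "b", "c").print();
        env.execute("checkpoint-config-sketch");
    }
}

Incremental checkpoints can also be enabled cluster-wide with state.backend.incremental: true in flink-conf.yaml instead of in the job code.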