Ok, this makes sense. I'm guessing loading state from S3 into RocksDB is a
large contributor to start delay then.

Thanks!

On Tue, Jan 19, 2021 at 12:16 PM Piotr Nowojski <pnowoj...@apache.org>
wrote:

> Hi Rex,
>
> start delay is not the same as the alignment time. Start delay is the time
> between creation of the checkpoint barrier and the time a task/subtask sees
> a first checkpoint barrier from any of its inputs. Alignment time is the
> time between receiving the first checkpoint barrier on a given subtask and
> the last one. In other words,
>
> start of the checkpoint TS (on JobManager) + start delay on subtask =
> start of the checkpoint TS (on TaskManager)
> start of the checkpoint TS (on TaskManager) + alignment time on subtask =
> end of the checkpoint TS (on TaskManager)
>
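
To make the two relations quoted above concrete, here is a minimal sketch in Python. It is not Flink code: the function name, the field names and the timestamps are made up for illustration. It assumes only what the definitions say, namely that start delay runs from the checkpoint trigger to the first barrier a subtask sees, and alignment time from that first barrier to the last one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubtaskTimings:
    start_delay_ms: int  # checkpoint trigger -> first barrier seen by the subtask
    alignment_ms: int    # first barrier -> last barrier on the same subtask


def checkpoint_timings(trigger_ts_ms: int, barrier_arrivals_ms: List[int]) -> SubtaskTimings:
    """Derive start delay and alignment time for one subtask.

    trigger_ts_ms: hypothetical time the checkpoint was triggered on the JobManager.
    barrier_arrivals_ms: hypothetical barrier arrival time on each input of the subtask.
    """
    if not barrier_arrivals_ms:
        raise ValueError("a subtask needs at least one input")
    first = min(barrier_arrivals_ms)
    last = max(barrier_arrivals_ms)
    return SubtaskTimings(start_delay_ms=first - trigger_ts_ms,
                          alignment_ms=last - first)


# Illustrative numbers only: three inputs, one of them lagging far behind.
timings = checkpoint_timings(trigger_ts_ms=1_000,
                             barrier_arrivals_ms=[4_000, 4_250, 9_000])
print(timings)
```

In this made-up case the subtask waits 3 s before it sees any barrier (start delay) and another 5 s for the slowest input (alignment time), so the two numbers measure different stretches of the same checkpoint.
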
> Maybe something in your job has to ramp up, and record throughput is lower
> during this time, causing higher back pressure, which in turn causes a
> longer checkpointing time for the first checkpoint after recovery. Maybe
> RocksDB needs to load its state from disk.
>
> Piotrek
>
> On Tue, 19 Jan 2021 at 20:11, Rex Fenley <r...@remind101.com> wrote:
>
>> Thanks for the input.
>>
>> This seems odd though. If start delay is the same as alignment, then (1)
>> why is it only ever prominent right after recovering from a
>> checkpoint? (2) Why does the first checkpoint during the recovery process
>> take 10x as long as every other checkpoint? Something else must be going
>> on in addition to the normal alignment process.
>>
>> On Tue, Jan 19, 2021 at 8:14 AM Piotr Nowojski <pnowoj...@apache.org>
>> wrote:
>>
>>> Hey Rex,
>>>
>>> What do you mean by "Start Delay" when recovering from a checkpoint? Did
>>> you mean when taking a checkpoint? If so:
>>>
>>> 1. https://www.google.com/search?q=flink+checkpoint+start+delay
>>> 2. top 3 result (at least for me)
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
>>> > Start Delay: The time it took for the first checkpoint barrier to
>>> reach this subtasks since the checkpoint barrier has been created.
>>>
>>> 3. https://www.google.com/search?q=flink+checkpoint+barrier
>>> 4. top 2 result (at least for me)
>>> https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html#barriers
>>> > A core element in Flink’s distributed snapshotting are the stream
>>> barriers. These barriers are injected into the data stream and flow with
>>> the records as part of the data stream.
>>>
>>> Long start delay or alignment time means checkpoint barriers are
>>> propagating slowly through the job graph, which is usually a symptom of
>>> back-pressure. It's best to solve the back-pressure problem by optimising
>>> your job or scaling it up.
>>>
>>> Alternatively, you could use unaligned checkpoints [1], at the cost of
>>> larger checkpoint size and higher IO usage. Note that if you are using
>>> Flink 1.12.x, I would refrain from using unaligned checkpoints in
>>> production because of some bugs [2] that we are fixing right now. On Flink
>>> 1.11.x it should be fine.
>>>
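
As a rough illustration of the trade-off in the two paragraphs quoted above, here is a toy model in Python. It is not Flink code, and the function names and all numbers are invented. It assumes that with aligned checkpoints a barrier travels in-band and has to wait behind records already buffered ahead of it, while an unaligned checkpoint lets the barrier overtake those records and persists them as part of the checkpoint instead.

```python
def aligned_barrier_wait_s(buffered_records: int, records_per_s: float) -> float:
    """Seconds a barrier waits behind records already buffered ahead of it."""
    if records_per_s <= 0:
        raise ValueError("throughput must be positive")
    return buffered_records / records_per_s


def unaligned_extra_bytes(buffered_records: int, avg_record_bytes: int) -> int:
    """In-flight data an unaligned checkpoint would persist instead of waiting."""
    return buffered_records * avg_record_bytes


backlog = 50_000  # hypothetical number of records queued ahead of the barrier

steady = aligned_barrier_wait_s(backlog, records_per_s=5_000)  # normal throughput
ramp_up = aligned_barrier_wait_s(backlog, records_per_s=500)   # slow phase after recovery
extra = unaligned_extra_bytes(backlog, avg_record_bytes=200)

print(steady, ramp_up, extra)
```

With the same backlog, a tenfold drop in throughput gives a tenfold longer wait in this model. That matches the shape of the slowdown described for the first checkpoint after recovery, though the real cause in a given job still has to be confirmed from its metrics.
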
>>> Cheers,
>>> Piotrek
>>>
>>> [1]
>>> https://flink.apache.org/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html
>>> [2] https://issues.apache.org/jira/browse/FLINK-20654
>>>
>>>
>>>
>>> On Mon, 18 Jan 2021 at 21:32, Rex Fenley <r...@remind101.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> When we recover from a checkpoint, it takes multiple minutes.
>>>> Most of that time is taken by "Start Delay". What is Start Delay, and how
>>>> can we optimize for it?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>>
>>>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>>>
>>>>
>>>> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
>>>>  |  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
>>>> <https://www.facebook.com/remindhq>
>>>>
>>>
>>
>

-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
 |  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
<https://www.facebook.com/remindhq>
