Ok, this makes sense. I'm guessing loading state from S3 into RocksDB is a large contributor to start delay then.
Thanks! On Tue, Jan 19, 2021 at 12:16 PM Piotr Nowojski <pnowoj...@apache.org> wrote: > Hi Rex, > > start delay is not the same as the alignment time. Start delay is the time > between creation of the checkpoint barrier and the time a task/subtask sees > a first checkpoint barrier from any of its inputs. Alignment time is the > time between receiving the first checkpoint barrier on a given subtask and > the last one. In other words, > > start of the checkpoint TS (on JobManager) + start delay on subtask = > start of the checkpoint TS (on TaskManager) > start of the checkpoint TS (on TaskManager) + alignment time on subtask = > end of the checkpoint TS (on TaskManager) > > Maybe something in your job must ramp up and record throughput is slower > during this time, causing higher back pressure, which in turns is causing > longer checkpointing time for the first checkpoint after recovery. Maybe > RocksDB is needs to load it's state from disks. > > Piotrek > > wt., 19 sty 2021 o 20:11 Rex Fenley <r...@remind101.com> napisał(a): > >> Thanks for the input. >> >> This seems odd though, if start delay is the same as alignment then (1) >> why is it only ever prominent when right after recovering from a >> checkpoint? (2) Why is the first checkpoint during the recovery process 10x >> as long as every other checkpoint? Something else must be going on that's >> in addition to the normal alignment process. >> >> On Tue, Jan 19, 2021 at 8:14 AM Piotr Nowojski <pnowoj...@apache.org> >> wrote: >> >>> Hey Rex, >>> >>> What do you mean by "Start Delay" when recovering from a checkpoint? Did >>> you mean when taking a checkpoint? If so: >>> >>> 1. https://www.google.com/search?q=flink+checkpoint+start+delay >>> 2. top 3 result (at least for me) >>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html >>> > Start Delay: The time it took for the first checkpoint barrier to >>> reach this subtasks since the checkpoint barrier has been created. >>> >>> 3. https://www.google.com/search?q=flink+checkpoint+barrier >>> 4. top 2 result (at least for me) >>> https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html#barriers >>> > A core element in Flink’s distributed snapshotting are the stream >>> barriers. These barriers are injected into the data stream and flow with >>> the records as part of the data stream. >>> >>> Long start delay or alignment time means checkpoint barriers are >>> propagating slowly through the job graph, usually a symptom of a >>> back-pressure. It's best to solve the back-pressure problem, via optimising >>> your job or scaling it up. >>> >>> Alternatively you could use unaligned checkpoints [1], at a cost of >>> larger checkpoint size and higher IO usage. Note here that if you are using >>> Flink 1.12.x, I would refrain from using unaligned checkpoints on the >>> production because of some bugs [2] that we are fixing right now. On Flink >>> 1.11.x it should be fine. >>> >>> Cheers, >>> Piotrek >>> >>> [1] >>> https://flink.apache.org/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html >>> [2] https://issues.apache.org/jira/browse/FLINK-20654 >>> >>> >>> >>> pon., 18 sty 2021 o 21:32 Rex Fenley <r...@remind101.com> napisał(a): >>> >>>> Hello, >>>> >>>> When we are recovering on a checkpoint it will take multiple minutes. >>>> The time is usually taken by "Start Delay". What is Start Delay and how can >>>> we optimize for it? >>>> >>>> Thanks! >>>> >>>> -- >>>> >>>> Rex Fenley | Software Engineer - Mobile and Backend >>>> >>>> >>>> Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> >>>> | FOLLOW US <https://twitter.com/remindhq> | LIKE US >>>> <https://www.facebook.com/remindhq> >>>> >>> >> >> -- >> >> Rex Fenley | Software Engineer - Mobile and Backend >> >> >> Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> >> | FOLLOW US <https://twitter.com/remindhq> | LIKE US >> <https://www.facebook.com/remindhq> >> > -- Rex Fenley | Software Engineer - Mobile and Backend Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> | FOLLOW US <https://twitter.com/remindhq> | LIKE US <https://www.facebook.com/remindhq>