>> How long does it take? It took around 10 minute from "Got offset 0 for topic <checkpoint topic> ... " to init() being called on the Task.
>> Have you measured which parts of the start up sequence take the most time? >> - is it checkpoint restoration, or restore of local state? Should be checkpoint restoration. There is no local state for this job. >> If reading from the checkpoint topic takes the most time, then I'd >> recommend reading from the beginning from that topic, and benchmarking how >> long it takes? It'll also help to verify if the checkpoint topic is >> actually log-compacted. I'm not sure how to verify how much the topic is compacted by Kafka. The cleanup policy is to compact though. >> Do containers eventually start? Or does the start-up hang? >> If so, a thread dump will be useful. It does eventually start up. >> Can you please link and attach the entire log file for us to take a look? Unfortunately there is too much stuff for me to redact from the log right now. However, I can tell you that the job has two input topics both with the following settings: systems.kafka.streams.my-special-topic.samza.reset.offset=true systems.kafka.streams.my-special-topic.samza.offset.default=upcoming It was thought that this would speedup startup of the job to no avail. On Wed, Sep 20, 2017 at 3:21 PM Jagadish Venkatraman <jagadish1...@gmail.com> wrote: > Hi Xiaochuan, > > >> What does that loop do exactly? > > Most of what the run-loop does is documented in > https://samza.apache.org/learn/documentation/0.9/container/event-loop.html > > >> We are running into a problem where it seems to take a very long time to > restart a Samza job. > > Some follow-up questions, > > How long does it take? > Have you measured which parts of the start up sequence take the most time? > - is it checkpoint restoration, or restore of local state? > If reading from the checkpoint topic takes the most time, then I'd > recommend reading from the beginning from that topic, and benchmarking how > long it takes? It'll also help to verify if the checkpoint topic is > actually log-compacted. > Do containers eventually start? Or does the start-up hang? If so, a thread > dump will be useful. > Can you please link and attach the entire log file for us to take a look? > > >> 3. Any ideas on how to fix this? > > We can perhaps, try to narrow down where the time is spent in startup from > the logs? Depending on that, I can suggest a fix :-) > > Thanks, > Jagadish > > On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <xiaochuan...@kik.com> > wrote: > > > Hi, > > > > We are running into a problem where it seems to take a very long time to > > restart a Samza job. > > We are using Samza 0.9.1 at the moment. > > > > From the logs for a particular container it looks like it has something > to > > do with reading checkpoints from Kafka: > > > > 2017-09-20 03:21:02.060 INFO o.a.s.c.kafka.KafkaCheckpointManager [main] > > - > > Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and > > partition 0. Attempting to fetch messages for checkpoint log. > > 2017-09-20 03:21:02.072 INFO o.a.s.c.kafka.KafkaCheckpointManager [main] > > - > > Get latest offset 42890599 for topic > > __samza_checkpoint_ver_1_for_test-job_1 and partition 0. > > > > Looking at this line in KafkaCheckpointManager > > <https://github.com/apache/samza/blob/0.9.1/samza-kafka/ > > src/main/scala/org/apache/samza/checkpoint/kafka/ > > KafkaCheckpointManager.scala#L275>, > > it seems to indicate that the loop iterates from 0 to 42890599 and make > > requests for each. > > > > Questions: > > 1. What does that loop do exactly? > > 2. Is this an expected behaviour? Is "Got offset 0 for topic ..." normal? > > 3. Any ideas on how to fix this? > > > > Thanks, > > Xiaochuan Yu > > > > > > -- > Jagadish V, > Graduate Student, > Department of Computer Science, > Stanford University >