Hi Xiaochuan,

>> What does that loop do exactly?

Most of what the run-loop does is documented at
https://samza.apache.org/learn/documentation/0.9/container/event-loop.html

>> We are running into a problem where it seems to take a very long time to restart a Samza job.

Some follow-up questions:

How long does it take? Have you measured which parts of the start-up sequence take the most time? Is it checkpoint restoration, or the restore of local state?

If reading from the checkpoint topic takes the most time, then I'd recommend reading that topic from the beginning and benchmarking how long it takes. It will also help to verify whether the checkpoint topic is actually log-compacted.

Do containers eventually start, or does the start-up hang? If it hangs, a thread dump will be useful.

Can you please link or attach the entire log file for us to take a look?

>> 3. Any ideas on how to fix this?

Perhaps we can try to narrow down where the time is spent during start-up from the logs. Depending on that, I can suggest a fix :-)

Thanks,
Jagadish

On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <xiaochuan...@kik.com> wrote:

> Hi,
>
> We are running into a problem where it seems to take a very long time to
> restart a Samza job.
> We are using Samza 0.9.1 at the moment.
>
> From the logs for a particular container it looks like it has something to
> do with reading checkpoints from Kafka:
>
> 2017-09-20 03:21:02.060 INFO o.a.s.c.kafka.KafkaCheckpointManager [main] -
> Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and
> partition 0. Attempting to fetch messages for checkpoint log.
> 2017-09-20 03:21:02.072 INFO o.a.s.c.kafka.KafkaCheckpointManager [main] -
> Get latest offset 42890599 for topic
> __samza_checkpoint_ver_1_for_test-job_1 and partition 0.
>
> Looking at this line in KafkaCheckpointManager
> <https://github.com/apache/samza/blob/0.9.1/samza-kafka/src/main/scala/org/apache/samza/checkpoint/kafka/KafkaCheckpointManager.scala#L275>,
> it seems to indicate that the loop iterates from 0 to 42890599 and make
> requests for each.
>
> Questions:
> 1. What does that loop do exactly?
> 2. Is this an expected behaviour? Is "Got offset 0 for topic ..." normal?
> 3. Any ideas on how to fix this?
>
> Thanks,
> Xiaochuan Yu
>

--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University
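
As a rough illustration of why the thread asks whether the checkpoint topic is log-compacted, here is a toy model in Python. It is not Samza's or Kafka's code. It assumes that checkpoints are recovered by reading every message between the oldest and the latest offset and keeping the last value seen per key. The message shape and the names `replay_checkpoints` and `compact` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical message shape: (offset, key, value). In a real checkpoint
# topic the key would identify a task and the value its checkpoint.
Message = Tuple[int, str, str]


def replay_checkpoints(messages: Iterable[Message]) -> Tuple[Dict[str, str], int]:
    """Return the latest value per key and the number of messages read."""
    latest: Dict[str, str] = {}
    read = 0
    for _offset, key, value in messages:
        latest[key] = value  # later messages overwrite earlier ones
        read += 1
    return latest, read


def compact(messages: Iterable[Message]) -> List[Message]:
    """Simulate log compaction: keep only the last message for each key."""
    last_by_key: Dict[str, Message] = {}
    for msg in messages:
        last_by_key[msg[1]] = msg
    return sorted(last_by_key.values())  # tuples sort by offset first


# Toy log: 3 tasks, each writing 1000 checkpoints.
log = [(i, f"task-{i % 3}", f"offset-{i}") for i in range(3000)]

full_state, full_reads = replay_checkpoints(log)
compacted_state, compacted_reads = replay_checkpoints(compact(log))
```

Both replays recover the same final state, but the uncompacted replay reads every message ever written, while the compacted one reads one message per key. Under this model, a gap between oldest offset 0 and latest offset 42890599 would mean tens of millions of reads at start-up if the topic had not been compacted.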