Re: Samza Job Slow to Restart

XiaoChuan Yu Sat, 23 Sep 2017 15:18:18 -0700

>> How long does it take?
It took around 10 minutes from "Got offset 0 for topic <checkpoint topic> ..."
to init() being called on the Task.


>> Have you measured which parts of the start up sequence take the most
>> time?
>> - is it checkpoint restoration, or restore of local state?
Should be checkpoint restoration. There is no local state for this job.
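
For reference, the log lines quoted further down suggest that checkpoint restoration replays the checkpoint topic from the earliest to the latest offset and keeps only the last checkpoint seen per task. If so, the cost grows with the number of messages still retained in the topic. A simplified Python sketch of that idea follows; it is not the actual KafkaCheckpointManager code, and the record layout, task names, and function name are assumptions made for illustration:

```python
def restore_checkpoints(log):
    """Replay (offset, task_name, checkpoint) records in offset order,
    keeping only the last checkpoint seen for each task."""
    latest = {}
    messages_read = 0
    for offset, task_name, checkpoint in log:
        # A later record for the same task overwrites the earlier one.
        latest[task_name] = checkpoint
        messages_read += 1
    return latest, messages_read


# Hypothetical checkpoint log: two tasks, three records in total.
log = [
    (0, "Partition 0", {"my-special-topic": 10}),
    (1, "Partition 1", {"my-special-topic": 7}),
    (2, "Partition 0", {"my-special-topic": 25}),
]

latest, messages_read = restore_checkpoints(log)
print(latest)         # last checkpoint per task
print(messages_read)  # every retained record was read once
```

Every retained record is read even though only the last one per task matters, which is why an uncompacted checkpoint topic would make startup slow.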

>> If reading from the checkpoint topic takes the most time, then I'd
>> recommend reading from the beginning from that topic, and benchmarking
>> how long it takes? It'll also help to verify if the checkpoint topic is
>> actually log-compacted.
I'm not sure how to verify how well Kafka has compacted the topic.
Its cleanup policy is set to compact, though.
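
One rough way to check would be to read the topic from the beginning and compare the number of messages actually returned with the number of distinct keys and with the offset range. Kafka's log compaction leaves the offsets of surviving messages unchanged, so a well-compacted topic should return far fewer messages than the offset range suggests, and a message count close to the number of distinct keys. A Python sketch, assuming the (offset, key) pairs have already been collected from a consumer; the function name and sample data are illustrative only:

```python
def compaction_summary(records):
    """Summarise how sparse a topic is, given (offset, key) pairs read from it."""
    offsets = [offset for offset, _ in records]
    keys = {key for _, key in records}
    # Number of offsets between the first and last retained message, inclusive.
    span = max(offsets) - min(offsets) + 1 if offsets else 0
    return {
        "messages": len(records),
        "distinct_keys": len(keys),
        "offset_span": span,
        # Fraction of the offset range that still holds a message.
        "retained_fraction": len(records) / span if span else 0.0,
    }


# Hypothetical sample: three surviving messages spread over offsets 3..9.
records = [(3, "task-0"), (8, "task-1"), (9, "task-0")]
summary = compaction_summary(records)
print(summary)
```

A retained fraction near 1.0 over a range of tens of millions of offsets would suggest that compaction is not actually removing old checkpoints.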

>> Do containers eventually start? Or does the start-up hang?
>> If so, a thread dump will be useful.
It does eventually start up.

>> Can you please link and attach the entire log file for us to take a look?
Unfortunately there is too much stuff for me to redact from the log right
now.
However, I can tell you that the job has two input topics both with the
following settings:
systems.kafka.streams.my-special-topic.samza.reset.offset=true
systems.kafka.streams.my-special-topic.samza.offset.default=upcoming
We thought this would speed up job startup, but to no avail.

On Wed, Sep 20, 2017 at 3:21 PM Jagadish Venkatraman <[email protected]>
wrote:

> Hi Xiaochuan,
>
> >> What does that loop do exactly?
>
> Most of what the run-loop does is documented in
> https://samza.apache.org/learn/documentation/0.9/container/event-loop.html
>
> >> We are running into a problem where it seems to take a very long time to
> >> restart a Samza job.
>
> Some follow-up questions,
>
> How long does it take?
> Have you measured which parts of the start up sequence take the most time?
> - is it checkpoint restoration, or restore of local state?
> If reading from the checkpoint topic takes the most time, then I'd
> recommend reading from the beginning from that topic, and benchmarking how
> long it takes? It'll also help to verify if the checkpoint topic is
> actually log-compacted.
> Do containers eventually start? Or does the start-up hang? If so, a thread
> dump will be useful.
> Can you please link and attach the entire log file for us to take a look?
>
> >> 3. Any ideas on how to fix this?
>
> Perhaps we can try to narrow down where the time is spent in startup from
> the logs. Depending on that, I can suggest a fix :-)
>
> Thanks,
> Jagadish
>
> On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <[email protected]>
> wrote:
>
> > Hi,
> >
> > We are running into a problem where it seems to take a very long time to
> > restart a Samza job.
> > We are using Samza 0.9.1 at the moment.
> >
> > From the logs for a particular container it looks like it has something
> > to do with reading checkpoints from Kafka:
> >
> > 2017-09-20 03:21:02.060 INFO  o.a.s.c.kafka.KafkaCheckpointManager [main] - Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and partition 0. Attempting to fetch messages for checkpoint log.
> > 2017-09-20 03:21:02.072 INFO  o.a.s.c.kafka.KafkaCheckpointManager [main] - Get latest offset 42890599 for topic __samza_checkpoint_ver_1_for_test-job_1 and partition 0.
> >
> > Looking at this line in KafkaCheckpointManager
> > <https://github.com/apache/samza/blob/0.9.1/samza-kafka/src/main/scala/org/apache/samza/checkpoint/kafka/KafkaCheckpointManager.scala#L275>,
> > it seems to indicate that the loop iterates from 0 to 42890599 and makes
> > a request for each offset.
> >
> > Questions:
> > 1. What does that loop do exactly?
> > 2. Is this an expected behaviour? Is "Got offset 0 for topic ..." normal?
> > 3. Any ideas on how to fix this?
> >
> > Thanks,
> > Xiaochuan Yu
> >
>
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>
