Re: Samza Job Slow to Restart

Liu Bo Tue, 24 Oct 2017 21:51:23 -0700

met the same problem before and resolved with Yi's help, xD

On 20 October 2017 at 06:10, Yi Pan <[email protected]> wrote:


> Awesome that you have figured it out! Just a general notice: any logcompact
> topic used in Samza may see this slow-down if the Kafka log cleaner thread
> dies, which include checkpoint, coordinator stream, and changelog topics.
>
> Best!
>
> -Yi
>
> On Thu, Oct 19, 2017 at 12:14 PM, XiaoChuan Yu <[email protected]>
> wrote:
>
> > Hi,
> >
> > We were finally able to find out why the job takes so long to start.
> > There was higher than normal network IO during job startup and so we
> > checked size of the checkpoint topic on disk and it was ~21GB.
> > We then restarted the Kafka node who was the leader for the checkpoint
> > topic, the topic disk size went down to ~1.8GB and the job started up
> > fairly quickly.
> > Its probably due to a bug in Kafka where log cleaner died and we never
> > noticed: https://issues.apache.org/jira/browse/KAFKA-3894.
> > We have since been working on upgrading Kafka to avoid this bug.
> > Hope this helps if anyone else ever runs into it.
> >
> > Thanks,
> > Xiaochuan Yu
> >
> > On Sat, Sep 23, 2017 at 6:17 PM XiaoChuan Yu <[email protected]>
> wrote:
> >
> > > >> How long does it take?
> > > It took around 10 minute from "Got offset 0 for topic <checkpoint
> topic>
> > > ..." to init() being called on the Task.
> > >
> > > >> Have you measured which parts of the start up sequence take the most
> > > time?
> > > >> - is it checkpoint restoration, or restore of local state?
> > > Should be checkpoint restoration. There is no local state for this job.
> > >
> > > >> If reading from the checkpoint topic takes the most time, then I'd
> > > >> recommend reading from the beginning from that topic, and
> benchmarking
> > > how
> > > >> long it takes? It'll also help to verify if the checkpoint topic is
> > > >> actually log-compacted.
> > > I'm not sure how to verify how much the topic is compacted by Kafka.
> > > The cleanup policy is to compact though.
> > >
> > > >> Do containers eventually start? Or does the start-up hang?
> > > >> If so, a thread dump will be useful.
> > > It does eventually start up.
> > >
> > > >> Can you please link and attach the entire log file for us to take a
> > > look?
> > > Unfortunately there is too much stuff for me to redact from the log
> right
> > > now.
> > > However, I can tell you that the job has two input topics both with the
> > > following settings:
> > > systems.kafka.streams.my-special-topic.samza.reset.offset=true
> > > systems.kafka.streams.my-special-topic.samza.offset.default=upcoming
> > > It was thought that this would speedup startup of the job to no avail.
> > >
> > > On Wed, Sep 20, 2017 at 3:21 PM Jagadish Venkatraman <
> > > [email protected]> wrote:
> > >
> > >> Hi Xiaochuan,
> > >>
> > >> >> What does that loop do exactly?
> > >>
> > >> Most of what the run-loop does is documented in
> > >> https://samza.apache.org/learn/documentation/0.9/
> > container/event-loop.html
> > >>
> > >> >> We are running into a problem where it seems to take a very long
> time
> > >> to
> > >> restart a Samza job.
> > >>
> > >> Some follow-up questions,
> > >>
> > >> How long does it take?
> > >> Have you measured which parts of the start up sequence take the most
> > time?
> > >> - is it checkpoint restoration, or restore of local state?
> > >> If reading from the checkpoint topic takes the most time, then I'd
> > >> recommend reading from the beginning from that topic, and benchmarking
> > how
> > >> long it takes? It'll also help to verify if the checkpoint topic is
> > >> actually log-compacted.
> > >> Do containers eventually start? Or does the start-up hang? If so, a
> > thread
> > >> dump will be useful.
> > >> Can you please link and attach the entire log file for us to take a
> > look?
> > >>
> > >> >> 3. Any ideas on how to fix this?
> > >>
> > >> We can perhaps, try to narrow down where the time is spent in startup
> > from
> > >> the logs? Depending on that, I can suggest a fix :-)
> > >>
> > >> Thanks,
> > >> Jagadish
> > >>
> > >> On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > We are running into a problem where it seems to take a very long
> time
> > to
> > >> > restart a Samza job.
> > >> > We are using Samza 0.9.1 at the moment.
> > >> >
> > >> > From the logs for a particular container it looks like it has
> > something
> > >> to
> > >> > do with reading checkpoints from Kafka:
> > >> >
> > >> > 2017-09-20 03:21:02.060 INFO  o.a.s.c.kafka.KafkaCheckpointManager
> > >> [main]
> > >> > -
> > >> > Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and
> > >> > partition 0. Attempting to fetch messages for checkpoint log.
> > >> > 2017-09-20 03:21:02.072 INFO  o.a.s.c.kafka.KafkaCheckpointManager
> > >> [main]
> > >> > -
> > >> > Get latest offset 42890599 for topic
> > >> > __samza_checkpoint_ver_1_for_test-job_1 and partition 0.
> > >> >
> > >> > Looking at this line in KafkaCheckpointManager
> > >> > <https://github.com/apache/samza/blob/0.9.1/samza-kafka/
> > >> > src/main/scala/org/apache/samza/checkpoint/kafka/
> > >> > KafkaCheckpointManager.scala#L275>,
> > >> > it seems to indicate that the loop iterates from 0 to 42890599 and
> > make
> > >> > requests for each.
> > >> >
> > >> > Questions:
> > >> > 1. What does that loop do exactly?
> > >> > 2. Is this an expected behaviour? Is "Got offset 0 for topic ..."
> > >> normal?
> > >> > 3. Any ideas on how to fix this?
> > >> >
> > >> > Thanks,
> > >> > Xiaochuan Yu
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Jagadish V,
> > >> Graduate Student,
> > >> Department of Computer Science,
> > >> Stanford University
> > >>
> > >
> >
>



-- 
All the best

Liu Bo

Re: Samza Job Slow to Restart

Reply via email to