met the same problem before and resolved with Yi's help, xD On 20 October 2017 at 06:10, Yi Pan <nickpa...@gmail.com> wrote:
> Awesome that you have figured it out! Just a general notice: any logcompact > topic used in Samza may see this slow-down if the Kafka log cleaner thread > dies, which include checkpoint, coordinator stream, and changelog topics. > > Best! > > -Yi > > On Thu, Oct 19, 2017 at 12:14 PM, XiaoChuan Yu <xiaochuan...@kik.com> > wrote: > > > Hi, > > > > We were finally able to find out why the job takes so long to start. > > There was higher than normal network IO during job startup and so we > > checked size of the checkpoint topic on disk and it was ~21GB. > > We then restarted the Kafka node who was the leader for the checkpoint > > topic, the topic disk size went down to ~1.8GB and the job started up > > fairly quickly. > > Its probably due to a bug in Kafka where log cleaner died and we never > > noticed: https://issues.apache.org/jira/browse/KAFKA-3894. > > We have since been working on upgrading Kafka to avoid this bug. > > Hope this helps if anyone else ever runs into it. > > > > Thanks, > > Xiaochuan Yu > > > > On Sat, Sep 23, 2017 at 6:17 PM XiaoChuan Yu <xiaochuan...@kik.com> > wrote: > > > > > >> How long does it take? > > > It took around 10 minute from "Got offset 0 for topic <checkpoint > topic> > > > ..." to init() being called on the Task. > > > > > > >> Have you measured which parts of the start up sequence take the most > > > time? > > > >> - is it checkpoint restoration, or restore of local state? > > > Should be checkpoint restoration. There is no local state for this job. > > > > > > >> If reading from the checkpoint topic takes the most time, then I'd > > > >> recommend reading from the beginning from that topic, and > benchmarking > > > how > > > >> long it takes? It'll also help to verify if the checkpoint topic is > > > >> actually log-compacted. > > > I'm not sure how to verify how much the topic is compacted by Kafka. > > > The cleanup policy is to compact though. > > > > > > >> Do containers eventually start? Or does the start-up hang? > > > >> If so, a thread dump will be useful. > > > It does eventually start up. > > > > > > >> Can you please link and attach the entire log file for us to take a > > > look? > > > Unfortunately there is too much stuff for me to redact from the log > right > > > now. > > > However, I can tell you that the job has two input topics both with the > > > following settings: > > > systems.kafka.streams.my-special-topic.samza.reset.offset=true > > > systems.kafka.streams.my-special-topic.samza.offset.default=upcoming > > > It was thought that this would speedup startup of the job to no avail. > > > > > > On Wed, Sep 20, 2017 at 3:21 PM Jagadish Venkatraman < > > > jagadish1...@gmail.com> wrote: > > > > > >> Hi Xiaochuan, > > >> > > >> >> What does that loop do exactly? > > >> > > >> Most of what the run-loop does is documented in > > >> https://samza.apache.org/learn/documentation/0.9/ > > container/event-loop.html > > >> > > >> >> We are running into a problem where it seems to take a very long > time > > >> to > > >> restart a Samza job. > > >> > > >> Some follow-up questions, > > >> > > >> How long does it take? > > >> Have you measured which parts of the start up sequence take the most > > time? > > >> - is it checkpoint restoration, or restore of local state? > > >> If reading from the checkpoint topic takes the most time, then I'd > > >> recommend reading from the beginning from that topic, and benchmarking > > how > > >> long it takes? It'll also help to verify if the checkpoint topic is > > >> actually log-compacted. > > >> Do containers eventually start? Or does the start-up hang? If so, a > > thread > > >> dump will be useful. > > >> Can you please link and attach the entire log file for us to take a > > look? > > >> > > >> >> 3. Any ideas on how to fix this? > > >> > > >> We can perhaps, try to narrow down where the time is spent in startup > > from > > >> the logs? Depending on that, I can suggest a fix :-) > > >> > > >> Thanks, > > >> Jagadish > > >> > > >> On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <xiaochuan...@kik.com> > > >> wrote: > > >> > > >> > Hi, > > >> > > > >> > We are running into a problem where it seems to take a very long > time > > to > > >> > restart a Samza job. > > >> > We are using Samza 0.9.1 at the moment. > > >> > > > >> > From the logs for a particular container it looks like it has > > something > > >> to > > >> > do with reading checkpoints from Kafka: > > >> > > > >> > 2017-09-20 03:21:02.060 INFO o.a.s.c.kafka.KafkaCheckpointManager > > >> [main] > > >> > - > > >> > Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and > > >> > partition 0. Attempting to fetch messages for checkpoint log. > > >> > 2017-09-20 03:21:02.072 INFO o.a.s.c.kafka.KafkaCheckpointManager > > >> [main] > > >> > - > > >> > Get latest offset 42890599 for topic > > >> > __samza_checkpoint_ver_1_for_test-job_1 and partition 0. > > >> > > > >> > Looking at this line in KafkaCheckpointManager > > >> > <https://github.com/apache/samza/blob/0.9.1/samza-kafka/ > > >> > src/main/scala/org/apache/samza/checkpoint/kafka/ > > >> > KafkaCheckpointManager.scala#L275>, > > >> > it seems to indicate that the loop iterates from 0 to 42890599 and > > make > > >> > requests for each. > > >> > > > >> > Questions: > > >> > 1. What does that loop do exactly? > > >> > 2. Is this an expected behaviour? Is "Got offset 0 for topic ..." > > >> normal? > > >> > 3. Any ideas on how to fix this? > > >> > > > >> > Thanks, > > >> > Xiaochuan Yu > > >> > > > >> > > >> > > >> > > >> -- > > >> Jagadish V, > > >> Graduate Student, > > >> Department of Computer Science, > > >> Stanford University > > >> > > > > > > -- All the best Liu Bo