I'm not sure off the top of my head, but since the method is on the consumer, my intuition is that it would pause all the partitions the consumer is reading from. I think the best thing to do is write a little test-harness app to verify the behavior.
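For reference, the Java client's `pause(Collection<TopicPartition>)` takes an explicit partition set, so "pause everything" is spelled `consumer.pause(consumer.assignment())`, and a subset works just as well. A harness along the lines below could check that semantics; the class here is an in-memory stand-in (no broker involved), so treat it as a sketch of the behavior, not the real client:

```python
class FakeConsumer:
    """In-memory stand-in mimicking KafkaConsumer's pause/resume semantics."""

    def __init__(self, assignment):
        self._assignment = set(assignment)      # e.g. {("jobs", 0), ("jobs", 1)}
        self._paused = set()
        self._queues = {tp: [] for tp in self._assignment}

    def assignment(self):
        return set(self._assignment)

    def pause(self, partitions):
        # Like the real client, pause() takes an explicit partition set.
        self._paused |= set(partitions) & self._assignment

    def resume(self, partitions):
        self._paused -= set(partitions)

    def paused(self):
        return set(self._paused)

    def feed(self, tp, record):
        # Test hook: enqueue a record on a partition.
        self._queues[tp].append(record)

    def poll(self):
        # Paused partitions contribute nothing; the rest drain their queues.
        out = {}
        for tp, queue in self._queues.items():
            if tp not in self._paused and queue:
                out[tp] = list(queue)
                queue.clear()
        return out
```

Pausing `consumer.assignment()` makes every subsequent `poll()` come back empty until `resume()` is called, however many partitions the consumer owns, which is the over-partitioning case Ali asks about below.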
On Mon, May 11, 2020 at 7:31 AM Ali Nazemian <alinazem...@gmail.com> wrote:

> Hi Jason,
>
> Thank you for the message. It seems quite interesting. Something I am not
> sure about with "pause" and "resume" is that they work based on partition
> allocation. What will happen if more partitions are assigned to a single
> consumer? For example, in the case where we have over-partitioned a Kafka
> topic to reserve some capacity for scaling up in case of burst traffic.
>
> Regards,
> Ali
>
> On Sat, May 9, 2020 at 11:52 PM Jason Turim <ja...@signalvine.com> wrote:
>
> > Hi Ali,
> >
> > You may want to look at using the consumer pause/resume API. It's a
> > mechanism that allows you to poll without retrieving new messages.
> >
> > I employed this strategy to handle highly variable workloads by
> > processing them in a background thread. First, pause the consumer when
> > messages are received. Then start a background thread to process the
> > data; meanwhile the primary thread continues to poll (it won't retrieve
> > data while the consumer is paused). Finally, when the background
> > processing is complete, call resume on the consumer to retrieve the
> > next batch of messages.
> >
> > More info in the "Detecting Consumer Failures" section here:
> >
> > https://kafka.apache.org/22/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
> >
> > hth,
> > jt
> >
> > On Fri, May 8, 2020, 10:50 PM Chris Toomey <ctoo...@gmail.com> wrote:
> >
> > > What exactly is your understanding of what's happening when you say
> > > "the pipeline will be blocked by the group coordinator for up to
> > > max.poll.interval.ms"? Please explain that.
> > >
> > > There's no universal recipe for "long-running jobs"; there are just
> > > particular issues you might be encountering and suggested solutions
> > > to those issues.
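The pause / process-in-background / resume loop Jason describes above can be sketched as follows. The stub consumer and `process_batch` callback are stand-ins so the loop runs without a broker; with the real client you would pass `consumer.assignment()` to `pause()` and `resume()` the same way, and the empty polls issued while paused are what keep the consumer inside `max.poll.interval.ms`:

```python
import threading

class StubConsumer:
    """Just enough of a consumer to drive the loop below without a broker."""
    def __init__(self):
        self._paused = False
        self.polls_while_paused = 0
    def assignment(self):
        return {("jobs", 0)}
    def pause(self, partitions):
        self._paused = True
    def resume(self, partitions):
        self._paused = False
    def poll(self):
        if self._paused:
            self.polls_while_paused += 1
        return {}  # while paused, no records come back

def handle_batch(consumer, records, process_batch):
    """One iteration of the pattern: pause, process in background, keep polling."""
    consumer.pause(consumer.assignment())       # stop fetching new records
    done = threading.Event()

    def worker():
        process_batch(records)                  # the long-running job
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    while not done.is_set():
        consumer.poll()                         # empty while paused, but counts
        done.wait(0.01)                         #   as progress for the group
    consumer.resume(consumer.assignment())      # ready for the next batch
```

The key property is that `poll()` keeps being called on the primary thread for the whole duration of the job, so the group coordinator never sees the consumer as stalled.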
> > > On Fri, May 8, 2020 at 7:03 PM Ali Nazemian <alinazem...@gmail.com> wrote:
> > >
> > > > Hi Chris,
> > > >
> > > > I am not sure where I said anything about "automatic partition
> > > > reassignment", but what I do know is that the side effect of
> > > > increasing "max.poll.interval.ms" is that if the consumer hangs for
> > > > whatever reason, the pipeline will be blocked by the group
> > > > coordinator for up to "max.poll.interval.ms". So I am not sure
> > > > whether this is because of the automatic partition assignment or
> > > > something else. What I am looking for is how I can deal with
> > > > long-running jobs in Apache Kafka.
> > > >
> > > > Thanks,
> > > > Ali
> > > >
> > > > On Sat, May 9, 2020 at 4:25 AM Chris Toomey <ctoo...@gmail.com> wrote:
> > > >
> > > > > I interpreted your post as saying "when our consumer gets stuck,
> > > > > Kafka's automatic partition reassignment kicks in and that's
> > > > > problematic for us." Hence I suggested not using the automatic
> > > > > partition assignment, which per my interpretation would address
> > > > > your issue.
> > > > >
> > > > > Chris
> > > > >
> > > > > On Fri, May 8, 2020 at 2:19 AM Ali Nazemian <alinazem...@gmail.com> wrote:
> > > > >
> > > > > > Thanks, Chris. So what is causing the consumer to get stuck is
> > > > > > a side effect of the built-in partition assignment in Kafka,
> > > > > > and by overriding that behaviour I should be able to address
> > > > > > the long-running job issue, is that right? Can you please
> > > > > > elaborate more on this?
> > > > > >
> > > > > > Regards,
> > > > > > Ali
> > > > > >
> > > > > > On Fri, May 8, 2020 at 1:09 PM Chris Toomey <ctoo...@gmail.com> wrote:
> > > > > >
> > > > > > > You really have to decide what behavior you want when one of
> > > > > > > your consumers gets "stuck".
> > > > > > > If you don't like the way the group protocol dynamically
> > > > > > > manages topic partition assignments, or can't figure out an
> > > > > > > appropriate set of configuration settings that achieve your
> > > > > > > goal, you can always elect not to use the group protocol and
> > > > > > > instead manage topic partition assignment yourself. As I just
> > > > > > > replied to another post, there's a nice writeup of this under
> > > > > > > "Manual Partition Assignment" in
> > > > > > > https://kafka.apache.org/24/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Thu, May 7, 2020 at 12:37 AM Ali Nazemian <alinazem...@gmail.com> wrote:
> > > > > > >
> > > > > > > > To help you understand my case in more detail, the error I
> > > > > > > > see constantly is the consumer losing its heartbeat, after
> > > > > > > > which the group apparently gets rebalanced, based on the
> > > > > > > > log I can see on the Kafka side:
> > > > > > > >
> > > > > > > > [GroupCoordinator 11]: Member
> > > > > > > > consumer-3-f46e14b4-5998-4083-b7ec-bed4e3f374eb in group
> > > > > > > > foo has failed, removing it from the group
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Ali
> > > > > > > >
> > > > > > > > On Thu, May 7, 2020 at 2:38 PM Ali Nazemian <alinazem...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > With the emergence of Apache Kafka for event-driven
> > > > > > > > > architecture, one thing that has become important is how
> > > > > > > > > to tune the Apache Kafka consumer to manage long-running
> > > > > > > > > jobs.
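As an illustration of the manual-assignment route Chris points to: with `assign()` instead of `subscribe()` there is no group rebalance to trigger, so each worker just needs a deterministic, disjoint slice of the partitions. The static round-robin split below is one assumed scheme, not the javadoc's example:

```python
def partitions_for_worker(worker_index, num_workers, all_partitions):
    """Static round-robin split: disjoint per worker, covering all partitions."""
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index out of range")
    return [p for i, p in enumerate(sorted(all_partitions))
            if i % num_workers == worker_index]

# With the real Java client, each worker would then pin its slice via
# consumer.assign(...) with TopicPartition objects for its partition numbers,
# so a stuck worker delays only its own partitions -- no rebalance occurs.
```

The trade-off, as the javadoc notes, is that failure detection and partition redistribution become your responsibility instead of the coordinator's.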
> > > > > > > > > The main issue arises when we set a relatively large
> > > > > > > > > value for "max.poll.interval.ms". Setting this value
> > > > > > > > > will, of course, resolve the issue of repetitive
> > > > > > > > > rebalances, but it creates another operational issue. I
> > > > > > > > > am looking for some sort of golden strategy to deal with
> > > > > > > > > long-running jobs with Apache Kafka.
> > > > > > > > >
> > > > > > > > > If the consumer hangs for whatever reason, there is no
> > > > > > > > > easy way of passing that stage. It can easily block the
> > > > > > > > > pipeline, and you cannot do much about it. Therefore, it
> > > > > > > > > came to my mind that I am probably missing something
> > > > > > > > > here. What are the expectations? Is it not valid to use
> > > > > > > > > Apache Kafka for long-lived jobs? Are there any other
> > > > > > > > > parameters that need to be set, and is the issue of a
> > > > > > > > > consumer being stuck caused by misconfiguration?
> > > > > > > > >
> > > > > > > > > I can see a lot of the same issues have been raised
> > > > > > > > > regarding "the consumer is stuck", and usually the
> > > > > > > > > answer has been "yeah, that's because you have a
> > > > > > > > > long-running job, etc.". I have seen different
> > > > > > > > > suggestions:
> > > > > > > > >
> > > > > > > > > - Avoid using long-running jobs. Read the message,
> > > > > > > > > submit it to another thread, and let the consumer move
> > > > > > > > > on. Obviously this can cause data loss, and it would be
> > > > > > > > > a difficult problem to handle. It might be better to
> > > > > > > > > avoid using Kafka in the first place for these types of
> > > > > > > > > requests.
> > > > > > > > > - Avoid using Apache Kafka for long-running requests.
> > > > > > > > >
> > > > > > > > > - Workaround-based approaches: if the consumer is
> > > > > > > > > blocked, try to use another consumer group and set the
> > > > > > > > > offset to the current value for the new consumer group,
> > > > > > > > > etc.
> > > > > > > > >
> > > > > > > > > There might be other suggestions I have missed here, but
> > > > > > > > > that is not the point of this email. What I am looking
> > > > > > > > > for is the best practice for dealing with long-running
> > > > > > > > > jobs with Apache Kafka. I cannot easily avoid using
> > > > > > > > > Kafka because it plays a critical part in our
> > > > > > > > > application and data pipeline. On the other side, we
> > > > > > > > > have had so many challenges keeping the long-running
> > > > > > > > > jobs stable operationally. So I would appreciate it if
> > > > > > > > > someone could help me understand what approach can be
> > > > > > > > > taken to deal with these jobs with Apache Kafka as a
> > > > > > > > > message broker.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Ali

-- 
Jason Turim (he, him & his)
Vice President of Software Engineering
SignalVine Inc <http://www.signalvine.com>
(m) 415-407-6501
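One way to make the tuning question in the original post concrete: between two `poll()` calls the consumer must get through an entire batch, so `max.poll.interval.ms` has to exceed `max.poll.records` times the worst-case per-record processing time. Shrinking `max.poll.records` is therefore often a gentler lever than inflating the interval. A sketch of that arithmetic (the numbers and safety factor are illustrative assumptions, not Kafka defaults):

```python
def required_poll_interval_ms(max_poll_records, worst_case_ms_per_record,
                              safety_factor=2.0):
    """Lower bound to configure for max.poll.interval.ms, with headroom."""
    return int(max_poll_records * worst_case_ms_per_record * safety_factor)

# A 500-record batch at 100 ms/record needs a 100 s poll interval (with 2x
# headroom), while capping the batch at 10 records brings that down to 2 s.
```

So a consumer that must occasionally block for minutes per record is better served by a small batch size plus the pause/resume pattern above than by a poll interval large enough to mask a genuinely hung consumer.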