Hi Chris,

I am not sure where I mentioned "automatic partition reassignment". What I
do know is that increasing "max.poll.interval.ms" has a side effect: if the
consumer hangs for whatever reason, the pipeline will be blocked by the
group coordinator for up to "max.poll.interval.ms". So I am not sure whether
this is caused by automatic partition assignment or something else. What I
am looking for is how to deal with long-running jobs in Apache Kafka.
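
For what it's worth, the pattern I keep coming back to is pause/resume:
hand the slow record to a worker thread, pause the partition, and keep
calling poll() so the poll deadline is still met while the job runs. The
sketch below shows only the control flow; ToyConsumer is a hypothetical
in-memory stand-in so it runs without a broker (the real KafkaConsumer's
pause()/resume() take a collection of TopicPartitions, which this elides).

```python
# Sketch of the pause/resume pattern for long-running jobs. ToyConsumer is
# a toy stand-in for KafkaConsumer so the flow is runnable without a broker;
# with the real client you would pause the record's partitions, keep polling
# (which returns no records while paused but keeps the group membership
# alive), and resume once the job finishes.
import threading
import queue

class ToyConsumer:
    """Minimal stand-in mimicking the poll/pause/resume surface."""
    def __init__(self, records):
        self._records = list(records)
        self.paused = False

    def poll(self):
        # A paused consumer returns no records but still counts as polling.
        if self.paused or not self._records:
            return []
        return [self._records.pop(0)]

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

def slow_job(record, done):
    # Stands in for a long-running job; in production this could take minutes.
    done.put(record * 2)

def run(consumer):
    results = []
    while True:
        records = consumer.poll()
        if not records and not consumer.paused:
            break  # nothing left to process
        for record in records:
            done = queue.Queue()
            worker = threading.Thread(target=slow_job, args=(record, done))
            worker.start()
            consumer.pause()          # stop fetching more records
            while worker.is_alive():
                consumer.poll()       # empty while paused, but avoids a rebalance
            worker.join()
            results.append(done.get())
            consumer.resume()         # safe to fetch (and later commit) again
    return results
```

The point is that the slow work never happens between two poll() calls, so
"max.poll.interval.ms" can stay at its default instead of being inflated to
cover the worst-case job duration.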

Thanks,
Ali

On Sat, May 9, 2020 at 4:25 AM Chris Toomey <ctoo...@gmail.com> wrote:

> I interpreted your post as saying "when our consumer gets stuck, Kafka's
> automatic partition reassignment kicks in and that's problematic for us."
> Hence I suggested not using the automatic partition assignment, which per
> my interpretation would address your issue.
>
> Chris
>
> On Fri, May 8, 2020 at 2:19 AM Ali Nazemian <alinazem...@gmail.com> wrote:
>
> > Thanks, Chris. So what is causing the consumer to get stuck is a side
> > effect of the built-in partition assignment in Kafka, and by overriding
> > that behaviour I should be able to address the long-running job issue,
> > is that right? Can you please elaborate more on this?
> >
> > Regards,
> > Ali
> >
> > On Fri, May 8, 2020 at 1:09 PM Chris Toomey <ctoo...@gmail.com> wrote:
> >
> > > You really have to decide what behavior you want when one of your
> > > consumers gets "stuck". If you don't like the way the group protocol
> > > dynamically manages topic partition assignments, or can't figure out an
> > > appropriate set of configuration settings that achieve your goal, you
> > > can always elect not to use the group protocol and instead manage topic
> > > partition assignment yourself. As I just replied to another post,
> > > there's a nice writeup of this under "Manual Partition Assignment" in
> > > https://kafka.apache.org/24/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
> > >
> > > Chris
> > >
> > >
> > > On Thu, May 7, 2020 at 12:37 AM Ali Nazemian <alinazem...@gmail.com>
> > > wrote:
> > >
> > > > To help you understand my case in more detail: the error I see
> > > > constantly is the consumer losing its heartbeat, and hence the group
> > > > apparently gets rebalanced, based on the log I can see on the Kafka
> > > > side:
> > > >
> > > > GroupCoordinator 11]: Member
> > > > consumer-3-f46e14b4-5998-4083-b7ec-bed4e3f374eb in group foo has
> > > > failed, removing it from the group
> > > >
> > > > Thanks,
> > > > Ali
> > > >
> > > > On Thu, May 7, 2020 at 2:38 PM Ali Nazemian <alinazem...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > With the emergence of Apache Kafka in event-driven architectures,
> > > > > one thing that has become important is how to tune the Kafka
> > > > > consumer to manage long-running jobs. The main issue arises when
> > > > > we set a relatively large value for "max.poll.interval.ms".
> > > > > Setting this value will, of course, resolve the issue of
> > > > > repetitive rebalancing, but it creates another operational issue.
> > > > > I am looking for some sort of golden strategy for dealing with
> > > > > long-running jobs with Apache Kafka.
> > > > >
> > > > > If the consumer hangs for whatever reason, there is no easy way
> > > > > of getting past that stage. It can easily block the pipeline, and
> > > > > you cannot do much about it. Therefore, it occurred to me that I
> > > > > am probably missing something here. What are the expectations? Is
> > > > > it not valid to use Apache Kafka for long-lived jobs? Are there
> > > > > any other parameters that need to be set, and is the issue of a
> > > > > consumer being stuck caused by misconfiguration?
> > > > >
> > > > > I can see a lot of the same issues have been raised regarding
> > > > > "the consumer is stuck", and usually the answer has been "yeah,
> > > > > that's because you have a long-running job, etc.". I have seen
> > > > > different suggestions:
> > > > >
> > > > > - Avoid long-running jobs. Read the message, submit it to another
> > > > > thread, and let the consumer move on. Obviously this can cause
> > > > > data loss, and it would be a difficult problem to handle. It might
> > > > > be better to avoid using Kafka in the first place for these types
> > > > > of requests.
> > > > >
> > > > > - Avoid using Apache Kafka for long-running requests.
> > > > >
> > > > > - Workaround-based approaches, e.g. if the consumer is blocked,
> > > > > try using another consumer group and set the offset to the current
> > > > > value for the new consumer group, etc.
> > > > >
> > > > > There might be other suggestions I have missed here, but that is
> > > > > not the point of this email. What I am looking for is the best
> > > > > practice for dealing with long-running jobs with Apache Kafka. I
> > > > > cannot easily avoid using Kafka because it plays a critical part
> > > > > in our application and data pipeline. On the other side, we have
> > > > > had many challenges keeping the long-running jobs operationally
> > > > > stable. So I would appreciate it if someone could help me
> > > > > understand what approach can be taken to deal with these jobs
> > > > > with Apache Kafka as a message broker.
> > > > >
> > > > > Thanks,
> > > > > Ali
> > > > >
> > > >
> > > >
> > > > --
> > > > A.Nazemian
> > > >
> > >
> >
> >
> > --
> > A.Nazemian
> >
>


-- 
A.Nazemian
