Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Ali Nazemian Wed, 06 May 2020 21:39:00 -0700

Hi,

With the growing use of Apache Kafka for event-driven architectures, one
thing that has become important is how to tune the Kafka consumer to
manage long-running jobs. The main issue arises when we set a relatively
large value for "max.poll.interval.ms". Setting this value will, of course,
resolve the issue of repeated rebalances, but it creates another
operational issue. I am looking for some sort of golden strategy to deal
with long-running jobs with Apache Kafka.
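
For concreteness, here is a minimal sketch of the kind of consumer settings
in question. The property names are standard Kafka consumer configs; the
class name, the method, and every value are illustrative assumptions, not
recommendations. It uses only java.util.Properties, so it shows the shape of
the configuration rather than a working consumer.

```java
import java.util.Properties;

public class LongJobConsumerConfig {
    // Builds consumer properties for a consumer whose jobs may run for
    // up to maxJobMillis. All values below are illustrative assumptions.
    public static Properties build(long maxJobMillis) {
        Properties props = new Properties();
        // Fetch one record per poll() so the poll interval only has to
        // cover a single job.
        props.setProperty("max.poll.records", "1");
        // Allow the longest expected job plus a one-minute margin between
        // two poll() calls before the consumer is removed from the group.
        props.setProperty("max.poll.interval.ms",
                String.valueOf(maxJobMillis + 60_000L));
        // Heartbeats are sent from a background thread, so the session
        // timeout can stay short and still detect a dead process quickly.
        props.setProperty("session.timeout.ms", "30000");
        props.setProperty("heartbeat.interval.ms", "10000");
        // Commit offsets manually, after a job has finished.
        props.setProperty("enable.auto.commit", "false");
        return props;
    }
}
```

With these settings a crashed process is noticed within the session
timeout, but a process that is alive and stuck inside a job is only
noticed once "max.poll.interval.ms" has elapsed.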


If the consumer hangs for whatever reason, there is no easy way of getting
past that stage. It can easily block the pipeline, and you cannot do much
about it. Therefore, it occurred to me that I am probably missing something
here. What are the expectations? Is it not valid to use Apache Kafka for
long-lived jobs? Are there any other parameters that need to be set, and is
the issue of a consumer being stuck caused by misconfiguration?
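
One application-level way to keep a hung job from blocking forever is to
run each job under a deadline. The sketch below is only an assumption about
how that could look. It is not a Kafka feature, the names BoundedJob and
Outcome are hypothetical, and it only works if the job responds to thread
interruption. What to do after a timeout (skip, retry, park the message
elsewhere) is left to the caller.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedJob {
    public enum Outcome { DONE, FAILED, TIMED_OUT }

    // Runs one job on a worker thread and waits at most timeoutMillis.
    public static Outcome run(Runnable job, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<?> result = worker.submit(job);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.DONE;
        } catch (TimeoutException e) {
            // Interrupts the worker thread; this only stops the job if
            // the job itself honours interruption.
            result.cancel(true);
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            // The job threw an exception.
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            // The calling thread was interrupted while waiting.
            result.cancel(true);
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

The timeout would need to be shorter than "max.poll.interval.ms" so that
the consumer thread gets back to poll() before the group evicts it.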

I can see that many similar issues have been raised regarding "the
consumer is stuck", and usually the answer has been "yeah, that's
because you have a long-running job, etc.". I have seen different
suggestions:

- Avoid running long jobs on the consumer thread. Read the message, hand it
off to another thread, and let the consumer move on. Obviously this can
cause data loss, which would be a difficult problem to handle. It might be
better to avoid using Kafka in the first place for these types of requests.

- Avoid using Apache Kafka for long-running requests.

- Workaround-based approaches: for example, if the consumer is blocked,
switch to another consumer group and set the offset for the new group to
the current value, etc.

There might be other suggestions I have missed here, but that is not the
point of this email. What I am looking for is the best practice for
dealing with long-running jobs with Apache Kafka. I cannot easily avoid
using Kafka because it plays a critical part in our application and data
pipeline. On the other hand, we have had many challenges keeping the
long-running jobs operationally stable. So I would appreciate it if someone
could help me understand what approach can be taken to deal with these
jobs with Apache Kafka as a message broker.

Thanks,
Ali
