Thanks, Chris. So the consumer getting stuck is a side effect of Kafka's built-in partition assignment, and by overriding that behaviour I should be able to address the long-running job issue. Is that right? Could you please elaborate on this?
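For context, my understanding of manual assignment is that each worker pins itself to a fixed set of partitions instead of joining the group, so a stalled worker never triggers a rebalance, but nothing takes over its partitions either. Below is a minimal sketch of the bookkeeping this seems to imply. The function name `partitions_for_worker`, the round-robin rule, and the worker and partition counts are all illustrative assumptions, not anything from Kafka or from this thread.

```python
from typing import List


def partitions_for_worker(worker_index: int, worker_count: int, partition_count: int) -> List[int]:
    """Return the partition ids one worker owns under a static round-robin split.

    Hypothetical helper: the names and the round-robin rule are assumptions
    made for illustration; Kafka does not provide this function.
    """
    if worker_count <= 0:
        raise ValueError("worker_count must be positive")
    if not 0 <= worker_index < worker_count:
        raise ValueError("worker_index must be in [0, worker_count)")
    if partition_count < 0:
        raise ValueError("partition_count must not be negative")
    # Partition p goes to worker p % worker_count, so every partition
    # has exactly one owner and no coordination is needed at runtime.
    return [p for p in range(partition_count) if p % worker_count == worker_index]


if __name__ == "__main__":
    # Example: 6 partitions split across 3 workers (numbers chosen arbitrarily).
    for worker in range(3):
        print(worker, partitions_for_worker(worker, 3, 6))
```

With the Java client, each worker would presumably wrap its ids in `TopicPartition` objects and pass them to `KafkaConsumer.assign(...)` rather than calling `subscribe(...)`. The trade-off, as far as I can tell, is that detecting and replacing a hung worker then becomes the application's job.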
Regards,
Ali

On Fri, May 8, 2020 at 1:09 PM Chris Toomey <ctoo...@gmail.com> wrote:

> You really have to decide what behavior it is you want when one of your
> consumers gets "stuck". If you don't like the way the group protocol
> dynamically manages topic partition assignments or can't figure out an
> appropriate set of configuration settings that achieve your goal, you can
> always elect to not use the group protocol and instead manage topic
> partition assignment yourself. As I just replied to another post, there's a
> nice writeup of this under "Manual Partition Assignment" in
> https://kafka.apache.org/24/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
> .
>
> Chris
>
> On Thu, May 7, 2020 at 12:37 AM Ali Nazemian <alinazem...@gmail.com> wrote:
>
> > To help explain my case in more detail: the error I see constantly is
> > the consumer missing heartbeats, and hence the group apparently gets
> > rebalanced, based on the log I can see on the Kafka side:
> >
> > GroupCoordinator 11]: Member consumer-3-f46e14b4-5998-4083-b7ec-bed4e3f374eb in group foo has failed, removing it from the group
> >
> > Thanks,
> > Ali
> >
> > On Thu, May 7, 2020 at 2:38 PM Ali Nazemian <alinazem...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > With the rise of Apache Kafka for event-driven architecture, one
> > > thing that has become important is how to tune the Apache Kafka
> > > consumer to handle long-running jobs. The main issue arises when we
> > > set a relatively large value for "max.poll.interval.ms". Setting this
> > > value will, of course, resolve the issue of repeated rebalances, but
> > > it creates another operational issue. I am looking for some sort of
> > > golden strategy to deal with long-running jobs with Apache Kafka.
> > >
> > > If the consumer hangs for whatever reason, there is no easy way of
> > > getting past that stage. It can easily block the pipeline, and you
> > > cannot do much about it. Therefore, it occurred to me that I am
> > > probably missing something here. What are the expectations? Is it not
> > > valid to use Apache Kafka for long-lived jobs? Are there other
> > > parameters that need to be set, and is the issue of a consumer being
> > > stuck caused by misconfiguration?
> > >
> > > I can see that many similar issues have been raised regarding "the
> > > consumer is stuck", and usually the answer has been "yeah, that's
> > > because you have a long-running job, etc.". I have seen different
> > > suggestions:
> > >
> > > - Avoid using long-running jobs. Read the message, submit it to
> > > another thread, and let the consumer move on. Obviously this can cause
> > > data loss, and it would be a difficult problem to handle. It might be
> > > better to avoid using Kafka in the first place for these types of
> > > requests.
> > >
> > > - Avoid using Apache Kafka for long-running requests.
> > >
> > > - Workaround-based approaches: for example, if the consumer is
> > > blocked, try to use another consumer group and set the offset to the
> > > current value for the new consumer group, etc.
> > >
> > > There might be other suggestions I have missed here, but that is not
> > > the point of this email. What I am looking for is the best practice
> > > for dealing with long-running jobs with Apache Kafka. I cannot easily
> > > avoid using Kafka because it plays a critical part in our application
> > > and data pipeline. On the other hand, we have had many challenges
> > > keeping the long-running jobs operationally stable. So I would
> > > appreciate it if someone could help me understand what approach can be
> > > taken to deal with these jobs with Apache Kafka as a message broker.
> > >
> > > Thanks,
> > > Ali
> >
> > --
> > A.Nazemian

--
A.Nazemian