Hi! Done: https://issues.apache.org/jira/browse/KAFKA-10013
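For anyone reading this thread later, the report quoted below is about the `auto.offset.reset=none` contract, where offset-reset situations are supposed to surface to the application as exceptions instead of being handled silently. A minimal sketch of that contract (the bootstrap server, topic, group id, and the seek-to-beginning handling are just my own illustration, not from the ticket):

```
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.InvalidOffsetException;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetResetNoneSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");             // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // "none": never reset silently; surface the problem to client code instead.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));     // placeholder topic
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
                } catch (InvalidOffsetException e) {
                    // Covers NoOffsetForPartitionException and OffsetOutOfRangeException.
                    // With auto.offset.reset=none these should reach client code, which can
                    // then decide how to reposition; seeking to the beginning is just one choice.
                    consumer.seekToBeginning(e.partitions());
                }
            }
        }
    }
}
```

In the scenario described below, validation never completes and the truncation exception is swallowed internally, so a handler like the one above is never even reached.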
Fri, 15 May 2020 at 18:36, Sophie Blee-Goldman <sop...@confluent.io>:

> Hey Dmitry,
>
> Can you open a ticket at https://issues.apache.org/jira/issues/ and
> include all this information so we can track and look into it?
>
> Thanks!
> Sophie
>
> On Fri, May 15, 2020 at 2:26 AM Dmitry Sorokin <dmitry.soro...@gmail.com>
> wrote:
>
> > According to the documentation, if `auto.offset.reset` is set to none
> > or not set at all, an exception is thrown to the client code, which can
> > then handle it however it wants. A closer look at this mechanism shows
> > that it does not work.
> >
> > Starting with Kafka 2.3, a new offset reset negotiation algorithm was
> > added
> > (org.apache.kafka.clients.consumer.internals.Fetcher#validateOffsetsAsync).
> > During this validation, the Fetcher's
> > `org.apache.kafka.clients.consumer.internals.SubscriptionState` is held
> > in the `AWAIT_VALIDATION` fetch state.
> > This effectively means that fetch requests are not issued and
> > consumption stops.
> > If an unclean leader election happens during this time,
> > `LogTruncationException` is thrown from the future listener in
> > `validateOffsetsAsync`.
> > The main problem is that this exception (thrown from the future's
> > listener) is effectively swallowed by
> > `org.apache.kafka.clients.consumer.internals.AsyncClient#sendAsyncRequest`
> > in this part of the code:
> > ```
> > } catch (RuntimeException e) {
> >     if (!future.isDone()) {
> >         future.raise(e);
> >     }
> > }
> > ```
> >
> > The end result: the only way to get out of AWAIT_VALIDATION and resume
> > consumption is to finish the validation successfully, but it can never
> > finish. The consumer stays alive but consumes nothing. The only way to
> > resume consumption is to terminate the consumer and start another one.
> >
> > We discovered this situation with a Kafka Streams application, where
> > the valid `auto.offset.reset` value provided by our code is replaced by
> > `None` for the purpose of a position reset
> > (org.apache.kafka.streams.processor.internals.StreamThread#create).
> > With Kafka Streams it is even worse: the application may keep running,
> > logging warning messages of the form `Truncation detected for
> > partition ...`, but no data is produced for a long time and it is
> > eventually lost, making the Kafka Streams application unreliable.
> >
> > *Has anyone seen this already? Are there ways to reconfigure this
> > behavior? I checked the 2.3, 2.4, and trunk client code; the bug is
> > still there.*
> >
> > --
> > Dmitry Sorokin
> > mailto://dmitry.soro...@gmail.com
> >

--
Dmitry Sorokin
mailto://dmitry.soro...@gmail.com