Hey Dmitry,

Can you open a ticket at https://issues.apache.org/jira/issues/ and include all this information so we can track and look into it?
Thanks!
Sophie

On Fri, May 15, 2020 at 2:26 AM Dmitry Sorokin <dmitry.soro...@gmail.com> wrote:

> According to the documentation, if `auto.offset.reset` is set to `none` or
> not set at all, an exception is thrown to the client code, allowing the
> client to handle it however it wants.
> On closer inspection, this mechanism turns out not to work.
>
> Starting from Kafka 2.3, a new offset reset negotiation algorithm was added
> (org.apache.kafka.clients.consumer.internals.Fetcher#validateOffsetsAsync).
> During this validation, the Fetcher's
> `org.apache.kafka.clients.consumer.internals.SubscriptionState` is held in
> the `AWAIT_VALIDATION` fetch state.
> This effectively means that fetch requests are not issued and consumption
> stops.
> If an unclean leader election happens during this time, a
> `LogTruncationException` is thrown from the future listener in
> `validateOffsetsAsync`.
> The main problem is that this exception (thrown from the listener of the
> future) is effectively swallowed by
> `org.apache.kafka.clients.consumer.internals.AsyncClient#sendAsyncRequest`
> in this part of the code:
> ```
> } catch (RuntimeException e) {
>     if (!future.isDone()) {
>         future.raise(e);
>     }
> }
> ```
>
> The end result: the only way to get out of AWAIT_VALIDATION and resume
> consumption is to finish the validation successfully, but it can never
> finish. The consumer stays alive, yet consumes nothing. The only way to
> resume consumption is to terminate the consumer and start another one.
>
> We discovered this situation with a Kafka Streams application, where the
> valid value of `auto.offset.reset` provided by our code is replaced with
> `None` for the purpose of position reset
> (org.apache.kafka.streams.processor.internals.StreamThread#create).
> With Kafka Streams it is even worse: the application appears to keep
> working, logging warning messages of the form
> `Truncation detected for partition ...,` but no data is produced for a long
> time and it is eventually lost, making the Kafka Streams application
> unreliable.
>
> *Has anyone seen this already? Are there ways to reconfigure this
> behavior? I checked the 2.3, 2.4, and trunk clients - the bug is still
> there.*
>
> --
> Dmitry Sorokin
> mailto://dmitry.soro...@gmail.com
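
For reference, below is a minimal sketch of the handling pattern the documentation implies for `auto.offset.reset=none` (the pattern that, per the report above, breaks once the client gets stuck in AWAIT_VALIDATION): the application catches the reset/truncation exceptions from `poll()` and decides itself where to resume. The broker address, group id, topic, and the seek-to-beginning recovery choice are placeholders/assumptions, not taken from the report.

```
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.LogTruncationException;
import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NoResetPolicyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // "none": no automatic reset; reset decisions are pushed to the application.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic")); // placeholder topic
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
                } catch (NoOffsetForPartitionException e) {
                    // No committed offset and no reset policy: choose a starting position explicitly.
                    consumer.seekToBeginning(e.partitions());
                } catch (LogTruncationException e) {
                    // Divergence detected (e.g. after an unclean leader election) and no reset policy:
                    // the application decides how to recover; rewinding is just one possible choice.
                    consumer.seekToBeginning(e.partitions());
                }
            }
        }
    }
}
```

The report's point is that with the post-2.3 validation path, the `LogTruncationException` never reaches this `catch` block: it is raised inside the future listener, swallowed in `sendAsyncRequest` because the future is already completed, and the consumer remains stuck in AWAIT_VALIDATION.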