[jira] [Commented] (KAFKA-13840) KafkaConsumer is unable to recover connection to group coordinator after commitOffsetsAsync exception

Kyle R Stehbens (Jira) Tue, 02 Aug 2022 09:22:05 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-13840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574313#comment-17574313
 ]


Kyle R Stehbens commented on KAFKA-13840:
-----------------------------------------

Hi, we rolled back our Kafka client to 2.5.1 in all our Java apps, which is the
last known good version of the client before the change set that broke this.

Unfortunately we cannot test this change out from a trunk build in our
production environments; that's just not going to fly.

 

Personally, I don't think the change fully addresses the issue, as it doesn't
address what I think is the root cause, which is the changes before and after
this line:
[https://github.com/apache/kafka/blob/0efa8fb0f4c73d92b6e55a112fa45417a67a7dc2/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L1284]

The logic here makes it so that the call to markCoordinatorUnknown() at line
1288 is never reached, whereas before the changes in 2.6.0 this check was done
earlier, around line 1275.

 In our case, the heartbeat thread is absolutely running but cannot recover
the coordinator because of this bug.
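
To make the failure mode from the issue description concrete, here is a toy
model of a cached lookup future that is never cleared after it fails. This is
only a sketch and not the Kafka client code: StuckLookupDemo, lookupCoordinator
and the clearOnCompletion flag are made-up names, and the real
AbstractCoordinator logic is more involved.

```java
import java.util.concurrent.CompletableFuture;

// Toy model only: hypothetical names, NOT the Kafka client implementation.
class StuckLookupDemo {
    // Stands in for the cached findCoordinatorFuture.
    private CompletableFuture<String> lookupFuture;
    // true  = clear the cached future once it completes (success or failure)
    // false = never clear it, so a failed future is handed out forever
    private final boolean clearOnCompletion;
    private int lookupsSent = 0;

    StuckLookupDemo(boolean clearOnCompletion) {
        this.clearOnCompletion = clearOnCompletion;
    }

    // Returns the cached future if one exists, otherwise starts a new lookup.
    synchronized CompletableFuture<String> lookupCoordinator(boolean brokerReachable) {
        if (lookupFuture != null) {
            return lookupFuture; // may be a failed future from an earlier attempt
        }
        lookupsSent++;
        CompletableFuture<String> f = new CompletableFuture<>();
        lookupFuture = f;
        if (brokerReachable) {
            f.complete("broker-1");
        } else {
            f.completeExceptionally(new IllegalStateException("coordinator unavailable"));
        }
        if (clearOnCompletion) {
            lookupFuture = null; // allow the next caller to retry
        }
        return f;
    }

    synchronized int lookupsSent() {
        return lookupsSent;
    }

    public static void main(String[] args) {
        StuckLookupDemo stuck = new StuckLookupDemo(false);
        stuck.lookupCoordinator(false); // transient failure
        // The broker is reachable again, but the cached failed future is returned.
        System.out.println("stuck after recovery: "
                + stuck.lookupCoordinator(true).isCompletedExceptionally());

        StuckLookupDemo clearing = new StuckLookupDemo(true);
        clearing.lookupCoordinator(false); // transient failure
        System.out.println("stuck after recovery (clearing): "
                + clearing.lookupCoordinator(true).isCompletedExceptionally());
    }
}
```

In the non-clearing variant only one lookup is ever sent, which mirrors the
"coordinator unavailable forever" symptom; the clearing variant sends a second
lookup and recovers.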

 

In my opinion the 2 options here are:

1 - Revert the changes added in 2.6.0 that refactored all this code and
introduced this bug. Ostensibly that change was made to fix a very rare race
condition, but it exchanged a rare race condition for a far worse bug that is
very commonly experienced.

2 - Fix forward the changes and completely remove the findCoordinatorFuture
variable and all references to it. This future is only being used to gate calls
to recover the coordinator, and this can be achieved through other means
like locks or synchronized methods with appropriate condition checking.
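
A rough sketch of what option 2 could look like, assuming a simple in-flight
flag guarded by synchronized methods instead of a cached future. All names
here (GatedCoordinatorLookup, tryBeginLookup, completeLookup) are hypothetical
and are not part of the Kafka client; only markCoordinatorUnknown borrows its
name from the existing code.

```java
// Hypothetical sketch of gating coordinator lookups without a cached future.
// NOT the Kafka client API; names and structure are assumptions.
class GatedCoordinatorLookup {
    private boolean lookupInFlight = false;
    private String coordinator = null; // null means "coordinator unknown"
    private int lookupsStarted = 0;

    // Returns true if the caller should send a lookup request now.
    // At most one lookup is in flight, and none is sent while the
    // coordinator is already known.
    synchronized boolean tryBeginLookup() {
        if (coordinator != null || lookupInFlight) {
            return false;
        }
        lookupInFlight = true;
        lookupsStarted++;
        return true;
    }

    // Called on BOTH success and failure, so the gate always reopens.
    // Pass null when the lookup failed.
    synchronized void completeLookup(String foundOrNull) {
        lookupInFlight = false;
        coordinator = foundOrNull;
    }

    synchronized void markCoordinatorUnknown() {
        coordinator = null;
    }

    synchronized boolean coordinatorKnown() {
        return coordinator != null;
    }

    synchronized int lookupsStarted() {
        return lookupsStarted;
    }
}
```

Because there is no stored future, a failed lookup leaves nothing behind that
later callers could pick up; the next caller after a failure simply starts a
fresh lookup.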

 

[~martijnvisser] We downgraded our Kafka client to v2.5.1 in our Flink-related
projects to fix this issue for us.

> KafkaConsumer is unable to recover connection to group coordinator after 
> commitOffsetsAsync exception
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13840
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13840
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 2.6.1, 3.1.0, 2.7.2, 2.8.1, 3.0.0
>            Reporter: Kyle R Stehbens
>            Assignee: Luke Chen
>            Priority: Major
>
> Hi, I've discovered an issue with the Java Kafka client (consumer) whereby a
> timeout or any other retriable exception triggered during an async offset
> commit renders the client unable to recover its group coordinator and
> leaves the client in a broken state.
>  
> I first encountered this using v2.8.1 of the Java client, and after going
> through the code base for all versions of the client, I have found it affects
> all versions of the client from 2.6.1 onward.
> I also confirmed that by rolling back to 2.5.1, the issue is not present.
>  
> The issue stems from a change to the FindCoordinatorResponseHandler, which in
> 2.5.1 used to call clearFindCoordinatorFuture() on both success and failure
> here:
> [https://github.com/apache/kafka/blob/0efa8fb0f4c73d92b6e55a112fa45417a67a7dc2/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L783]
>  
> In all later versions of the client this call is not made:
> [https://github.com/apache/kafka/blob/839b886f9b732b151e1faeace7303c80641c08c4/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L838]
>  
> As a result, when the KafkaConsumer makes a call to
> coordinator.commitOffsetsAsync(...), if an error occurs such that the
> coordinator is unavailable here:
> [https://github.com/apache/kafka/blob/c5077c679c372589215a1b58ca84360c683aa6e8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L1007]
>  
> then the client will try to call:
> [https://github.com/apache/kafka/blob/c5077c679c372589215a1b58ca84360c683aa6e8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L1017]
> However, this will never succeed, as it perpetually returns a reference to a
> failed future, findCoordinatorFuture, that is never cleared.
>  
> This manifests as all subsequent calls to commitOffsetsAsync() throwing a
> "coordinator unavailable" exception forever, once any retriable exception
> causes the coordinator to close.
> Note we discovered this when we upgraded the Kafka client in our Flink
> consumers from 2.4.1 to 2.8.1 and subsequently needed to downgrade the
> client. We noticed this occurring in our non-Flink Java consumers too, running
> 3.x client versions.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
