[ 
https://issues.apache.org/jira/browse/KAFKA-9261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987293#comment-16987293
 ] 

ASF GitHub Bot commented on KAFKA-9261:
---------------------------------------

hachikuji commented on pull request #7772: KAFKA-9261; Client should handle 
inconsistent leader metadata
URL: https://github.com/apache/kafka/pull/7772
 
 
   This is a reduced scope fix. The purpose of this patch is to ensure that 
partition leader state is kept in sync with broker metadata in `MetadataCache` 
and consequently in `Cluster`. Due to the possibility of metadata event 
reordering, it was possible for this state to be inconsistent which could lead 
to an NPE in some cases. The test case here provides a specific scenario where 
this could happen.
   
   Also see #7770 for additional detail. 
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NPE when updating client metadata
> ---------------------------------
>
>                 Key: KAFKA-9261
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9261
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>
> We have seen the following exception recently:
> {code}
> java.lang.NullPointerException
>       at java.base/java.util.Objects.requireNonNull(Objects.java:221)
>       at org.apache.kafka.common.Cluster.<init>(Cluster.java:134)
>       at org.apache.kafka.common.Cluster.<init>(Cluster.java:89)
>       at 
> org.apache.kafka.clients.MetadataCache.computeClusterView(MetadataCache.java:120)
>       at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:82)
>       at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:58)
>       at 
> org.apache.kafka.clients.Metadata.handleMetadataResponse(Metadata.java:325)
>       at org.apache.kafka.clients.Metadata.update(Metadata.java:252)
>       at 
> org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1059)
>       at 
> org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:845)
>       at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:548)
>       at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
>       at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
>       at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281)
>       at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225)
>       at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201)
> {code}
> The client assumes that if a leader is included in the response, then node 
> information must also be available. There are at least a couple possible 
> reasons this assumption can fail:
> 1. The client is able to detect stale partition metadata using leader epoch 
> information available. If stale partition metadata is detected, the client 
> ignores it and uses the last known metadata. However, it cannot detect stale 
> broker information and will always accept the latest update. This means that 
> the latest metadata may be a mix of multiple metadata responses and therefore 
> the invariant will not generally hold.
> 2. There is no lock which protects both the fetching of partition metadata 
> and the live broker when handling a Metadata request. This means an 
> UpdateMetadata request can arrive concurrently and break the intended 
> invariant.
> It seems case 2 has been possible for a long time, but it should be extremely 
> rare. Case 1 was only made possible with KIP-320, which added the leader 
> epoch tracking. It should also be rare, but the window for inconsistent 
> metadata is probably a bit bigger than the window for a concurrent update.
> To fix this, we should make the client more defensive about metadata updates 
> and not assume that the leader is among the live endpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to