[ https://issues.apache.org/jira/browse/KAFKA-9261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manikumar reopened KAFKA-9261:
------------------------------

> NPE when updating client metadata
> ---------------------------------
>
>                 Key: KAFKA-9261
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9261
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 2.4.0, 2.3.2
>
>
> We have seen the following exception recently:
> {code}
> java.lang.NullPointerException
>       at java.base/java.util.Objects.requireNonNull(Objects.java:221)
>       at org.apache.kafka.common.Cluster.<init>(Cluster.java:134)
>       at org.apache.kafka.common.Cluster.<init>(Cluster.java:89)
>       at org.apache.kafka.clients.MetadataCache.computeClusterView(MetadataCache.java:120)
>       at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:82)
>       at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:58)
>       at org.apache.kafka.clients.Metadata.handleMetadataResponse(Metadata.java:325)
>       at org.apache.kafka.clients.Metadata.update(Metadata.java:252)
>       at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1059)
>       at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:845)
>       at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:548)
>       at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
>       at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
>       at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281)
>       at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225)
>       at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201)
> {code}
> The client assumes that if a leader is included in the response, then node
> information must also be available. There are at least a couple possible
> reasons this assumption can fail:
> 1. The client is able to detect stale partition metadata using leader epoch
> information available. If stale partition metadata is detected, the client
> ignores it and uses the last known metadata. However, it cannot detect stale
> broker information and will always accept the latest update. This means that
> the latest metadata may be a mix of multiple metadata responses and therefore
> the invariant will not generally hold.
> 2. There is no lock which protects both the fetching of partition metadata
> and the live broker when handling a Metadata request. This means an
> UpdateMetadata request can arrive concurrently and break the intended
> invariant.
> It seems case 2 has been possible for a long time, but it should be extremely
> rare. Case 1 was only made possible with KIP-320, which added the leader
> epoch tracking. It should also be rare, but the window for inconsistent
> metadata is probably a bit bigger than the window for a concurrent update.
> To fix this, we should make the client more defensive about metadata updates
> and not assume that the leader is among the live endpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
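
The defensive handling proposed in the description could look roughly like the sketch below. This is not the actual patch: MetadataSketch, Broker, PartitionMeta, resolveLeader and describe are hypothetical names made up for illustration and are not Kafka classes. The idea is only that a leader id missing from the live broker set is treated as "leader unavailable" rather than passed on to code that requires a non-null node.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types are Kafka's real classes.
final class MetadataSketch {

    /** Simplified stand-in for a live broker endpoint. */
    static final class Broker {
        final int id;
        final String host;

        Broker(int id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    /** Simplified stand-in for the metadata of one partition. */
    static final class PartitionMeta {
        final String topic;
        final int partition;
        final int leaderId;

        PartitionMeta(String topic, int partition, int leaderId) {
            this.topic = topic;
            this.partition = partition;
            this.leaderId = leaderId;
        }
    }

    /**
     * Looks up the leader among the live brokers without assuming it is there.
     * Map.get returns null for an unknown id, which becomes Optional.empty().
     */
    static Optional<Broker> resolveLeader(PartitionMeta p, Map<Integer, Broker> liveBrokers) {
        return Optional.ofNullable(liveBrokers.get(p.leaderId));
    }

    /** Describes the partition, falling back to a placeholder when the leader is unknown. */
    static String describe(PartitionMeta p, Map<Integer, Broker> liveBrokers) {
        String name = p.topic + "-" + p.partition;
        return resolveLeader(p, liveBrokers)
                .map(b -> name + " -> " + b.host)
                .orElse(name + " -> leader unavailable");
    }

    public static void main(String[] args) {
        Map<Integer, Broker> live = new HashMap<>();
        live.put(1, new Broker(1, "broker-1"));

        // Leader 1 is live; leader 2 is named in partition metadata but absent from the broker set.
        System.out.println(describe(new PartitionMeta("events", 0, 1), live));
        System.out.println(describe(new PartitionMeta("events", 1, 2), live));
    }
}
```

Returning an Optional makes the caller decide what a missing leader means (for example, requesting another metadata update), instead of failing inside a constructor as in the stack trace above.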