[ https://issues.apache.org/jira/browse/KAFKA-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
KRISHNA SARVEPALLI updated KAFKA-10791:
---------------------------------------
    Attachment: Topic-Recreated.png

> Kafka Metadata older epoch problem
> ----------------------------------
>
>                 Key: KAFKA-10791
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10791
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.2.0
>         Environment: Kubernetes cluster,
>            Reporter: KRISHNA SARVEPALLI
>            Priority: Major
>         Attachments: Kafka-Client-Issue.png, Topic-Recreated.png,
> zookeeper-leader-epoch.png, zookeeper-state.png
>
>
> We are using Kafka in production with 5 brokers and 3 ZooKeeper nodes. We
> run Kafka and ZooKeeper in Kubernetes, and storage is managed by PVCs using
> NFS. We are using a topic with 60 partitions.
> The cluster had been running successfully for almost 50 days since the last
> restart. Last week (11/28) two brokers went down. The team is still
> researching the root cause of the broker failures.
> Since we are using K8s, the brokers came back up immediately (in less than
> 5 minutes). However, the producer and consumer applications had issues
> downloading the metadata. Please check the attached images.
> We enabled debug logs for one of the applications, and it seems the Kafka
> brokers are returning metadata with a leader_epoch value of 0, whereas the
> client metadata cache holds a value of 6 for most of the partitions.
> Eventually we were forced to restart all the producer apps (around 35-40
> microservices). Because they were then downloading the metadata for the
> first time, they did not face any issue and were able to produce messages.
> As part of troubleshooting, we checked the ZooKeeper key/value pairs
> registered by Kafka and saw that leader_epoch had been set back to 0 for
> almost all partitions. We also checked another topic, which is used by
> other apps: its leader_epoch was in good shape, and its ctime and mtime were
> updated correctly. Please check the attached screenshots.
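
The mismatch described above can be sketched as a small check. This is only an illustration, not the Kafka client's actual code. It assumes that the client remembers the last leader epoch it saw for each partition and discards metadata carrying a lower epoch, and that each partition state entry in ZooKeeper is a JSON payload with a `leader_epoch` field. The function name `stale_partitions` and all sample values are hypothetical.

```python
import json


def stale_partitions(cached_epochs, znode_states):
    """Return {partition: (cached_epoch, stored_epoch)} for every partition
    whose stored leader_epoch is lower than the epoch the client has cached.

    cached_epochs: dict of partition id -> last epoch seen by the client.
    znode_states:  dict of partition id -> JSON string of the partition state.
    """
    stale = {}
    for partition, payload in znode_states.items():
        stored = json.loads(payload)["leader_epoch"]
        cached = cached_epochs.get(partition)
        # A client with no cached epoch (for example, one that was just
        # restarted) has nothing to compare against and accepts the metadata.
        if cached is not None and stored < cached:
            stale[partition] = (cached, stored)
    return stale


# Hypothetical sample data: the client cached epoch 6 for partitions 0 and 1,
# while the stored state for partition 0 has fallen back to epoch 0.
cached = {0: 6, 1: 6, 2: 0}
states = {
    0: '{"controller_epoch": 12, "leader": 1, "version": 1, "leader_epoch": 0, "isr": [1, 2, 3]}',
    1: '{"controller_epoch": 12, "leader": 2, "version": 1, "leader_epoch": 6, "isr": [2, 3, 4]}',
    2: '{"controller_epoch": 12, "leader": 3, "version": 1, "leader_epoch": 0, "isr": [3, 4, 5]}',
}
print(stale_partitions(cached, states))  # {0: (6, 0)}
```

Under these assumptions, a long-running client would keep rejecting the metadata for partition 0, while a freshly restarted client with an empty cache would accept it, which is consistent with the restart workaround reported above.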
> Please refer to the Stack Overflow issue that we have reported:
> https://stackoverflow.com/questions/65055299/kafka-producer-not-able-to-download-refresh-metadata-after-brokers-were-restar
>
> +*Broker Configs:*+
> --override zookeeper.connect=zookeeper:2181
> --override advertised.listeners=PLAINTEXT://kafka,SASL_SSL://kafka
> --override log.dirs=/opt/kafka/data/logs
> --override broker.id=kafka
> --override num.network.threads=3
> --override num.io.threads=8
> --override default.replication.factor=3
> --override auto.create.topics.enable=true
> --override delete.topic.enable=true
> --override socket.send.buffer.bytes=102400
> --override socket.receive.buffer.bytes=102400
> --override socket.request.max.bytes=104857600
> --override num.partitions=30
> --override num.recovery.threads.per.data.dir=1
> --override offsets.topic.replication.factor=3
> --override transaction.state.log.replication.factor=3
> --override transaction.state.log.min.isr=1
> --override log.retention.hours=48
> --override log.segment.bytes=1073741824
> --override log.retention.check.interval.ms=300000
> --override zookeeper.connection.timeout.ms=6000
> --override confluent.support.metrics.enable=true
> --override group.initial.rebalance.delay.ms=0
> --override confluent.support.customer.id=anonymous
> --override ssl.truststore.location=kafka.broker.truststore.jks
> --override ssl.truststore.password=changeit
> --override ssl.keystore.location=kafka.broker.keystore.jks
> --override ssl.keystore.password=changeit
> --override ssl.keystore.type=PKCS12
> --override ssl.key.password=changeit
> --override listeners=SASL_SSL://0.0.0.0:9093,PLAINTEXT://0.0.0.0:9092
> --override authorizer_class_name=kafka.security.auth.SimpleAclAuthorizer
> --override ssl.endpoint.identification.algorithm
> --override ssl.client.auth=requested
> --override sasl.enabled.mechanisms=SCRAM-SHA-512
> --override sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
> --override security.inter.broker.protocol=SASL_SSL
> --override super.users=test:test
> --override zookeeper.set.acl=false

--
This message was sent by Atlassian Jira (v8.3.4#803005)