[ 
https://issues.apache.org/jira/browse/KAFKA-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KRISHNA SARVEPALLI updated KAFKA-10791:
---------------------------------------
    Attachment: Topic-Recreated.png

> Kafka Metadata older epoch problem
> ----------------------------------
>
>                 Key: KAFKA-10791
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10791
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.2.0
>            Environment: Kubernetes cluster
>            Reporter: KRISHNA SARVEPALLI
>            Priority: Major
>         Attachments: Kafka-Client-Issue.png, Topic-Recreated.png, 
> zookeeper-leader-epoch.png, zookeeper-state.png
>
>
> We are running Kafka in production with 5 brokers and 3 ZooKeeper nodes. Both 
> Kafka and ZooKeeper run in Kubernetes, with storage managed by PVCs backed by 
> NFS. The affected topic has 60 partitions.
> The cluster had been running successfully for almost 50 days since the last 
> restart. Last week (11/28) two brokers went down; the team is still 
> investigating the root cause of the broker failures.
> Since we are running on K8s, the brokers came back up immediately (in less 
> than 5 minutes). However, the producer and consumer applications then had 
> issues refreshing metadata. Please check the attached images.
> We enabled debug logs for one of the applications, and it appears the Kafka 
> brokers are returning metadata with a leader_epoch of 0, whereas the client's 
> metadata cache held a value of 6 for most of the partitions, so the clients 
> seem to reject those responses as stale. Eventually we were forced to restart 
> all the producer apps (around 35-40 microservices); after the restart they 
> fetched metadata for the first time with no cached epoch to compare against, 
> so they had no issue and were able to produce messages.
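> For context, the behavior looks like the client-side leader-epoch fencing 
> added in KIP-320: the Java client ignores metadata whose leader_epoch is 
> lower than the one it already has cached. The sketch below is only an 
> illustration of that check, not the actual client code; the class and method 
> names are hypothetical.
>
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical sketch of the KIP-320 style check the client applies when
> // merging a metadata response into its cache.
> final class PartitionEpochCache {
>     private final Map<Integer, Integer> cachedLeaderEpoch = new HashMap<>();
>
>     boolean shouldApplyUpdate(int partition, int responseLeaderEpoch) {
>         Integer cached = cachedLeaderEpoch.get(partition);
>         if (cached != null && responseLeaderEpoch < cached) {
>             // e.g. cached epoch 6 vs. response epoch 0: the update is dropped,
>             // so the client keeps retrying with its stale view of the cluster.
>             return false;
>         }
>         cachedLeaderEpoch.put(partition, responseLeaderEpoch);
>         return true;
>     }
> }
>
> In our case the brokers kept returning epoch 0 while the cache held 6, so the 
> cache was never updated until the apps were restarted with an empty cache.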
> As part of troubleshooting, we checked the ZooKeeper znodes registered by 
> Kafka and can see that leader_epoch was reset to 0 for almost all partitions 
> of this topic. We also checked another topic used by other apps: its 
> leader_epoch values looked healthy, and its ctime and mtime were updated 
> correctly. Please check the attached screenshots.
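> For reference, the per-partition state can be read directly from ZooKeeper; 
> each partition has a znode at /brokers/topics/<topic>/partitions/<n>/state 
> whose JSON payload includes the leader_epoch. Below is a minimal Java sketch 
> using the plain ZooKeeper client (the topic name is a placeholder; the 
> connect string matches our zookeeper:2181 setting). The same znodes can also 
> be inspected with zookeeper-shell.sh.
>
> import java.nio.charset.StandardCharsets;
> import java.util.concurrent.CountDownLatch;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class PartitionStateCheck {
>     public static void main(String[] args) throws Exception {
>         String topic = args.length > 0 ? args[0] : "my-topic"; // placeholder topic name
>         CountDownLatch connected = new CountDownLatch(1);
>         ZooKeeper zk = new ZooKeeper("zookeeper:2181", 10_000, event -> {
>             if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
>                 connected.countDown();
>             }
>         });
>         connected.await();
>         // Each state znode holds JSON like:
>         // {"controller_epoch":12,"leader":3,"version":1,"leader_epoch":6,"isr":[3,1,2]}
>         for (int p = 0; p < 60; p++) { // our topic has 60 partitions
>             String path = "/brokers/topics/" + topic + "/partitions/" + p + "/state";
>             byte[] data = zk.getData(path, false, null);
>             System.out.println(path + " -> " + new String(data, StandardCharsets.UTF_8));
>         }
>         zk.close();
>     }
> }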
> Please refer to the Stack Overflow question we reported:
> https://stackoverflow.com/questions/65055299/kafka-producer-not-able-to-download-refresh-metadata-after-brokers-were-restar
>  
> +*Broker Configs:*+
> --override zookeeper.connect=zookeeper:2181 
>  --override advertised.listeners=PLAINTEXT://kafka,SASL_SSL://kafka
>  --override log.dirs=/opt/kafka/data/logs 
>  --override broker.id=kafka
>  --override num.network.threads=3 
>  --override num.io.threads=8 
>  --override default.replication.factor=3 
>  --override auto.create.topics.enable=true 
>  --override delete.topic.enable=true 
>  --override socket.send.buffer.bytes=102400 
>  --override socket.receive.buffer.bytes=102400 
>  --override socket.request.max.bytes=104857600 
>  --override num.partitions=30 
>  --override num.recovery.threads.per.data.dir=1 
>  --override offsets.topic.replication.factor=3 
>  --override transaction.state.log.replication.factor=3 
>  --override transaction.state.log.min.isr=1 
>  --override log.retention.hours=48 
>  --override log.segment.bytes=1073741824 
>  --override log.retention.check.interval.ms=300000 
>  --override zookeeper.connection.timeout.ms=6000 
>  --override confluent.support.metrics.enable=true 
>  --override group.initial.rebalance.delay.ms=0 
>  --override confluent.support.customer.id=anonymous 
>  --override ssl.truststore.location=kafka.broker.truststore.jks 
>  --override ssl.truststore.password=changeit 
>  --override ssl.keystore.location=kafka.broker.keystore.jks 
>  --override ssl.keystore.password=changeit 
>  --override ssl.keystore.type=PKCS12 
>  --override ssl.key.password=changeit 
>  --override listeners=SASL_SSL://0.0.0.0:9093,PLAINTEXT://0.0.0.0:9092 
>  --override authorizer_class_name=kafka.security.auth.SimpleAclAuthorizer 
>  --override ssl.endpoint.identification.algorithm 
>  --override ssl.client.auth=requested 
>  --override sasl.enabled.mechanisms=SCRAM-SHA-512 
>  --override sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 
>  --override security.inter.broker.protocol=SASL_SSL 
>  --override super.users=test:test
>  --override zookeeper.set.acl=false
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
