While we were testing, our producer had following configuration max.in.flight.requests.per.connection=1, acks= all and retries=3.
The entire producer side set is below. The consumer has manual offset commit, it commit offset after it has successfully processed the message. Producer setting bootstrap.servers= {point the F5 VS fronting Kafka cluster} key.serializer= {appropriate value as per your cases} value.serializer= {appropriate value as per your case} acks= all retries=3 ssl.key.password= {appropriate value as per your case} ssl.keystore.location= {appropriate value as per your case} ssl.keystore.password= {appropriate value as per your case} ssl.truststore.location= {appropriate value as per your case} ssl.truststore.password= {appropriate value as per your case} batch.size=16384 client.id= {appropriate value as per your case, may help with debugging} max.block.ms=65000 request.timeout.ms=30000 security.protocol= SSL ssl.enabled.protocols=TLSv1.2 ssl.keystore.type=JKS ssl.protocol=TLSv1.2 ssl.truststore.type=JKS max.in.flight.requests.per.connection=1 metadata.fetch.timeout.ms=60000 reconnect.backoff.ms=1000 retry.backoff.ms=1000 max.request.size=1048576 linger.ms=0 Consumer setting bootstrap.servers= {point the F5 VS fronting Kafka cluster} key.deserializer= {appropriate value as per your cases} value.deserializer= {appropriate value as per your case} group.id= {appropriate value as per your case} ssl.key.password= {appropriate value as per your case} ssl.keystore.location= {appropriate value as per your case} ssl.keystore.password= {appropriate value as per your case} ssl.truststore.location= {appropriate value as per your case} ssl.truststore.password= {appropriate value as per your case} enable.auto.commit=false security.protocol= SSL ssl.enabled.protocols=TLSv1.2 ssl.keystore.type=JKS ssl.protocol=TLSv1.2 ssl.truststore.type=JKS client.id= {appropriate value as per your case, may help with debugging} reconnect.backoff.ms=1000 retry.backoff.ms=1000 Thanks, Shri -----Original Message----- From: Hans Jespersen [mailto:h...@confluent.io] Sent: Tuesday, April 18, 2017 7:57 PM To: users@kafka.apache.org Subject: [EXTERNAL] Re: ZK and Kafka failover testing ***** Notice: This email was received from an external source ***** When you publish, is acks=0,1 or all (-1)? What is max.in.flight.requests.per.connection (default is 5)? It sounds to me like your publishers are using acks=0 and so they are not actually succeeding in publishing (i.e. you are getting no acks) but they will retry over and over and will have up to 5 retries in flight, so when the broker comes back up, you are getting 4 or 5 copies of the same message. Try setting max.in.flight.requests.per.connection=1 to get rid of duplicates Try setting acks=all to ensure the messages are being persisted by the leader and all the available replicas in the kafka cluster. -hans /** * Hans Jespersen, Principal Systems Engineer, Confluent Inc. * h...@confluent.io (650)924-2670 */ On Tue, Apr 18, 2017 at 4:10 PM, Shrikant Patel <spa...@pdxinc.com> wrote: > Hi All, > > I am seeing strange behavior between ZK and Kafka. We have 5 node in > ZK and Kafka cluster each. Kafka version - 2.11-0.10.1.1 > > The min.insync.replicas is 3, replication.factor is 5 for all topics, > unclean.leader.election.enable is false. We have 15 partitions for > each topic. > > The step we are following in our testing. > > > * My understanding is that ZK needs aleast 3 out of 5 server to be > functional. Kafka could not be functional without zookeeper. In out > testing, we bring down 3 ZK nodes and don't touch Kafka nodes. Kafka > is still functional, consumer\producer can still consume\publish from > Kafka cluster. We then bring down all ZK nodes, Kafka > consumer\producers are still functional. I am not able to understand > why Kafka cluster is not failing as soon as majority of ZK nodes are > down. I do see error in Kafka that it cannot connection to ZK cluster. > > > > * With all or majority of ZK node down, we bring down 1 Kafka > nodes (out of 5, so 4 are running). And at that point the consumer and > producer start failing. My guess is the new leadership election cannot > happen without ZK. > > > > * Then we bring up the majority of ZK node up. (1st Kafka is still > down) Now the Kafka cluster become functional, consumer and producer > now start working again. But Consumer sees big junk of message from > kafka, and many of them are duplicates. It's like these messages were > held up somewhere, Where\Why I don't know? And why the duplicates? I > can understand few duplicates for messages that consumer would not > commit before 1st node when down. But why so many duplicates and like > 4 copy for each message. I cannot understand this behavior. > > Appreciate some insight about our issues. Also if there are blogs that > describe the ZK and Kafka failover scenario behaviors, that would be > extremely helpful. > > Thanks, > Shri > > This e-mail and its contents (to include attachments) are the property > of National Health Systems, Inc., its subsidiaries and affiliates, > including but not limited to Rx.com Community Healthcare Network, Inc. > and its subsidiaries, and may contain confidential and proprietary or > privileged information. If you are not the intended recipient of this > e-mail, you are hereby notified that any unauthorized disclosure, > copying, or distribution of this e-mail or of its attachments, or the > taking of any unauthorized action based on information contained herein is > strictly prohibited. > Unauthorized use of information contained herein may subject you to > civil and criminal prosecution and penalties. If you are not the > intended recipient, please immediately notify the sender by telephone > at > 800-433-5719 or return e-mail and permanently delete the original e-mail. > This e-mail and its contents (to include attachments) are the property of National Health Systems, Inc., its subsidiaries and affiliates, including but not limited to Rx.com Community Healthcare Network, Inc. and its subsidiaries, and may contain confidential and proprietary or privileged information. If you are not the intended recipient of this e-mail, you are hereby notified that any unauthorized disclosure, copying, or distribution of this e-mail or of its attachments, or the taking of any unauthorized action based on information contained herein is strictly prohibited. Unauthorized use of information contained herein may subject you to civil and criminal prosecution and penalties. If you are not the intended recipient, please immediately notify the sender by telephone at 800-433-5719 or return e-mail and permanently delete the original e-mail.