RE: Re: ZK and Kafka failover testing

Shrikant Patel Wed, 19 Apr 2017 07:57:21 -0700

While we were testing, our producer had following configuration 
max.in.flight.requests.per.connection=1, acks= all and retries=3.


The entire producer side set is below. The consumer has manual offset commit, 
it commit offset after it has successfully processed the message.

Producer setting
bootstrap.servers= {point the F5 VS fronting Kafka cluster}
key.serializer= {appropriate value as per your cases}
value.serializer= {appropriate value as per your case}
acks= all
retries=3
ssl.key.password= {appropriate value as per your case}
ssl.keystore.location= {appropriate value as per your case}
ssl.keystore.password= {appropriate value as per your case}
ssl.truststore.location= {appropriate value as per your case}
ssl.truststore.password= {appropriate value as per your case}
batch.size=16384
client.id= {appropriate value as per your case, may help with debugging}
max.block.ms=65000
request.timeout.ms=30000
security.protocol= SSL
ssl.enabled.protocols=TLSv1.2
ssl.keystore.type=JKS
ssl.protocol=TLSv1.2
ssl.truststore.type=JKS
max.in.flight.requests.per.connection=1
metadata.fetch.timeout.ms=60000
reconnect.backoff.ms=1000
retry.backoff.ms=1000
max.request.size=1048576
linger.ms=0

Consumer setting
bootstrap.servers= {point the F5 VS fronting Kafka cluster}
key.deserializer= {appropriate value as per your cases}
value.deserializer= {appropriate value as per your case}
group.id= {appropriate value as per your case}
ssl.key.password= {appropriate value as per your case}
ssl.keystore.location= {appropriate value as per your case}
ssl.keystore.password= {appropriate value as per your case}
ssl.truststore.location= {appropriate value as per your case}
ssl.truststore.password= {appropriate value as per your case}
enable.auto.commit=false
security.protocol= SSL
ssl.enabled.protocols=TLSv1.2
ssl.keystore.type=JKS
ssl.protocol=TLSv1.2
ssl.truststore.type=JKS
client.id= {appropriate value as per your case, may help with debugging}
reconnect.backoff.ms=1000
retry.backoff.ms=1000

Thanks,
Shri

-----Original Message-----
From: Hans Jespersen [mailto:h...@confluent.io]
Sent: Tuesday, April 18, 2017 7:57 PM
To: users@kafka.apache.org
Subject: [EXTERNAL] Re: ZK and Kafka failover testing

***** Notice: This email was received from an external source *****

When you publish, is acks=0,1 or all (-1)?
What is max.in.flight.requests.per.connection (default is 5)?

It sounds to me like your publishers are using acks=0 and so they are not 
actually succeeding in publishing (i.e. you are getting no acks) but they will 
retry over and over and will have up to 5 retries in flight, so when the broker 
comes back up, you are getting 4 or 5 copies of the same message.

Try setting max.in.flight.requests.per.connection=1 to get rid of duplicates 
Try setting acks=all to ensure the messages are being persisted by the leader 
and all the available replicas in the kafka cluster.

-hans

/**
 * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
 * h...@confluent.io (650)924-2670
 */

On Tue, Apr 18, 2017 at 4:10 PM, Shrikant Patel <spa...@pdxinc.com> wrote:

> Hi All,
>
> I am seeing strange behavior between ZK and Kafka. We have 5 node in
> ZK and Kafka cluster each. Kafka version - 2.11-0.10.1.1
>
> The min.insync.replicas is 3, replication.factor is 5 for all topics,
> unclean.leader.election.enable is false. We have 15 partitions for
> each topic.
>
> The step we are following in our testing.
>
>
> *         My understanding is that ZK needs aleast 3 out of 5 server to be
> functional. Kafka could not be functional without zookeeper. In out
> testing, we bring down 3 ZK nodes and don't touch Kafka nodes. Kafka
> is still functional, consumer\producer can still consume\publish from
> Kafka cluster. We then bring down all ZK nodes, Kafka
> consumer\producers are still functional. I am not able to understand
> why Kafka cluster is not failing as soon as majority of ZK nodes are
> down. I do see error in Kafka that it cannot connection to ZK cluster.
>
>
>
> *         With all or majority of ZK node down, we bring down 1 Kafka
> nodes (out of 5, so 4 are running). And at that point the consumer and
> producer start failing. My guess is the new leadership election cannot
> happen without ZK.
>
>
>
> *         Then we bring up the majority of ZK node up. (1st Kafka is still
> down) Now the Kafka cluster become functional, consumer and producer
> now start working again. But Consumer sees big junk of message from
> kafka, and many of them are duplicates. It's like these messages were
> held up somewhere, Where\Why I don't know?  And why the duplicates? I
> can understand few duplicates for messages that consumer would not
> commit before 1st node when down. But why so many duplicates and like
> 4 copy for each message. I cannot understand this behavior.
>
> Appreciate some insight about our issues. Also if there are blogs that
> describe the ZK and Kafka failover scenario behaviors, that would be
> extremely helpful.
>
> Thanks,
> Shri
>
> This e-mail and its contents (to include attachments) are the property
> of National Health Systems, Inc., its subsidiaries and affiliates,
> including but not limited to Rx.com Community Healthcare Network, Inc.
> and its subsidiaries, and may contain confidential and proprietary or
> privileged information. If you are not the intended recipient of this
> e-mail, you are hereby notified that any unauthorized disclosure,
> copying, or distribution of this e-mail or of its attachments, or the
> taking of any unauthorized action based on information contained herein is 
> strictly prohibited.
> Unauthorized use of information contained herein may subject you to
> civil and criminal prosecution and penalties. If you are not the
> intended recipient, please immediately notify the sender by telephone
> at
> 800-433-5719 or return e-mail and permanently delete the original e-mail.
>
This e-mail and its contents (to include attachments) are the property of 
National Health Systems, Inc., its subsidiaries and affiliates, including but 
not limited to Rx.com Community Healthcare Network, Inc. and its subsidiaries, 
and may contain confidential and proprietary or privileged information. If you 
are not the intended recipient of this e-mail, you are hereby notified that any 
unauthorized disclosure, copying, or distribution of this e-mail or of its 
attachments, or the taking of any unauthorized action based on information 
contained herein is strictly prohibited. Unauthorized use of information 
contained herein may subject you to civil and criminal prosecution and 
penalties. If you are not the intended recipient, please immediately notify the 
sender by telephone at 800-433-5719 or return e-mail and permanently delete the 
original e-mail.

RE: Re: ZK and Kafka failover testing

Reply via email to