On Sun, Jan 23, 2022 at 7:51 AM Edward Capriolo <edlinuxg...@gmail.com> wrote:
> Hello,
>
> The bulk of this thread is to discuss
> [KAFKA-9893] Configurable TCP connection timeout and improve the initial
> metadata fetch - ASF JIRA (apache.org)
> <https://issues.apache.org/jira/browse/KAFKA-9893>. IMHO it should be
> considered a BUG FIX and potentially backported, but others may disagree.
> Let me tell you about my environment:
>
> We run Kafka 2.2.x, 2.5.x, and even 2.6.x. We have clients using
> spark-streaming, spring boot / spring-kafka, and folks using the Kafka
> producer directly. Clusters are 3-12 nodes, 10 topics, 48 partitions.
>
> We actually have a chaos monkey in our UAT environment that shuts down
> brokers and entire datacenters/racks of brokers. We run it against clusters
> each day, but simply shutting down brokers does not produce the problem.
>
> We observed: there is a huge distinction between the Kafka broker being
> down while the host is up, and Kafka being down because the *host is down.*
>
> Before KAFKA-9893 the second case was handled poorly. If you look at how
> the metadata connection works, it randomly picks hosts from the list, and
> sometimes requires 2 random hosts for round-trip operations.
>
> Here is what we did, with all our apps (spark streaming, spring boot,
> etc.): we shut off a server physically (pick host 1 in the metadata broker
> list and shut down the physical hardware). Spark streaming simply could not
> go forward, getting frequent timeouts and failing tasks (this may be due to
> our number of topics and partitions, 7 topics, 48 partitions), not sure.
>
> The fix is pretty simple. Even the latest spark streaming is still only
> looking at kafka-clients 2.6.0. We simply updated the kafka-clients
> artifact (maven) to 2.7.2 and set the timeout to something like 3 seconds,
> and the process runs while a node is down. The good news is that
> kafka-clients generally seems backwards compatible, and even things
> compiled against kafka-clients 2.0 do not seem to have a problem having
> kafka-clients swapped in at runtime.
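For anyone wanting to try the same change, here is a minimal sketch of the client settings involved. The property names come from KIP-601 (which KAFKA-9893 implemented, available in kafka-clients 2.7+); the broker hostnames and the concrete timeout values are illustrative, with 3 seconds mirroring the value described above:

```java
import java.util.Properties;

public class TimeoutConfig {
    public static Properties clientProps() {
        Properties props = new Properties();
        // Placeholder brokers -- substitute your own metadata broker list.
        props.setProperty("bootstrap.servers",
                "broker1:9092,broker2:9092,broker3:9092");
        // KIP-601 setting: how long the client waits for a TCP connection to
        // be established before giving up and trying another broker. The
        // default is much higher; 3s matches the value used in this thread.
        props.setProperty("socket.connection.setup.timeout.ms", "3000");
        // Upper bound for the exponential backoff applied to repeated
        // connection-setup attempts (illustrative value).
        props.setProperty("socket.connection.setup.timeout.max.ms", "10000");
        return props;
    }

    public static void main(String[] args) {
        // Prints the configured connection-setup timeout in milliseconds.
        System.out.println(clientProps()
                .getProperty("socket.connection.setup.timeout.ms"));
    }
}
```

These same properties can be passed to a KafkaProducer or KafkaConsumer constructor unchanged; they only take effect once the 2.7+ client jar is actually on the classpath.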
> The other mitigation we are doing is introducing a GSLB-based round-robin
> load balancer here (using DNS). We assume this will work well, but honestly
> it somewhat defeats the purpose of metadata.broker.list.
>
> Recap: IMHO, based on what I have seen, I would advise everyone to update
> their clients to 2.7.2 and set the timeout defined in the jira (but your
> mileage may vary). And since you're probably patching log4j now anyway, you
> might as well update the kafka deps at the same time.
>
> Please discuss if others have seen this issue, or if this is only something
> that affects me.
>
> Also, to be clear: upgrading to 2.7.2 alone does not fix the issue. We had
> to set the timeout property to something lower than the default (3
> seconds). The issue happens because if the OS is up but Kafka is down, the
> TCP reply from the closed port comes back fast, but if the host is down,
> the client's connection timeout settings are a big factor.
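For reference, the dependency bump described in the recap is just a version override in the application's pom; the groupId/artifactId below are the standard Kafka client coordinates, and 2.7.2 is the version recommended in this thread:

```xml
<!-- Override the transitive kafka-clients (e.g. the 2.6.0 pulled in by
     spark-streaming) with a version carrying the KAFKA-9893 fix. -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>2.7.2</version>
</dependency>
```

As noted above, the version bump on its own does nothing for the dead-host case; the new timeout property still has to be set explicitly.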