Is this maybe related to https://issues.apache.org/jira/browse/KAFKA-3686 
or https://issues.apache.org/jira/browse/KAFKA-4089 ?

Thanks,
- Sini



From:   Peter Sinoros Szabo/Hungary/IBM
To:     users@kafka.apache.org
Date:   2017/05/25 17:01
Subject:        vpn vs TimeoutException


Hi,

Please help me to understand the following situation and to fix the 
problem.
My servers needs a vpn connection to access the kafka brokers. The vpn 
connection restarts periodically and in some of these cases, the kafka 
producer send call's callback return with TimeoutExceptions. I'd like to 
prevent these exceptions and rely on the kafka client's retry feature to 
deliver the messages reliably.

>From the logs I see that the vpn connection restarts in 4 seconds between 
May 25 05:30:55 and May 25 05:30:59 while it runs several ip command like 
"ip route del..., ip add del..., ip link set..., ip addr add..., ip route 
add..."  and after that from 05:31:19.793 I get kafka errors, see logs 
below. 

What I do not understand is that the first NETWORK_EXCEPTION comes much 
after the VPN is operational again, it says that it will retry and after 
that I got the TimeoutExceptions. From the retry count (119 left) I 
thought it would keep retrying for a while, backoff is 300ms, so it should 
take 36 seconds. But the first TimeoutException is 1-7 seconds after the 
NETWORK_EXCEPTION message. Maybe those two are unrelated. I suppose the 
TCP connection is gone while the vpn connection restarts, but it seems 
that the kafka client does not try to reestablish the connection, I do not 
know if it should, but it would be great. Other clients from other 
machines are fine, brokers is ok also.

So please shed some light on this, mostly if this can be avoided with some 
simple configuration :), that would be great.

The Kafka client related logs:
17/05/25 05:31:19.793 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282068 on topic-partition uhs-updates-7, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.793 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282067 on topic-partition uhs-updates-1, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.793 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282066 on topic-partition uhs-updates-7, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.793 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282065 on topic-partition uhs-updates-1, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.794 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282063 on topic-partition uhs-updates-1, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.794 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282077 on topic-partition uhs-updates-5, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.794 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282076 on topic-partition uhs-updates-8, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.794 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282075 on topic-partition uhs-updates-5, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.794 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282074 on topic-partition uhs-updates-2, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:19.794 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282064 on topic-partition uhs-updates-2, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:25.539 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282073 on topic-partition uhs-updates-9, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:25.539 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282072 on topic-partition uhs-updates-6, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:25.539 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282071 on topic-partition uhs-updates-9, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:25.539 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282070 on topic-partition uhs-updates-6, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:25.539 WARN  [kafka-producer-network-thread | producer-1] 
[] o.a.k.c.producer.internals.Sender - Got error produce response with 
correlation id 34282069 on topic-partition uhs-updates-3, retrying (119 
attempts left). Error: NETWORK_EXCEPTION
17/05/25 05:31:26.407 ERROR [kafka-producer-network-thread | producer-1] 
[] t.ustream.kafka.client.AsyncProducer - uhs-updates:[22789631] 
org.apache.kafka.common.errors.TimeoutException
org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for 
uhs-updates-0 due to 30003 ms has passed since batch creation plus linger 
time

This last line repeats about 200 times, with record numbers varying 
between 48 and 51, partition numbers are [0,3,6,9] and the ms value is 
somewhat larger then 30000. 22789631 is the message key, uhs-updates is 
the topic name. For the full list see: https://pastebin.com/raw/4Yz1yTJU

Thanks,
- Sini



Reply via email to