Hello everyone,
We are running a 5-broker Kafka v0.10.0.0 cluster on AWS; the Connect API is
also at v0.10.0.0.
We observed that the distributed Kafka connector went into an infinite loop
of the log message

(Re-)joining group connect-connect-elasticsearch-indexer.

After a little more digging, we found another infinite loop of a pair of log
messages:

Discovered coordinator 1.2.3.4:9092 (id: ____ rack: null) for group x

Marking the coordinator 1.2.3.4:9092 (id: ____ rack: null) dead for group x

Restarting Kafka Connect did not help.

Looking at ZooKeeper, we realized that broker 1.2.3.4 had left the ZooKeeper
cluster due to a timeout while interacting with ZooKeeper. That broker was
also the controller. Controller failover happened and the applications were
fine, but a few days later we started facing the issue described above. To
add to the surprise, the Kafka daemon was still running on the host but was
no longer trying to contact ZooKeeper at all. Hence, a zombie broker.

Also, a Connect cluster spread across multiple hosts was not working;
however, a single connector worked.
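
For context, our workers run in distributed mode and share a configuration
roughly like the sketch below (the hostnames, group id, and topic names here
are illustrative placeholders, not our actual values):

    # connect-distributed.properties (placeholders)
    bootstrap.servers=broker1:9092,broker2:9092,broker3:9092,broker4:9092,broker5:9092
    group.id=connect-elasticsearch-indexer
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    config.storage.topic=connect-configs
    offset.storage.topic=connect-offsets
    status.storage.topic=connect-status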

After replacing the EC2 instance for broker 1.2.3.4 (keeping the same broker
ID), the Kafka Connect cluster started working fine.

I am not sure if this is a bug, but Kafka Connect shouldn't keep retrying
the same broker if it is not able to establish a working connection. We use
Consul for service discovery. Since the broker process was running and port
9092 was open, Consul reported it as healthy and kept resolving DNS queries
to that broker whenever the connector tried to connect. We have disabled DNS
caching in the Java options, so Kafka Connect should have tried a different
host on each lookup, but instead it only ever tried broker 1.2.3.4.
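
To clarify what "disabled DNS caching in the Java options" means on our side,
here is a minimal sketch of the kind of check I had in mind, assuming the
standard networkaddress.cache.ttl security property and a placeholder Consul
name (kafka.service.consul is not our actual record):

    import java.net.InetAddress;
    import java.security.Security;

    public class DnsCacheCheck {
        public static void main(String[] args) throws Exception {
            // Turn off the JVM's positive DNS cache before the first lookup
            // so that every resolution goes back to DNS (Consul in our case).
            Security.setProperty("networkaddress.cache.ttl", "0");

            // Placeholder for the Consul name used in bootstrap.servers.
            String broker = args.length > 0 ? args[0] : "kafka.service.consul";

            // Resolve the name a few times. With caching disabled, Consul is
            // queried on each pass and may return a different broker address.
            for (int i = 0; i < 5; i++) {
                for (InetAddress addr : InetAddress.getAllByName(broker)) {
                    System.out.println("lookup " + i + " -> " + addr.getHostAddress());
                }
                Thread.sleep(1000);
            }
        }
    }

With caching truly off, I would expect the bootstrap lookups to rotate across
healthy brokers, which is why the fixation on 1.2.3.4 surprised us.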

Does Kafka Connect internally cache DNS? Is there a debugging step I am
missing here?
-- 
Anish Samir Mashankar
