Hi, I'm using Kafka version 0.11.0.2. My cluster has 4 nodes running Kafka, 3 of which also run Zookeeper. I have a few producer processes that publish to Kafka and multiple consumer processes, a streaming engine (Spark) that ingests from Kafka and also publishes data back to Kafka, and a distributed data store (Druid) that reads all messages from Kafka and stores them in the DB. Druid also uses the same Zookeeper cluster that Kafka uses for cluster state management.

*Kafka Configs:*
1) No replication is used
2) Number of network threads: 30
3) Number of IO threads: 8
4) Machines have 64GB RAM and 16 cores
5) 3 topics with 64 partitions per topic

*Questions:*

*1) Partitions going offline:*
I frequently see partitions going offline. When that happens, the scheduling delay of the Spark application increases and the input rate gets jittery. I also tried enabling replication to see if it would help, but it made little difference. What could be causing this? Lack of resources, or a cluster misconfiguration? Could a large number of receiver processes be the cause?

*2) Colocation of Zookeeper and Kafka:*
As mentioned above, 3 of my nodes run both Zookeeper and Kafka. Both components are containerized and run inside Docker containers. I found a few blogs that recommend against colocating them for performance reasons. Is it necessary to run them on dedicated machines?

*3) Using the same Zookeeper cluster across different components:*
In my cluster, the same Zookeeper cluster handles state management for both the Kafka cluster and the Druid cluster. Could this make the overall system unstable?

I hope I've covered all the necessary information. Please let me know if you need more details about my cluster.

Thanks in advance,
Avinash
--
Excuse brevity and typos. Sent from mobile device.