Hi. I've got a 5-node cluster running Kafka 0.8.1, with 4697 partitions (2 replicas each) across 564 topics. I'm sending it about 1% of our total messaging load now, and several times a day there is a period where anywhere from 1 to ~1500 partitions have one replica not in sync. Is this normal? If a consumer is reading from a replica that gets deemed "not in sync", does it get redirected to the good replica? Is there a partition count beyond which maintenance tasks become infeasible?
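(To be concrete about "not in sync": I mean partitions that a check along these lines would list as under-replicated. Sketch only; I'm assuming the 0.8.1 kafka-topics.sh supports these flags, and zk1:2181 stands in for the real ZooKeeper connect string.)

  bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions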
Relevant config bits:

  auto.leader.rebalance.enable=true
  leader.imbalance.per.broker.percentage=20
  leader.imbalance.check.interval.seconds=30
  replica.lag.time.max.ms=10000
  replica.lag.max.messages=4000
  num.replica.fetchers=4
  replica.fetch.max.bytes=10485760

Not necessarily correlated with those periods, I see a lot of these errors in the logs:

[2014-10-20 21:23:26,999] 21963614 [ReplicaFetcherThread-3-1] ERROR kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-3-1], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 77423; ClientId: ReplicaFetcherThread-3-1; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: ...

And a few of these:

[2014-10-20 21:23:39,555] 3467527 [kafka-scheduler-2] ERROR kafka.utils.ZkUtils$ - Conditional update of path /brokers/topics/foo.bar/partitions/3/state with data {"controller_epoch":11,"leader":3,"version":1,"leader_epoch":109,"isr":[3]} and expected version 197 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/foo.bar/partitions/3/state

And this one, which I assume is a client closing the connection non-gracefully, so it should probably be a warning rather than an error:

[2014-10-20 21:54:15,599] 23812214 [kafka-processor-9092-3] ERROR kafka.network.Processor - Closing socket for /10.31.0.224 because of error

-neil
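P.S. To look at the znode that conditional update is failing on, the stock ZooKeeper CLI works; roughly (zk1:2181 again standing in for the real ensemble):

  bin/zkCli.sh -server zk1:2181
  get /brokers/topics/foo.bar/partitions/3/state

The get output ends with the node's Stat, whose dataVersion is the "expected version" that the update in the error above is being checked against.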