Hi. I've got a 5-node cluster running Kafka 0.8.1, with 4697 partitions (2 replicas each) across 564 topics. I'm sending it about 1% of our total messaging load now, and several times a day there is a period where anywhere from 1 to ~1500 partitions have one replica not in sync. Is this normal? If a consumer is reading from a replica that gets deemed "not in sync", does it get redirected to the good replica? Is there a partition count beyond which maintenance tasks become infeasible?
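(To be concrete about "not in sync": I mean partitions that a check along these lines would list as under-replicated. Sketch only; I'm assuming the 0.8.1 kafka-topics.sh supports these flags, and zk1:2181 stands in for the real ZooKeeper connect string.)

  bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions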
Relevant config bits:

  auto.leader.rebalance.enable=true
  leader.imbalance.per.broker.percentage=20
  leader.imbalance.check.interval.seconds=30
  replica.lag.time.max.ms=10000
  replica.lag.max.messages=4000
  num.replica.fetchers=4
  replica.fetch.max.bytes=10485760

Not necessarily correlated with those periods, I see a lot of these errors in the logs:

[2014-10-20 21:23:26,999] 21963614 [ReplicaFetcherThread-3-1] ERROR kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-3-1], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 77423; ClientId: ReplicaFetcherThread-3-1; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: ...

And a few of these:

[2014-10-20 21:23:39,555] 3467527 [kafka-scheduler-2] ERROR kafka.utils.ZkUtils$ - Conditional update of path /brokers/topics/foo.bar/partitions/3/state with data {"controller_epoch":11,"leader":3,"version":1,"leader_epoch":109,"isr":[3]} and expected version 197 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/foo.bar/partitions/3/state

And this one, which I assume is a client closing the connection non-gracefully, so it should probably be a warning rather than an error:

[2014-10-20 21:54:15,599] 23812214 [kafka-processor-9092-3] ERROR kafka.network.Processor - Closing socket for /10.31.0.224 because of error

-neil
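P.S. To look at the znode that conditional update is failing on, the stock ZooKeeper CLI works; roughly (zk1:2181 again standing in for the real ensemble):

  bin/zkCli.sh -server zk1:2181
  get /brokers/topics/foo.bar/partitions/3/state

The get output ends with the node's Stat, whose dataVersion is the "expected version" that the update in the error above is being checked against.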