Hi everyone,

Yesterday we had a lot of network failures running our Kafka cluster
(0.9.0.1, ~40 nodes). We run everything with the higher-durability settings
to avoid data loss: producers use acks=all/-1, topics/brokers have
min.insync.replicas=2, unclean.leader.election.enable=false, and all topics
have 3 replicas.

This isn't the first time this has happened to us. When trying to bring the
cluster back online, brokers would die on startup with:

2016-08-22 16:49:34,365 FATAL kafka.server.ReplicaFetcherThread:
[ReplicaFetcherThread-2-6], Halting because log truncation is not allowed
for topic XXX, Current leader 6's latest offset 333005055 is less than
replica 31's latest offset 333005155

In this case the broker we were starting (31) had a higher offset than the
running broker (6).

Our team ended up just trying different combinations of startup orders to
get the cluster back online. That got most of the brokers back, but they
struggled with the last couple: for the partition that was giving us
trouble, they had to copy the Kafka log files from the two previously
in-sync replicas (which had the higher offsets) over to the broker that had
the lower offset but had been elected leader.

Our guess as to why we had so much trouble during startup is that, with so
many partitions and a replication factor of 3, we had a spider web of
partition assignments and possibly a deadlock: a broker could be the
appropriate in-sync leader for some partitions but not in-sync for others,
which would cause it to fail on startup.

So, from all this, do you have any suggestions on what we could do better
next time?

Also, doesn't the fact that Kafka elected a broker with a lower offset as
leader mean an unclean leader election occurred? It seems like we are only
saved by the error message on the other brokers during startup indicating
it happened, and by the fact that we set min.insync.replicas=2 and
acks=-1/all; otherwise writes could come in to that leader, its offset
could then end up higher, and I would imagine no error would occur at all.
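
To make that last point concrete, here is a rough sketch of what I mean
(reusing the producer from the config sketch above; the exception handling
is only illustrative): with acks=all and min.insync.replicas=2, a write to
a partition whose ISR has shrunk below 2 should be rejected rather than
quietly landing on a lagging leader.

    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.errors.NotEnoughReplicasException;

    // Reusing the producer from above; topic name is a placeholder.
    producer.send(new ProducerRecord<>("some-topic", "key", "value"), new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception e) {
            if (e instanceof NotEnoughReplicasException) {
                // Fewer than min.insync.replicas replicas were in sync, so
                // the broker rejected the write instead of acknowledging it.
                System.err.println("Write rejected: " + e.getMessage());
            }
        }
    });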
