Bryan,

https://issues.apache.org/jira/browse/KAFKA-3410 reported a similar issue,
but it only happened when the leader broker's log was manually deleted. In
your case, was there any data loss on the broker due to something like a
power outage?

Thanks,

Jun

On Tue, Aug 23, 2016 at 9:00 AM, Bryan Baugher <bjb...@gmail.com> wrote:

> Hi everyone,
>
> Yesterday we had lots of network failures running our Kafka cluster
> (0.9.0.1, ~40 nodes). We run everything with the higher-durability settings
> in order to avoid data loss: producers use acks=all/-1, topics/brokers have
> min.insync.replicas=2, unclean.leader.election.enable=false, and all
> topics have 3 replicas.
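>
> For reference, here is a minimal sketch of the settings above as they would
> appear in the broker and producer configs (property names per the Kafka 0.9
> docs; anything beyond the values we stated is illustrative):
>
> ```properties
> # server.properties (broker / topic defaults)
> min.insync.replicas=2
> unclean.leader.election.enable=false
> default.replication.factor=3
>
> # producer config
> acks=all
> ```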
>
> This isn't the first time this has happened to us. When trying to bring the
> cluster back online, brokers would die on startup with:
>
> 2016-08-22 16:49:34,365 FATAL kafka.server.ReplicaFetcherThread:
> [ReplicaFetcherThread-2-6], Halting because log truncation is not allowed
> for topic XXX, Current leader 6's latest offset 333005055 is less than
> replica 31's latest offset 333005155
>
> In this case the broker we were starting (31) had a higher offset than the
> running broker (6).
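>
> In case it helps others diagnose this: one way to check a replica's latest
> offset on disk, independent of the fetcher error, is to dump the newest log
> segment on each broker with the DumpLogSegments tool that ships with Kafka
> (the log dir and segment file name below are illustrative):
>
> ```shell
> # On each broker, find the newest .log segment for the partition, then
> # dump its last few entries to see the highest offset written locally.
> ls /var/kafka-logs/XXX-0/
> bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
>   --files /var/kafka-logs/XXX-0/00000000000333000000.log | tail -n 5
> ```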
>
> Our team ended up trying all different combinations of start orders to get
> the cluster back online. This brought most brokers back, but they struggled
> with the last couple: for the partition that was giving us trouble, they
> had to copy Kafka log files from the two previously in-sync replicas with
> higher offsets to the broker that had the lower offset but was elected
> leader.
>
> Our guess as to why we had so much trouble during startup is that, with so
> many partitions and a replication factor of 3, we had a spider web of
> partition assignments and possibly a deadlock: some brokers would be the
> appropriate in-sync leaders for some partitions but not in-sync for
> others, which would cause brokers to fail on startup.
>
> So, given all this, do you have any suggestions on what we could do better
> next time?
>
> Also, doesn't the fact that Kafka elected a broker with a lower offset as
> leader mean an unclean leader election occurred? It seems we were only
> saved by the error message on the other brokers during startup indicating
> it happened, and by the fact that we set min.insync.replicas=2 and
> acks=-1/all; otherwise writes could have come in to that leader, its
> offset could have become higher, and I would imagine no error would occur.
>
