Hello Bryan,

I think you were encountering https://issues.apache.org/jira/browse/KAFKA-3410. Maybe you can take a look at that ticket and see if it matches your scenario.
Guozhang

On Tue, Aug 23, 2016 at 9:00 AM, Bryan Baugher <bjb...@gmail.com> wrote:

> Hi everyone,
>
> Yesterday we had lots of network failures running our Kafka cluster
> (0.9.0.1, ~40 nodes). We run everything with the higher-durability
> settings in order to avoid any data loss: producers use acks=all/-1,
> topics/brokers have min insync replicas = 2, unclean leader election =
> false, and all topics have 3 replicas.
>
> This isn't the first time this has happened to us. When trying to bring
> the cluster back online, brokers would die on startup with:
>
> 2016-08-22 16:49:34,365 FATAL kafka.server.ReplicaFetcherThread:
> [ReplicaFetcherThread-2-6], Halting because log truncation is not allowed
> for topic XXX, Current leader 6's latest offset 333005055 is less than
> replica 31's latest offset 333005155
>
> In this case the broker we were starting (31) had a higher offset than
> the running broker (6).
>
> Our team ended up trying all different combinations of start orders to
> get the cluster back online. They managed to get most of the brokers back
> this way, but struggled with the last couple, where they had to copy the
> Kafka log files for the troublesome partition from the two brokers that
> had previously been in sync (and had the higher offsets) to the broker
> that had the lower offset but was elected leader.
>
> Our guess as to why we had so much trouble during startup is that, with
> so many partitions and a replication factor of 3, we had a spider web of
> partitions and possibly a deadlock, where some brokers were the
> appropriate in-sync leaders for some partitions but not in sync for
> others, which would cause brokers to fail on startup.
>
> So from this, do you have any suggestions on what we could do better
> next time?
>
> Also, doesn't the fact that Kafka elected a broker with a lower offset as
> leader mean that an unclean leader election occurred? It seems like we
> are only saved by the error message on the other brokers during startup
> indicating it happened, and by the fact that we set min insync replicas
> to 2 and acks to -1/all; otherwise writes could come in to that leader,
> its offset could then be higher, and I would imagine no error would
> occur.

--
-- Guozhang
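For reference, the durability settings Bryan describes above map roughly to the following properties (a minimal sketch assuming Kafka 0.9.x configuration names; acks is a producer setting, the rest are broker/topic settings, and the values are the ones from his message):

    # broker / per-topic configuration
    min.insync.replicas=2
    unclean.leader.election.enable=false
    default.replication.factor=3   # or --replication-factor 3 at topic creation

    # producer configuration
    acks=all    # same as acks=-1

With this combination a write is only acknowledged once it is in at least two in-sync replicas, and a replica outside the ISR cannot be elected leader, which is consistent with the brokers halting ("log truncation is not allowed") rather than silently truncating their logs.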