We didn't suffer any data loss, nor was there any power outage that I know of.
On Fri, Aug 26, 2016 at 5:14 AM Khurrum Nasim <khurrumnas...@gmail.com> wrote:

> On Tue, Aug 23, 2016 at 9:00 AM, Bryan Baugher <bjb...@gmail.com> wrote:
>
> > > Hi everyone,
> > >
> > > Yesterday we had lots of network failures running our Kafka cluster
> > > (0.9.0.1, ~40 nodes). We run everything with the higher-durability
> > > settings in order to avoid data loss: producers use acks=all/-1,
> > > topics/brokers have min.insync.replicas=2, unclean leader election is
> > > disabled, and all topics have 3 replicas.
>
> We also hit a few similar data loss issues before. It made us concerned
> about putting critical data into Kafka. Apache DistributedLog seems to be
> very good at durability and strong consistency. We are actually evaluating
> it as Kafka's backend.
>
> - KN
>
> > > This isn't the first time this has happened to us. When trying to
> > > bring the cluster back online, brokers would die on startup with:
> > >
> > > 2016-08-22 16:49:34,365 FATAL kafka.server.ReplicaFetcherThread:
> > > [ReplicaFetcherThread-2-6], Halting because log truncation is not
> > > allowed for topic XXX, Current leader 6's latest offset 333005055 is
> > > less than replica 31's latest offset 333005155
> > >
> > > In this case the broker we were starting (31) had a higher offset than
> > > the running broker (6).
> > >
> > > Our team ended up trying all different combinations of start orders to
> > > get the cluster back online. They managed to get most brokers back
> > > online this way, but struggled with the last couple, where they had to
> > > copy Kafka log files for the troublesome partition from the two
> > > replicas that were previously in sync (with higher offsets) to the
> > > broker that had the lower offset but was elected leader.
> > > Our guess as to why we had so much trouble during startup is that,
> > > with so many partitions and a replication factor of 3, we had a spider
> > > web of partitions and possibly a deadlock, where some brokers would be
> > > the appropriate in-sync leaders for some partitions but not in sync
> > > for others, which would cause brokers to fail on startup.
> > >
> > > So from this, do you have any suggestions on what we could do better
> > > next time?
> > >
> > > Also, doesn't the fact that Kafka elected a broker as leader with a
> > > lower offset mean an unclean leader election occurred? It seems we
> > > were only saved by the error message on the other brokers during
> > > startup indicating it happened, and by the fact that we set
> > > min.insync.replicas=2 and acks=-1/all; otherwise writes could come in
> > > to that leader, its offset could then be higher, and I would imagine
> > > no error would occur.
> >
> > --
> > -- Guozhang
> >
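For reference, the durability settings Bryan describes map onto configuration roughly like the following. This is only a sketch: the topic name, partition count, and ZooKeeper address are placeholders, not values from the thread.

```shell
# server.properties (broker side) -- the settings discussed above
#   min.insync.replicas=2                  # a write needs >= 2 in-sync replicas
#   unclean.leader.election.enable=false   # never elect an out-of-sync replica as leader

# Create a topic with 3 replicas and a topic-level min.insync.replicas
# override (topic name, partition count, and ZK address are placeholders):
bin/kafka-topics.sh --create \
  --zookeeper zk1:2181 \
  --topic my-critical-topic \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2

# Producer configuration:
#   acks=all   # same as acks=-1: leader waits for all in-sync replicas
```

With this combination a produced record is only acknowledged once it is on at least two replicas, which is why the startup error message (rather than silent divergence) was the worst outcome here.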