Yes, it's quite likely we saw many ZK session losses for the brokers around
the same time. I'll keep an eye on that JIRA and let you know if we come up
with anything else.

On Fri, Aug 26, 2016 at 11:44 AM Jun Rao <j...@confluent.io> wrote:

> Bryan,
>
> Were there multiple brokers losing ZK session around the same time? There
> is one known issue https://issues.apache.org/jira/browse/KAFKA-1211.
> Basically, if the leader changes too quickly, it's possible for a follower
> to truncate some previously committed messages and then immediately become
> the new leader. This can potentially cause the FATAL error. We do plan to
> fix KAFKA-1211 in the future, but it may take some time.
>
> Thanks,
>
> Jun
>
> On Fri, Aug 26, 2016 at 6:53 AM, Bryan Baugher <bjb...@gmail.com> wrote:
>
> > We didn't suffer any data loss nor was there any power outage that I know
> > of.
> >
> > On Fri, Aug 26, 2016 at 5:14 AM Khurrum Nasim <khurrumnas...@gmail.com>
> > wrote:
> >
> > > On Tue, Aug 23, 2016 at 9:00 AM, Bryan Baugher <bjb...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Yesterday we had lots of network failures running our Kafka cluster
> > > > > (0.9.0.1, ~40 nodes). We run everything with the higher-durability
> > > > > settings in order to avoid data loss: producers use all/-1 acks,
> > > > > topics/brokers have min insync replicas = 2, unclean leader election
> > > > > = false, and all topics have 3 replicas.
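> > > > >
> > > > > In config terms that's roughly the following (names as in the 0.9
> > > > > docs, values as described above):
> > > > >
> > > > >   # producer
> > > > >   acks=all
> > > > >
> > > > >   # broker / per-topic
> > > > >   min.insync.replicas=2
> > > > >   unclean.leader.election.enable=false
> > > > >   default.replication.factor=3  # or --replication-factor 3 at topic creation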
> > > >
> > >
> > > We also hit a few similar data loss issues before. It made us concerned
> > > about putting critical data into Kafka.
> > > Apache DistributedLog seems to do very well on durability and strong
> > > consistency. We are actually evaluating it as Kafka's backend.
> > >
> > > - KN
> > >
> > >
> > > > >
> > > > > This isn't the first time this has happened to us. When trying to
> > > > > bring the cluster back online, brokers would die on startup with:
> > > > >
> > > > > 2016-08-22 16:49:34,365 FATAL kafka.server.ReplicaFetcherThread:
> > > > > [ReplicaFetcherThread-2-6], Halting because log truncation is not
> > > > > allowed for topic XXX, Current leader 6's latest offset 333005055 is
> > > > > less than replica 31's latest offset 333005155
> > > > >
> > > > > In this case the broker we were starting (31) had a higher offset
> > > > > than the running broker (6).
> > > > >
> > > > > Our team ended up just trying different combinations of start orders
> > > > > to get the cluster back online. They managed to get most of the
> > > > > brokers back online that way, but struggled with the last couple,
> > > > > where they had to copy the Kafka log files for the partition that was
> > > > > giving us trouble from the 2 previously in-sync replicas with higher
> > > > > offsets over to the broker that had the lower offset but was elected
> > > > > leader.
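> > > > >
> > > > > Roughly something like this per stuck partition, with the target
> > > > > broker stopped first (the log directory path and partition number
> > > > > below are just placeholders, not our real layout):
> > > > >
> > > > >   # on broker 6 (lower offset, but elected leader), after stopping it:
> > > > >   rsync -a broker31:/path/to/kafka-logs/XXX-0/ /path/to/kafka-logs/XXX-0/
> > > > >   # then start broker 6 again so it comes back with the longer log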
> > > > >
> > > > > Our guess as to why we had so much trouble during startup is that,
> > > > > with so many partitions and a replication factor of 3, we had a
> > > > > spider web of partitions and possibly a deadlock: some brokers would
> > > > > be the appropriate in-sync leaders for some partitions but not in
> > > > > sync for other partitions, which would cause brokers to fail on
> > > > > startup.
> > > > >
> > > > > So from all of this, do you have any suggestions on what we could do
> > > > > better next time?
> > > > >
> > > > > Also, doesn't the fact that Kafka elected a broker as leader with a
> > > > > lower offset mean an unclean leader election occurred? It seems we
> > > > > are only saved by the error message on the other brokers during
> > > > > startup indicating it happened, and by the fact that we set min
> > > > > insync replicas to 2 and acks to -1/all; otherwise writes could come
> > > > > in to that leader, the offset could then be higher, and I would
> > > > > imagine no error would occur.
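> > > > >
> > > > > i.e. with the producer configured roughly like this (a sketch; the
> > > > > class name and broker address are just placeholders), a send to a
> > > > > partition with fewer than 2 in-sync replicas fails with
> > > > > NotEnoughReplicasException instead of quietly landing on a lone
> > > > > leader:
> > > > >
> > > > >   import java.util.Properties;
> > > > >   import org.apache.kafka.clients.producer.*;
> > > > >
> > > > >   public class DurableSend {
> > > > >     public static void main(String[] args) throws Exception {
> > > > >       Properties props = new Properties();
> > > > >       props.put("bootstrap.servers", "broker6:9092"); // placeholder
> > > > >       props.put("acks", "all"); // wait for all in-sync replicas
> > > > >       props.put("key.serializer",
> > > > >           "org.apache.kafka.common.serialization.StringSerializer");
> > > > >       props.put("value.serializer",
> > > > >           "org.apache.kafka.common.serialization.StringSerializer");
> > > > >       try (Producer<String, String> producer = new KafkaProducer<>(props)) {
> > > > >         // with min.insync.replicas=2 on the topic, this send fails
> > > > >         // (NotEnoughReplicasException) when only one replica is in
> > > > >         // sync, instead of being accepted by a lone leader
> > > > >         producer.send(new ProducerRecord<>("XXX", "key", "value")).get();
> > > > >       }
> > > > >     }
> > > > >   }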
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>
