Bryan,

Were there multiple brokers losing their ZK sessions around the same time? There is one known issue: https://issues.apache.org/jira/browse/KAFKA-1211. Basically, if the leader changes too quickly, it's possible for a follower to truncate some previously committed messages and then immediately become the new leader. This can potentially cause the FATAL error. We do plan to fix KAFKA-1211 in the future, but it may take some time.
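For reference, here is a minimal sketch of the producer-side durability settings discussed in the thread below, using the Kafka Java client. The bootstrap address and topic name are placeholders rather than anything from Bryan's setup, and the broker/topic configs are only noted in comments:

import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class DurableProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder address
        props.put("acks", "all");                          // wait for the full ISR before considering a write committed
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Broker/topic side (not set here): min.insync.replicas=2,
        // unclean.leader.election.enable=false, replication factor 3,
        // matching the settings described in the thread.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            Future<RecordMetadata> ack =
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            // Block on the acknowledgement so a NotEnoughReplicas error surfaces here.
            System.out.println("committed at offset " + ack.get().offset());
        }
    }
}

With acks=all and min.insync.replicas=2, a write is only acknowledged once two replicas have it, which is what limits how much data a truncating follower can lose in the KAFKA-1211 scenario.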
Thanks,

Jun

On Fri, Aug 26, 2016 at 6:53 AM, Bryan Baugher <bjb...@gmail.com> wrote:

> We didn't suffer any data loss, nor was there any power outage that I
> know of.
>
> On Fri, Aug 26, 2016 at 5:14 AM Khurrum Nasim <khurrumnas...@gmail.com>
> wrote:
>
> > On Tue, Aug 23, 2016 at 9:00 AM, Bryan Baugher <bjb...@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > Yesterday we had lots of network failures running our Kafka cluster
> > > (0.9.0.1, ~40 nodes). We run everything with the higher-durability
> > > settings in order to avoid any data loss: producers use acks=all/-1,
> > > topics/brokers have min.insync.replicas=2, unclean leader election is
> > > disabled, and all topics have 3 replicas.
> >
> > We also hit a few similar data loss issues before. It made us concerned
> > about putting critical data into Kafka. Apache DistributedLog seems to
> > be very good at durability and strong consistency. We are actually
> > evaluating it as Kafka's backend.
> >
> > - KN
> >
> > > This isn't the first time this has happened to us. When trying to
> > > bring the cluster back online, brokers would die on startup with:
> > >
> > > 2016-08-22 16:49:34,365 FATAL kafka.server.ReplicaFetcherThread:
> > > [ReplicaFetcherThread-2-6], Halting because log truncation is not
> > > allowed for topic XXX, Current leader 6's latest offset 333005055 is
> > > less than replica 31's latest offset 333005155
> > >
> > > In this case the broker we were starting (31) had a higher offset than
> > > the running broker (6).
> > >
> > > Our team ended up trying all different combinations of start orders to
> > > get the cluster back online. They managed to get most of the brokers
> > > back online this way, but struggled with the last couple, where they
> > > had to copy Kafka log files for the partition that was giving us
> > > trouble from the two replicas that were previously in sync (with
> > > higher offsets) to the broker with the lower offset that had been
> > > elected leader.
> > >
> > > Our guess as to why we had so much trouble during startup is that,
> > > with so many partitions and a replication factor of 3, we had a spider
> > > web of partitions and possibly a deadlock, where some brokers would be
> > > the appropriate in-sync leaders for some partitions but not in sync
> > > for others, which would cause brokers to fail on startup.
> > >
> > > So from this, do you have any suggestions on what we could do better
> > > next time?
> > >
> > > Also, doesn't the fact that Kafka elected a broker as leader with a
> > > lower offset mean an unclean leader election occurred? It seems like
> > > we are only saved by the error message on the other brokers during
> > > startup indicating it happened, and by the fact that we set
> > > min.insync.replicas=2 and acks=-1/all; otherwise writes could come in
> > > to that leader, its offset could then be higher, and I would imagine
> > > no error would occur.
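Not an official recovery procedure, but one way to sanity-check the "latest offset" comparison from the FATAL message quoted above before picking a restart order is to read the current leader's latest offset for the partition with the 0.9 Java consumer. A sketch, with the broker address, topic, and partition as placeholders:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LeaderEndOffsetSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker6:9092");    // placeholder: any reachable broker
        props.put("group.id", "offset-inspection");        // throwaway group; nothing is committed
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        TopicPartition tp = new TopicPartition("XXX", 0);   // placeholder topic/partition
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToEnd(tp);   // 0.9.x varargs signature; newer clients take a Collection
            long leaderLatestOffset = consumer.position(tp);
            System.out.println(tp + " latest offset on current leader: " + leaderLatestOffset);
        }
    }
}

Comparing this value against the offset a halting replica reports for itself would at least show, before copying log files around, which broker actually has the most data for that partition.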