Unable to start cluster after crash (0.8.2.2)

Anthony Sparks Wed, 24 Feb 2016 07:49:42 -0800

Hello,

Our Kafka cluster (3 servers, each running both Zookeeper and Kafka)
crashed; of the 6 processes, only one Zookeeper instance remained alive.
The logs do not indicate much.  The only errors shown were:


2016-02-21T12:21:36.881+0000: 27445381.013: [GC (Allocation Failure) 27445381.013: [ParNew: 136472K->159K(153344K), 0.0047077 secs] 139578K->3265K(507264K), 0.0048552 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

These errors appeared in both the Zookeeper and the Kafka logs, and it
appears they have been happening every day (with no impact on Kafka, except
for maybe now?).

The crash is concerning, but not as concerning as what we are encountering
right now.  I am unable to get the cluster back up.  Two of the three nodes
halt with this fatal error:

[2016-02-23 21:18:47,251] FATAL [ReplicaFetcherThread-0-0], Halting because log truncation is not allowed for topic audit_data, Current leader 0's latest offset 52844816 is less than replica 1's latest offset 52844835 (kafka.server.ReplicaFetcherThread)

The other node that manages to stay alive is unable to fulfill writes
because we have min.ack set to 2 on the producers (requiring at least two
nodes to be available).  We could change this, but that doesn't fix our
overall problem.

In browsing the Kafka code, in ReplicaFetcherThread.scala there is this
little nugget:

// Prior to truncating the follower's log, ensure that doing so is not disallowed by the configuration for unclean leader election.
// This situation could only happen if the unclean election configuration for a topic changes while a replica is down. Otherwise,
// we should never encounter this situation since a non-ISR leader cannot be elected if disallowed by the broker configuration.
if (!LogConfig.fromProps(brokerConfig.toProps, AdminUtils.fetchTopicConfig(replicaMgr.zkClient,
  topicAndPartition.topic)).uncleanLeaderElectionEnable) {
    // Log a fatal error and shutdown the broker to ensure that data loss does not unexpectedly occur.
    fatal("Halting because log truncation is not allowed for topic %s,".format(topicAndPartition.topic) +
      " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
      .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId, replica.logEndOffset.messageOffset))
    Runtime.getRuntime.halt(1)
}
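
To make the check above concrete, here is a minimal, simplified model of that
branch. It is only a sketch under stated assumptions: the names FetcherAction,
LeaderNotBehind, TruncateTo, Halt and TruncationCheck.decide are hypothetical
and are not part of Kafka, and the real ReplicaFetcherThread does more than
this.

```scala
// Hypothetical, simplified model of the follower-side check quoted above.
// None of these types exist in Kafka; they only illustrate the decision.
sealed trait FetcherAction
// The leader's log end offset is not behind this replica's, so the quoted check does not apply.
case object LeaderNotBehind extends FetcherAction
// The follower would truncate its log to the leader's end offset, dropping messages.
final case class TruncateTo(offset: Long, messagesDropped: Long) extends FetcherAction
// The broker halts rather than lose data.
final case class Halt(reason: String) extends FetcherAction

object TruncationCheck {
  def decide(leaderEndOffset: Long,
             replicaEndOffset: Long,
             uncleanLeaderElectionEnable: Boolean): FetcherAction = {
    if (leaderEndOffset >= replicaEndOffset) {
      LeaderNotBehind
    } else if (uncleanLeaderElectionEnable) {
      // Truncation is permitted: everything past the leader's end offset is discarded.
      TruncateTo(leaderEndOffset, replicaEndOffset - leaderEndOffset)
    } else {
      // Truncation is not permitted: mirror the fatal path in the quoted code.
      Halt(s"leader offset $leaderEndOffset is less than replica offset $replicaEndOffset")
    }
  }
}
```

With the offsets from the fatal log line, TruncationCheck.decide(52844816L,
52844835L, false) gives a Halt, while passing true instead would give a
truncation that drops 19 messages.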

Each of our Kafka instances is configured with:

unclean.leader.election.enable=false

This hasn't changed at all since we deployed the cluster (verified by file
modification stamps).  To me this indicates that the assertion in the above
comment is incorrect: we have encountered a non-ISR leader being elected even
though the brokers are configured not to allow it.

Any ideas on how to work around this?

Thank you,

Tony Sparks
