[
https://issues.apache.org/jira/browse/KAFKA-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337080#comment-15337080
]
James Cheng commented on KAFKA-3861:
------------------------------------
Semi-related to https://issues.apache.org/jira/browse/KAFKA-3410
> Shrunk ISR before leader crash makes the partition unavailable
> --------------------------------------------------------------
>
> Key: KAFKA-3861
> URL: https://issues.apache.org/jira/browse/KAFKA-3861
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.10.0.0
> Reporter: Maysam Yabandeh
>
> We observed a case in which the leader experienced a crash and lost its in-memory
> data and latest HW offsets. Normally Kafka should be safe and able to make
> progress with a single node failure. However, a few seconds before the crash
> the leader shrank its ISR to itself, which is safe since min.insync.replicas
> is 2 and the replication factor is 3, so the troubled leader cannot accept new
> produce requests. After the crash, however, the controller could not name any
> of the followers as the new leader since, as far as the controller knows, they
> are not in the ISR and could potentially be behind the last leader.
> Note that unclean leader election is disabled in this cluster since the
> cluster requires a very high degree of durability and cannot tolerate data
> loss.
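> For concreteness, the configuration described above corresponds to settings
> roughly like the following (a sketch in config-file form; names as in Kafka 0.10):
> {code}
> # topic/broker configuration assumed in the scenario above
> default.replication.factor=3          # each partition has 3 replicas
> min.insync.replicas=2                 # acks=all produce requests need 2 in-sync replicas
> unclean.leader.election.enable=false  # never elect a leader from outside the ISR
> {code}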
> The impact can get worse if the admin brings the crashed broker back up in an
> attempt to make such partitions available again; this would take down even
> more brokers, as the followers halt when they find their log end offset is
> larger than the leader's:
> {code}
> if (leaderEndOffset < replica.logEndOffset.messageOffset) {
>   // Prior to truncating the follower's log, ensure that doing so is not disallowed by the configuration for unclean leader election.
>   // This situation could only happen if the unclean election configuration for a topic changes while a replica is down. Otherwise,
>   // we should never encounter this situation since a non-ISR leader cannot be elected if disallowed by the broker configuration.
>   if (!LogConfig.fromProps(brokerConfig.originals, AdminUtils.fetchEntityConfig(replicaMgr.zkUtils,
>     ConfigType.Topic, topicAndPartition.topic)).uncleanLeaderElectionEnable) {
>     // Log a fatal error and shutdown the broker to ensure that data loss does not unexpectedly occur.
>     fatal("Halting because log truncation is not allowed for topic %s,".format(topicAndPartition.topic) +
>       " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
>         .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId, replica.logEndOffset.messageOffset))
>     Runtime.getRuntime.halt(1)
>   }
> {code}
> One hackish workaround would be for the admin to investigate the logs,
> determine that unclean leader election would be safe in this particular case,
> and temporarily enable it (while the crashed node is down) until new leaders
> are elected for the affected partitions, then wait for the topics' LEO to
> advance far enough and bring the crashed node up again. This manual process
> is, however, slow and error-prone, and the cluster suffers partial
> unavailability in the meanwhile.
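> On 0.10 that temporary toggle would look roughly as follows (a sketch; the
> topic name and ZooKeeper address are hypothetical):
> {code}
> # temporarily allow unclean election for the affected topic
> bin/kafka-configs.sh --zookeeper zk:2181 --alter --entity-type topics \
>   --entity-name my-topic --add-config unclean.leader.election.enable=true
>
> # after new leaders are elected and the LEO has advanced, revert:
> bin/kafka-configs.sh --zookeeper zk:2181 --alter --entity-type topics \
>   --entity-name my-topic --delete-config unclean.leader.election.enable
> {code}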
> We are thinking of having the controller make an exception for this case: if
> the ISR size is less than min.insync.replicas and the new leader would be -1,
> the controller sends an RPC to all the replicas to inquire about their latest
> offsets, and if all the replicas respond it chooses the one with the largest
> offset as the new leader as well as the new ISR. Note that the controller
> cannot do that if any of the non-leader replicas does not respond, since the
> responding replicas might not have been part of the last ISR and hence could
> be behind the others (and the controller could not know that, since it does
> not keep track of previous ISRs). A sketch of this election rule follows.
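> A minimal sketch of the proposed rule (the offset RPC and all names here are
> hypothetical, not existing controller code):
> {code}
> // Fallback election: only used when the normal election yields no leader (-1)
> // and the last known ISR is smaller than min.insync.replicas.
> def electByLargestOffset(
>     assignedReplicas: Seq[Int],
>     fetchLatestOffset: Int => Option[Long] // hypothetical RPC: broker id -> its log end offset
> ): Option[(Int, Seq[Int])] = {
>   // Ask every assigned replica for its latest offset.
>   val offsets = assignedReplicas.map(r => r -> fetchLatestOffset(r))
>   // If any replica fails to respond, it might have been the sole member of the
>   // last ISR and ahead of everyone who did respond, so we must stay unavailable.
>   if (offsets.exists(_._2.isEmpty)) None
>   else {
>     // All replicas responded: since acknowledged writes reached at least
>     // min.insync.replicas replicas, the replica with the largest offset has
>     // all acknowledged data, so electing it is safe.
>     val (newLeader, _) = offsets.map { case (r, o) => r -> o.get }.maxBy(_._2)
>     Some((newLeader, Seq(newLeader))) // new leader plus a singleton new ISR
>   }
> }
> {code}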
> The pros are that Kafka would remain safely available when such cases occur,
> without requiring any admin intervention. The con, however, is that having
> the controller talk to brokers inside the leader election function would
> break the existing pattern in the source code, as currently the leader is
> elected locally without requiring any additional RPC.
> Thoughts?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)