[ https://issues.apache.org/jira/browse/KAFKA-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang updated KAFKA-1647:
---------------------------------
    Description: 
We ran into this scenario recently in a production environment. It can happen 
when enough brokers in a cluster are taken down that all replicas for some 
partition are offline at once; a rolling bounce done properly should not cause 
this issue.

Here is a sample scenario:

* Cluster of three brokers: b0, b1, b2
* Two partitions (of some topic) with replication factor two: p0, p1
* Initial state:
** p0: leader = b0, ISR = {b0, b1}
** p1: leader = b1, ISR = {b0, b1}
* Do a parallel hard-kill of all brokers
* Bring up b2, so it is the new controller
* b2 initializes its controller context and populates its leader/ISR cache 
(i.e., controllerContext.partitionLeadershipInfo) from ZooKeeper. The last 
known leaders are b0 (for p0) and b1 (for p1).
* Bring up b1
* The controller's onBrokerStartup procedure initiates a replica state change 
for all replicas on b1 to become online. As part of this replica state change 
it gets the last known leader and ISR and sends a LeaderAndIsrRequest to b1 
(for p0 and p1). This LeaderAndIsrRequest contains: {{p0: leader=b0; p1: 
leader=b1; leaders=b1}}. b0 is indicated as the leader of p0 but it is not 
included in the leaders field because b0 is down.
* On receiving the LeaderAndIsrRequest, b1's replica manager will successfully 
make itself (b1) the leader for p1 (and create the local replica object 
corresponding to p1). It will, however, abort the become-follower transition 
for p0 because the designated leader b0 is offline, so it will not create the 
local replica object for p0.
* It will then start the high water mark checkpoint thread. Since only p1 has a 
local replica object, only p1's high water mark will be checkpointed to disk; 
p0's previously written checkpoint entry, if any, will be lost when the file is 
rewritten (see the sketch after this list).
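
To make the failure concrete, below is a minimal sketch (Scala, with made-up 
class and object names; not the actual broker code) of a high water mark 
checkpoint file being rewritten from only the partitions that currently have 
local replica objects. The file layout shown (a version line, an entry count, 
then one "topic partition highWatermark" line per entry) is an assumption 
loosely modeled on the replication-offset-checkpoint file:

{code:scala}
import java.io.{File, PrintWriter}
import scala.io.Source

// Hypothetical stand-in type for illustration only.
case class TopicAndPartition(topic: String, partition: Int)

object HighWatermarkCheckpointSketch {

  // The checkpoint is a small text file: a version line, an entry-count line,
  // then one "topic partition highWatermark" line per local replica.
  def write(file: File, hwms: Map[TopicAndPartition, Long]): Unit = {
    val writer = new PrintWriter(file)
    try {
      writer.println(0)          // version
      writer.println(hwms.size)  // number of entries
      hwms.foreach { case (tp, hw) =>
        writer.println(s"${tp.topic} ${tp.partition} $hw")
      }
    } finally writer.close()
  }

  def read(file: File): Map[TopicAndPartition, Long] = {
    val src = Source.fromFile(file)
    try {
      src.getLines().drop(2).map { line =>   // skip the version and count lines
        val Array(topic, partition, hw) = line.split(" ")
        TopicAndPartition(topic, partition.toInt) -> hw.toLong
      }.toMap
    } finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("replication-offset-checkpoint", "")
    // Before the hard kill: both partitions were checkpointed.
    write(file, Map(TopicAndPartition("t", 0) -> 100L,
                    TopicAndPartition("t", 1) -> 200L))
    // After the restart only p1 has a local replica object, so the periodic
    // checkpoint rewrites the whole file from that set alone; p0's entry is gone.
    write(file, Map(TopicAndPartition("t", 1) -> 200L))
    println(read(file))  // Map(TopicAndPartition(t,1) -> 200)
  }
}
{code}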

So, in summary, it seems we should always create the local replica object even 
if the online transition does not happen.
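
A minimal sketch of that idea, using simplified, hypothetical types (this is 
not the actual ReplicaManager/Partition code): create the local replica object 
unconditionally, before checking whether the designated leader is alive, so the 
checkpoint thread keeps writing an entry for the partition even when the 
become-follower transition is aborted.

{code:scala}
// Simplified, hypothetical types for illustration; not the actual Kafka classes.
final case class Replica(brokerId: Int, var highWatermark: Long = 0L)

final class Partition(val topic: String, val partitionId: Int, localBrokerId: Int) {
  private var localReplica: Option[Replica] = None
  var leaderId: Option[Int] = None

  def getOrCreateLocalReplica(): Replica = {
    if (localReplica.isEmpty) localReplica = Some(Replica(localBrokerId))
    localReplica.get
  }

  def hasLocalReplica: Boolean = localReplica.isDefined
}

object MakeFollowerSketch {
  // Proposed ordering: materialize the local replica first, then decide
  // whether fetching from the designated leader can actually start.
  def makeFollower(partition: Partition, newLeaderId: Int, liveBrokers: Set[Int]): Unit = {
    partition.getOrCreateLocalReplica()        // always, even if the leader is down
    if (liveBrokers.contains(newLeaderId))
      partition.leaderId = Some(newLeaderId)   // normal become-follower path
    // else: leader offline, abort the transition; the local replica (and
    // therefore its high-water-mark checkpoint entry) still exists.
  }

  def main(args: Array[String]): Unit = {
    val p0 = new Partition("t", 0, localBrokerId = 1)
    makeFollower(p0, newLeaderId = 0, liveBrokers = Set(1, 2))  // b0 is down
    println(p0.hasLocalReplica)  // true: p0 keeps its checkpoint entry
  }
}
{code}

With this ordering, re-running the scenario above would leave p0's entry in the 
checkpoint file even though b0 is still down.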

Possible symptoms of the above bug could be one or more of the following (we 
saw 2 and 3):
# Data loss: some data loss is expected on a hard kill, but this can actually 
cause loss of nearly all data for a partition if the broker becomes a follower, 
truncates, and soon after happens to become the leader.
# High IO on brokers that lose their high water mark and then, on a subsequent 
successful become-follower transition, truncate their log to zero and start 
catching up from the beginning.
# If the offsets topic is affected, then offsets can get reset. This is because 
during an offset load we don't read past the high water mark, so if the high 
water mark is missing we don't load anything (even if the offsets are there in 
the log). See the sketch after this list.
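
For symptom 3, a rough sketch of the reasoning with assumed, simplified names 
(the real offset-load path in the broker is more involved): commits are read 
only up to the partition's high water mark, so a high water mark that was lost 
and effectively reset to zero means nothing is loaded, even though the commit 
records are still in the log.

{code:scala}
// Hypothetical sketch: offset commits are loaded only up to the high water mark.
object OffsetLoadSketch {
  final case class OffsetCommit(group: String, topic: String, partition: Int, offset: Long)

  def loadOffsets(log: IndexedSeq[OffsetCommit], highWatermark: Long): Map[(String, String, Int), Long] =
    log.take(highWatermark.toInt)                     // never read past the HWM
      .map(c => (c.group, c.topic, c.partition) -> c.offset)
      .toMap

  def main(args: Array[String]): Unit = {
    val committed = Vector(
      OffsetCommit("group1", "t", 0, 42L),
      OffsetCommit("group1", "t", 1, 7L))
    // With the correct HWM both commits are loaded...
    println(loadOffsets(committed, highWatermark = committed.size))
    // ...but if the checkpointed HWM was lost (treated as 0), nothing is loaded
    // and the group's offsets appear to have been reset.
    println(loadOffsets(committed, highWatermark = 0L))
  }
}
{code}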



> Replication offset checkpoints (high water marks) can be lost on hard kills 
> and restarts
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1647
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1647
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Joel Koshy
>            Priority: Critical
>              Labels: newbie++
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
