[jira] [Commented] (KAFKA-3083) a soft failure in controller may leave a topic partition in an inconsistent state

Flavio Junqueira (JIRA) Thu, 21 Jan 2016 06:43:26 -0800

    [ 
https://issues.apache.org/jira/browse/KAFKA-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110680#comment-15110680
 ]


Flavio Junqueira commented on KAFKA-3083:
-----------------------------------------

bq. 1) We need to use a multi-op that combines the update to the ISR and a 
znode check. The znode check verifies that the version of the controller 
leadership znode is still the same and if it passes, then the ISR data is 
updated.

I was really just thinking out loud. The multiop is a hack to get around the 
fact that the controller broker doesn't know whether the underlying session has 
been recreated. The comment about using multiop was simply pointing out that 
you can check and update atomically with this multiop recipe. If we do this the 
right way, then we don't need to use a multiop call.
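
As a rough illustration of the check-and-update recipe, here is a minimal 
in-memory sketch. It is a toy model, not a ZooKeeper client: ToyStore, 
check_and_set and the paths "/leadership" and "/isr" are hypothetical names 
made up for the example, and a leadership change is assumed to show up as a 
version bump on the leadership znode. With the ZooKeeper Java client the 
equivalent would be a multi() call combining a check op and a setData op.

```python
class BadVersion(Exception):
    """Raised when the checked znode is not at the expected version."""


class ToyStore:
    """In-memory stand-in for ZooKeeper: path -> (data, version)."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        self.nodes[path] = (data, 0)

    def set_data(self, path, data):
        # Every write bumps the version of the znode.
        _, version = self.nodes[path]
        self.nodes[path] = (data, version + 1)

    def check_and_set(self, check_path, expected_version, set_path, data):
        """Update set_path only if check_path is still at expected_version.

        The check and the update succeed or fail together, which is the
        property the multiop recipe relies on.
        """
        _, version = self.nodes[check_path]
        if version != expected_version:
            raise BadVersion(check_path)
        self.set_data(set_path, data)


store = ToyStore()
store.create("/leadership", "A")      # hypothetical leadership znode
store.create("/isr", [1, 2, 3])       # hypothetical ISR znode
seen = store.nodes["/leadership"][1]  # version observed by controller A

# Assumption: B taking over changes the leadership znode version.
store.set_data("/leadership", "B")

# A's stale update is rejected because the version check fails.
try:
    store.check_and_set("/leadership", seen, "/isr", [1, 2])
    applied = True
except BadVersion:
    applied = False
```

In this sketch the stale update from A leaves the ISR data untouched, while 
a caller that passes the current leadership version can still update it.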

bq. 2) The race condition that Jun Rao mentioned still exists in 1) above.

It still exists, but the multiop would fail to perform the update on ZK if 
you're checking a version.

bq. 4) To do step 3), as Jun Rao suggested we have to detect the connection 
loss event.

There are two parts. Detecting connection loss is one of them: if the 
controller isn't sure about its session when it receives a connection loss 
event, then it should stop. The second part is not to create a new session 
immediately when the previous one has expired. If the session of A has expired, 
which must happen by step 2) because otherwise B can't be elected, then A isn't 
able to get requests completed on the expired session. Once B is elected, the 
session of A must have expired and no update coming from A will be executed. Of 
course, we want to bring broker A back up, and to do that we need to start a 
new session. However, before starting a new session, we need to make sure to 
stop any controller work in A.

bq. i) Broker A has connection loss and connects immediately in which case it 
gets a SyncConnected event. Now the session MIGHT NOT have expired since the 
connection happened immediately. Broker A is expected to continue since it is 
still the controller and the session has not expired. ii) Broker A has 
connection loss and connects back in which case it gets a SyncConnected event. 
Now the session MIGHT have expired. Broker A is expected to stop all the zk 
operations.

The broker will only get SyncConnected if it connects and is able to validate 
the session. If the session is invalid, then it gets an Expired notification. 
Note that if we are using SASL to authenticate, then we could also get an 
authenticated event (SaslAuthenticated).
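
The handling described in this comment can be sketched as a small state 
machine. This is only an illustration under stated assumptions: the event 
names mirror ZooKeeper's Disconnected, SyncConnected and Expired states, 
while ToyController and its fields are hypothetical and do not correspond to 
Kafka's controller code. The sketch assumes the broker starts out as the 
controller and that a new session never inherits leadership.

```python
from enum import Enum


class State(Enum):
    DISCONNECTED = "Disconnected"
    SYNC_CONNECTED = "SyncConnected"
    EXPIRED = "Expired"


class ToyController:
    """Hypothetical model of how a controller broker reacts to events."""

    def __init__(self):
        self.holds_leadership = True  # assumption: starts as controller
        self.working = True           # controller work currently running
        self.session = 1              # counter standing in for a session
        self.log = []

    def on_event(self, state):
        if state is State.DISCONNECTED:
            # Session status is unknown: stop controller work for now.
            self.working = False
            self.log.append("pause_controller_work")
        elif state is State.SYNC_CONNECTED:
            # The session was validated, so resume only if this broker
            # still holds leadership on that session.
            self.working = self.holds_leadership
            self.log.append("session_validated")
        elif state is State.EXPIRED:
            # Stop all controller work first, then start a new session.
            self.working = False
            self.holds_leadership = False
            self.log.append("stop_controller_work")
            self.session += 1
            self.log.append("start_new_session")


ctrl = ToyController()

# Case i): brief connection loss, session still valid.
ctrl.on_event(State.DISCONNECTED)
ctrl.on_event(State.SYNC_CONNECTED)
resumed_after_reconnect = ctrl.working

# Case ii): connection loss followed by session expiration.
ctrl.on_event(State.DISCONNECTED)
ctrl.on_event(State.EXPIRED)
ctrl.on_event(State.SYNC_CONNECTED)  # connected on the new session
```

The point of the ordering in the Expired branch is that no controller work 
is still running on broker A by the time its new session exists.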

> a soft failure in controller may leave a topic partition in an inconsistent 
> state
> ---------------------------------------------------------------------------------
>
>                 Key: KAFKA-3083
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3083
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0
>            Reporter: Jun Rao
>            Assignee: Mayuresh Gharat
>
> The following sequence can happen.
> 1. Broker A is the controller and is in the middle of processing a broker 
> change event. As part of this process, let's say it's about to shrink the isr 
> of a partition.
> 2. Then broker A's session expires and broker B takes over as the new 
> controller. Broker B sends the initial leaderAndIsr request to all brokers.
> 3. Broker A continues by shrinking the isr of the partition in ZK and sends 
> the new leaderAndIsr request to the broker (say C) that leads the partition. 
> Broker C will reject this leaderAndIsr since the request comes from a 
> controller with an older epoch. Now we could be in a situation that Broker C 
> thinks the isr has all replicas, but the isr stored in ZK is different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
