Hi Jun,

Thanks for your reply, I've got a few comments inline...
On 11 August 2017 at 01:51, Jun Rao <j...@confluent.io> wrote:
> Hi, Tom,
>
> Thanks for the KIP. Looks good overall. A few minor comments.
>
> 1. In most requests with topic partitions, we nest partitions inside topic.
> So, instead of [topic partition_id], we do [topic, [partition_id]] to save
> some space.

Of course, now sorted.

> 2. The preferred leader election just tries to move the leader to the
> preferred replica. This may not succeed if the preferred replica is not
> in-sync yet. It would be useful to return an error code to reflect that.

The algorithm the existing command uses is:

1. The command writes some JSON to the /admin/preferred_replica_election
   znode. The command then returns asynchronously. It only signals an error
   if the znode write failed.
2. The controller is subscribed to changes to this znode and delegates to the
   PartitionStateMachine to do the actual election.
3. The PreferredReplicaPartitionLeaderSelector is used and throws a
   StateChangeFailedException if the preferred leader is not in the ISR, or
   is not alive.
4. This exception will propagate to the ControllerEventManager, which will
   log it.

In the event of controller failover, the JSON is read from the znode and the
election proceeds in a similar manner.

So the central problem is the asynchronicity. I agree it would be better if
the return from the AdminClient were synchronous. I'm still learning about
how the internals of Kafka work. It's not enough for the AdminManager to wait
(with a timeout) for the metadata cache to show that the leader has been
elected, since we want to know the reason if the election is not possible. So
would an appropriate mechanism to pass this information back from the
KafkaController to the AdminManager be to use another znode, or is there
another way to do it? (There's a rough sketch of what I mean at the end of
this mail.) I've not found any precedents for this, so please point them out
if they exist.

> 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single
> threaded, is that just for timeout? If so, perhaps we can just return the
> Timeout error code.

Again, this pertains to the current algorithm for the election: when the
elections have been done, the znode is deleted. Thus the presence of the
znode indicates an election in progress, which causes the command to just
give up (the user can try again later). We still need the znode in order to
honour the existing elections in the event of controller failover part-way
through, so we need to do something in the event that the znode exists at the
time we want to write to it.

I suppose it is possible to append to the JSON in the znode while elections
are already taking place, with the controller reading from the head of the
JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a
recipe for this. But I don't see that pattern being used elsewhere in Kafka,
so I'm not sure if this is what you want.

It is not clear (to me at least) that the current restriction preventing
calling new elections while elections are ongoing is a problem in practice.
But since both you and Ewen have raised this issue I assume it does need to
be solved: I just need a bit more of a hint about whether there's a preferred
way to solve it.

> 4. Permission wise, it seems this should require ClusterAction on Cluster
> resource since it's changing things at the cluster level.

As I said in my reply to Ewen, I can see that point of view (and it's what I
originally specified), so I don't mind changing it back, but it would be good
to hear more opinions on this.

Cheers,

Tom
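P.S. To make the question under point 2 a bit more concrete, below is a very
rough sketch of the kind of wait I have in mind on the AdminManager side. It
uses the plain ZooKeeper client API, and the results znode path and its JSON
payload are invented purely for illustration; this is not part of the current
proposal, just an attempt to show the shape of the idea.

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    /**
     * Sketch only: the AdminManager blocks (with a timeout) on a
     * hypothetical "results" znode that the controller would write once the
     * election has either succeeded or failed, so the per-partition error
     * codes can be returned to the AdminClient caller.
     */
    public class ElectionResultWaiter {

        // Invented path; the existing command only uses
        // /admin/preferred_replica_election, which carries no result.
        private static final String RESULTS_PATH =
                "/admin/preferred_replica_election_results";

        private final ZooKeeper zk;

        public ElectionResultWaiter(ZooKeeper zk) {
            this.zk = zk;
        }

        /**
         * Returns the JSON written by the controller (e.g. per-partition
         * error codes), or null if nothing appeared before the timeout, in
         * which case the AdminManager would answer the request with a
         * Timeout error.
         */
        public String awaitResult(long timeout, TimeUnit unit)
                throws KeeperException, InterruptedException {
            CountDownLatch written = new CountDownLatch(1);
            Watcher watcher = (WatchedEvent event) -> {
                if (event.getType() == Watcher.Event.EventType.NodeCreated
                        || event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                    written.countDown();
                }
            };
            // Register the watch; the znode may already exist if the
            // controller finished before we got here.
            if (zk.exists(RESULTS_PATH, watcher) == null) {
                if (!written.await(timeout, unit)) {
                    return null; // timed out waiting for the controller
                }
            }
            byte[] data = zk.getData(RESULTS_PATH, false, null);
            return new String(data, StandardCharsets.UTF_8);
        }
    }

Whether another znode is really the right channel for this (as opposed to
some purely in-process mechanism when the broker happens to be the
controller, or a controller-to-broker response) is exactly the part I'd
appreciate guidance on.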