Hi Jun,

Thanks for the feedback.
I've updated the KIP to mention this new algorithm, which I agree will be much better from the AdminClient PoV. I've also reverted the authorization to be against the cluster resource, rather than the topic, since, on re-reading, Ewen wasn't particularly advocating that it should be topic-based, merely wondering if that made more sense in addition to the cluster-based check.

Cheers,

Tom

On 11 August 2017 at 18:13, Jun Rao <j...@confluent.io> wrote:

> Hi, Tom,
>
> 2. Yes, that's how manual preferred leader election currently works. I think we could use this opportunity to simplify the process a bit though. Doing preferred leader election is a lot cheaper than partition reassignment since it only involves writing the new leader to ZK. So, we probably don't need to persist the request in /admin/preferred_replica_election in ZK. I was thinking of the following steps.
> (a) The controller receives the PreferredLeaderElectionRequest.
> (b) The controller adds a PreferredReplicaLeaderElection event to the queue.
> (c) Wait up to the timeout for the PreferredReplicaLeaderElection to be processed by the controller and then send a response.
>
> If the controller changes and dies in the middle of this process, the client will get an error and has to reissue the request.
>
> Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than one in-flight preferred leader election request, in step (a) above, we can set a flag in the controller to indicate that a preferred leader election is in progress. The flag will be reset once the controller finishes processing the request. Until the flag is reset, any new PreferredLeaderElectionRequest will be responded to immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS error.
>
> Does this sound good to you?
>
> Thanks,
>
> Jun
>
> On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley <t.j.bent...@gmail.com> wrote:
>
> > Hi Jun,
> >
> > Thanks for your reply, I've got a few comments inline...
> >
> > On 11 August 2017 at 01:51, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Tom,
> > >
> > > Thanks for the KIP. Looks good overall. A few minor comments.
> > >
> > > 1. In most requests with topic partitions, we nest partitions inside topic. So, instead of [topic partition_id], we do [topic, [partition_id]] to save some space.
> >
> > Of course, now sorted.
> >
> > > 2. The preferred leader election just tries to move the leader to the preferred replica. This may not succeed if the preferred replica is not in-sync yet. It would be useful to return an error code to reflect that.
> >
> > The algorithm the existing command uses is:
> >
> > 1. The command writes some JSON to the /admin/preferred_replica_election znode. The command then returns asynchronously. It only signals an error if the znode write failed.
> > 2. The controller is subscribed to changes to this znode and delegates to the PartitionStateMachine to do the actual election.
> > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a StateChangeFailedException if the preferred leader is not in the ISR, or is not alive.
> > 4. This exception will propagate to the ControllerEventManager, which will log it.
> >
> > In the event of controller failover, the JSON is read from the znode and the election proceeds in a similar manner. So the central problem is the asynchronicity.
> >
> > I agree it would be better if the return from the AdminClient were synchronous.
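(Adding an aside in this reply, to make step 1 above concrete: the command-side part is just a single znode write. The sketch below is hypothetical — it uses the raw ZooKeeper Java client rather than Kafka's ZkUtils, the class and method names are made up, and the exact JSON schema is elided — but it shows why the only failure the caller can see is a ZK-level one, and why an already-existing znode is taken to mean an election is in progress.)

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical illustration of the command-side write; not the real implementation.
public class PreferredElectionTriggerSketch {

    private static final String ELECTION_PATH = "/admin/preferred_replica_election";

    // partitionsJson lists the partitions to elect, roughly shaped like
    // {"partitions":[{"topic":"foo","partition":0}]} (exact schema elided here).
    public static void trigger(ZooKeeper zk, String partitionsJson)
            throws KeeperException, InterruptedException {
        try {
            // Step 1: write the request and return; the controller's watch on this
            // path drives the actual election asynchronously (steps 2-4 above).
            zk.create(ELECTION_PATH,
                      partitionsJson.getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException e) {
            // The znode is only deleted once the controller has processed the
            // previous request, so its presence is what surfaces to the user as
            // an election already being in progress.
            throw e;
        }
    }
}

The controller side is the mirror image: the watch on that path fires, the data is read back, and the PartitionStateMachine performs the election, so no result ever flows back to the caller.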
> > I'm still learning about how the internals of Kafka work. It's not enough for the AdminManager to wait (with a timeout) for the metadata cache to show that the leader has been elected, since we want to know the reason if the election is not possible. So would an appropriate mechanism to pass this information back from the KafkaController to the AdminManager be to use another znode, or is there another way to do it? I've not found any precedents for this, so please point them out if they exist.
> >
> > > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single threaded, is that just for timeout? If so, perhaps we can just return the Timeout error code.
> >
> > Again, this pertains to the current algorithm for the election: when the elections have been done, the znode is deleted. Thus the presence of the znode indicates an election in progress, which causes the command to just give up (the user can try again later).
> >
> > We still need the znode in order to honour the existing elections in the event of controller failover part-way through, so we need to do something in the event that the znode exists at the time we want to write to it.
> >
> > I suppose it is possible to append to the JSON in the znode while elections are already taking place, with the controller reading from the head of the JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a recipe for this. But I don't see that pattern being used elsewhere in Kafka, so I'm not sure if this is what you want.
> >
> > It is not clear (to me at least) that the current restriction preventing calling new elections while elections are ongoing is a problem in practice. But since both you and Ewen have raised this issue, I assume it does need to be solved: I just need a bit more of a hint about whether there's a preferred way to solve it.
> >
> > > 4. Permission-wise, it seems this should require ClusterAction on the Cluster resource since it's changing things at the cluster level.
> >
> > As I said in my reply to Ewen, I can see that point of view (and it's what I originally specified), so I don't mind changing it back, but it would be good to hear more opinions on this.
> >
> > Cheers,
> >
> > Tom
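P.S. Regarding the "append to the JSON in the znode and treat it like a queue" idea in my quoted mail above: I imagine the client side of that would be an optimistic read-modify-write on the znode, along these lines. Again, this is only a hypothetical sketch with made-up names, using the raw ZooKeeper client; the JSON merge itself is elided.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical illustration only; names are invented and the JSON merge is elided.
public class ElectionZnodeAppenderSketch {

    private static final String ELECTION_PATH = "/admin/preferred_replica_election";

    public static void appendPartitions(ZooKeeper zk, String newEntriesJson)
            throws KeeperException, InterruptedException {
        while (true) {
            try {
                // Read the current request list together with its znode version...
                Stat stat = new Stat();
                byte[] current = zk.getData(ELECTION_PATH, false, stat);
                String merged = mergeJson(new String(current, StandardCharsets.UTF_8),
                                          newEntriesJson);
                // ...and write it back only if nobody else changed it in between.
                zk.setData(ELECTION_PATH,
                           merged.getBytes(StandardCharsets.UTF_8),
                           stat.getVersion());
                return;
            } catch (KeeperException.NoNodeException e) {
                // No election in progress: fall back to a plain create.
                try {
                    zk.create(ELECTION_PATH,
                              newEntriesJson.getBytes(StandardCharsets.UTF_8),
                              ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.PERSISTENT);
                    return;
                } catch (KeeperException.NodeExistsException raced) {
                    // Lost a race with another writer; retry the read-modify-write.
                }
            } catch (KeeperException.BadVersionException e) {
                // The znode changed concurrently (controller or another client); retry.
            }
        }
    }

    // Unioning the "partitions" arrays of the two documents is left out of the sketch.
    private static String mergeJson(String current, String extra) {
        throw new UnsupportedOperationException("JSON merge elided in this sketch");
    }
}

The harder half is the controller, which would have to tolerate the znode's data changing while it is processing it (consuming from the "head" and rewriting the remainder), and that's the part I don't see any precedent for elsewhere in Kafka.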