Hi Jun,

Thanks for the feedback.

I've updated the KIP to mention this new algorithm, which I agree will be
much better from the AdminClient PoV.

I've also reverted the authorization to be against the cluster resource,
rather than the topic, since, on re-reading, Ewen wasn't particularly
advocating that it should be topic-based, merely wondering if that made
more sense in addition to the cluster-based check.

Cheers,

Tom

On 11 August 2017 at 18:13, Jun Rao <j...@confluent.io> wrote:

> Hi, Tom,
>
> 2. Yes, that's how manual preferred leader election currently works. I
> think we could use this opportunity to simplify the process a bit though.
> Doing preferred leader election is a lot cheaper than partition
> reassignment since it only involves writing the new leader to ZK. So, we
> probably don't need to persist the request in
> /admin/preferred_replica_election in ZK. I was thinking of the following
> steps.
> (a) The controller receives PreferredLeaderElectionRequest.
> (b) The controller adds a PreferredReplicaLeaderElection event to the
> queue.
> (c) Wait up to the timeout for the PreferredReplicaLeaderElection event to
> be processed by the controller and then send a response.
>
> If the controller changes or dies in the middle of this process, the
> client will get an error and has to reissue the request.
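>
> To make (a)-(c) concrete, here is a very rough sketch in Java; all class and
> method names are hypothetical, not the actual controller code:
>
>     import java.util.concurrent.*;
>
>     // Hypothetical sketch only: none of these names are real Kafka classes.
>     class PreferredLeaderElectionSketch {
>         enum ElectionError { NONE, REQUEST_TIMED_OUT }
>
>         // Stand-in for the single-threaded controller event queue.
>         private final ExecutorService controllerEventThread =
>             Executors.newSingleThreadExecutor();
>
>         // (a) receive the request, (b) enqueue an election event,
>         // (c) wait up to the timeout for the controller thread to process it.
>         ElectionError handle(String topic, int partition, long timeoutMs) {
>             Future<ElectionError> result =
>                 controllerEventThread.submit(() -> electPreferredLeader(topic, partition));
>             try {
>                 return result.get(timeoutMs, TimeUnit.MILLISECONDS);
>             } catch (TimeoutException | InterruptedException | ExecutionException e) {
>                 // The client sees an error and has to reissue the request.
>                 return ElectionError.REQUEST_TIMED_OUT;
>             }
>         }
>
>         private ElectionError electPreferredLeader(String topic, int partition) {
>             // Elided: move leadership to the preferred replica and write the
>             // new leader to ZK.
>             return ElectionError.NONE;
>         }
>     }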
>
> Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than 1
> inflight preferred leader election request, in step (a) above, we can set a
> flag in the controller to indicate that a preferred leader election is in
> progress. The flag will be reset once the controller finishes processing
> the request. Until the flag is reset, any new PreferredLeaderElectionRequest
> will be responded to immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS
> error.
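>
> Purely as an illustration of that guard (the names below are made up, not
> the actual controller code), it could be as simple as:
>
>     import java.util.concurrent.atomic.AtomicBoolean;
>
>     // Hypothetical sketch of the single-inflight guard on the controller.
>     class ElectionInProgressGuard {
>         private final AtomicBoolean electionInProgress = new AtomicBoolean(false);
>
>         // Called in step (a): returns false if another election is already
>         // inflight, in which case the broker answers immediately with
>         // REPLICA_LEADER_ELECTION_IN_PROGRESS.
>         boolean tryStart() {
>             return electionInProgress.compareAndSet(false, true);
>         }
>
>         // Called once the controller finishes processing the request.
>         void finish() {
>             electionInProgress.set(false);
>         }
>     }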
>
> Does this sound good to you?
>
> Thanks,
>
> Jun
>
> On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley <t.j.bent...@gmail.com>
> wrote:
>
> > Hi Jun,
> >
> > Thanks for your reply; I've got a few comments inline...
> >
> > On 11 August 2017 at 01:51, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Tom,
> > >
> > > Thanks for the KIP. Looks good overall. A few minor comments.
> > >
> > > 1. In most requests with topic partitions, we nest partitions inside
> > > topic. So, instead of [topic partition_id], we do [topic, [partition_id]]
> > > to save some space.
> > >
> >
> > Of course, now sorted.
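> >
> > (Purely for illustration, with made-up topic names, the nested form is
> > essentially a map from topic to a list of partition ids rather than a flat
> > list of (topic, partition_id) pairs:)
> >
> >     import java.util.Arrays;
> >     import java.util.HashMap;
> >     import java.util.List;
> >     import java.util.Map;
> >
> >     // Illustration only: the nested form sends each topic name once.
> >     public class NestedPartitionsExample {
> >         public static void main(String[] args) {
> >             // [topic, [partition_id]]: e.g. "foo" -> [0, 1, 2], "bar" -> [0]
> >             Map<String, List<Integer>> partitionsByTopic = new HashMap<>();
> >             partitionsByTopic.put("foo", Arrays.asList(0, 1, 2));
> >             partitionsByTopic.put("bar", Arrays.asList(0));
> >             System.out.println(partitionsByTopic);
> >         }
> >     }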
> >
> >
> > > 2. The preferred leader election just tries to move the leader to the
> > > preferred replica. This may not succeed if the preferred replica is not
> > > in-sync yet. It would be useful to return an error code to reflect that.
> > >
> >
> > The algorithm the existing command uses is:
> >
> > 1. The command writes some JSON to the /admin/preferred_replica_election
> > znode. The command then returns asynchronously. It only signals an error
> > if the znode write failed.
> > 2. The controller is subscribed to changes to this znode and delegates to
> > the PartitionStateMachine to do the actual election.
> > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a
> > StateChangeFailedException if the preferred leader is not in the ISR, or
> > is not alive.
> > 4. This exception will propagate to the ControllerEventManager which will
> > log it.
> >
> > In the event of controller failover, the JSON is read from the znode and
> > the election proceeds in a similar manner. So the central problem is the
> > asynchronicity.
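> >
> > Purely to illustrate step 1, a rough sketch of the znode write (the real
> > command goes through Kafka's ZK utilities, the exact JSON schema should be
> > checked against the code, and the connection string here is a placeholder):
> >
> >     import java.nio.charset.StandardCharsets;
> >
> >     import org.apache.zookeeper.CreateMode;
> >     import org.apache.zookeeper.ZooDefs;
> >     import org.apache.zookeeper.ZooKeeper;
> >
> >     public class PreferredElectionZnodeSketch {
> >         public static void main(String[] args) throws Exception {
> >             ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
> >             // Something like: the partitions for which to elect the preferred
> >             // leader (topic "foo", partition 0 is just an example).
> >             String json =
> >                 "{\"version\":1,\"partitions\":[{\"topic\":\"foo\",\"partition\":0}]}";
> >             // The command returns once this write succeeds; the controller's
> >             // watch on the znode then triggers the actual election asynchronously.
> >             zk.create("/admin/preferred_replica_election",
> >                       json.getBytes(StandardCharsets.UTF_8),
> >                       ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
> >             zk.close();
> >         }
> >     }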
> >
> > I agree it would be better if the return from the AdminClient were
> > synchronous. I'm still learning about how the internals of Kafka work. It's
> > not enough for the AdminManager to wait (with a timeout) for the metadata
> > cache to show that the leader has been elected, since we want to know
> > the reason if the election is not possible. So would an appropriate
> > mechanism to pass this information back from the KafkaController to the
> > AdminManager be to use another znode, or is there another way to do it?
> > I've not found any precedents for this, so please point them out if they
> > exist.
> >
> >
> > > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is
> > > single-threaded, is that just for timeout? If so, perhaps we can just
> > > return the Timeout error code.
> > >
> >
> > Again, this pertains to the current algorithm for the election: when the
> > elections have been done, the znode is deleted. Thus the presence of the
> > znode indicates an election in progress, which causes the command to just
> > give up (the user can try again later).
> >
> > We still need the znode in order to honour the existing elections in the
> > event of controller failover part-way through, thus we need to do something
> > in the event that the znode exists at the time we want to write to it.
> >
> > I suppose it is possible to append to the JSON in the znode while
> > elections are already taking place, with the controller reading from the
> > head of the JSON (thus treating the JSON like a queue), since the
> > ZooKeeper docs give a recipe for this. But I don't see that pattern being
> > used elsewhere in Kafka, so I'm not sure if this is what you want.
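> >
> > Just to sketch what that recipe-style approach might look like (this is
> > hypothetical, not something Kafka does today; the JSON handling is elided),
> > an append would be a read-modify-conditional-write on the znode:
> >
> >     import java.nio.charset.StandardCharsets;
> >
> >     import org.apache.zookeeper.KeeperException;
> >     import org.apache.zookeeper.ZooKeeper;
> >     import org.apache.zookeeper.data.Stat;
> >
> >     public class ZnodeAppendSketch {
> >         static void appendRequest(ZooKeeper zk, String path, String newEntryJson)
> >                 throws Exception {
> >             while (true) {
> >                 Stat stat = new Stat();
> >                 byte[] current = zk.getData(path, false, stat);
> >                 String updated = appendToJsonList(
> >                     new String(current, StandardCharsets.UTF_8), newEntryJson);
> >                 try {
> >                     // Conditional write: fails if the controller (or another
> >                     // client) changed the znode since we read it.
> >                     zk.setData(path, updated.getBytes(StandardCharsets.UTF_8),
> >                                stat.getVersion());
> >                     return;
> >                 } catch (KeeperException.BadVersionException e) {
> >                     // Someone else modified the znode; re-read and try again.
> >                 }
> >             }
> >         }
> >
> >         private static String appendToJsonList(String currentJson, String newEntryJson) {
> >             // Elided: parse the JSON list of partitions and append the new entry.
> >             return currentJson;
> >         }
> >     }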
> >
> > It is not clear (to me at least) that the current restriction preventing
> > calling new elections while elections are ongoing is a problem in
> > practice. But since both you and Ewen have raised this issue, I assume it
> > does need to be solved: I just need a bit more of a hint about whether
> > there's a preferred way to solve it.
> >
> >
> > > 4. Permission-wise, it seems this should require ClusterAction on the
> > > Cluster resource since it's changing things at the cluster level.
> > >
> >
> > As I said in my reply to Ewen, I can see that point of view (and it's
> > what I originally specified), so I don't mind changing it back, but it
> > would be good to hear more opinions on this.
> >
> > Cheers,
> >
> > Tom
> >
>
