Hi, Tom,

2. Yes, that's how manual preferred leader election currently works. I
think we could use this opportunity to simplify the process a bit though.
Doing preferred leader election is a lot cheaper than partition
reassignment since it only involves writing the new leader to ZK. So, we
probably don't need to persist the request in /admin/preferred_replica_election
in ZK. I was thinking of the following steps (sketched in code after the list):
(a) The controller receives PreferredLeaderElectionRequest.
(b) The controller adds a PreferredReplicaLeaderElection event to its event
queue.
(c) Wait up to the request timeout for the PreferredReplicaLeaderElection
event to be processed by the controller and then send a response.
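
Roughly, in Java (all class and method names below are made up for
illustration; the real controller classes will differ in detail):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PreferredLeaderElectionSketch {

    // Hypothetical event; its future is completed once the controller
    // thread has processed the election.
    static class PreferredReplicaLeaderElectionEvent {
        final CompletableFuture<Short> result = new CompletableFuture<>();
    }

    private final BlockingQueue<PreferredReplicaLeaderElectionEvent> eventQueue =
            new LinkedBlockingQueue<>();

    // (a) + (b): the handler receives the request and enqueues the event;
    // (c): it then waits up to the request timeout for the controller thread
    // to complete the event and builds the response from the error code.
    short handleElectionRequest(long timeoutMs) throws InterruptedException {
        PreferredReplicaLeaderElectionEvent event = new PreferredReplicaLeaderElectionEvent();
        eventQueue.add(event);
        try {
            return event.result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return 7;  // placeholder for a timeout error code in this sketch
        } catch (ExecutionException e) {
            return -1; // placeholder for an unknown error in this sketch
        }
    }

    // The single controller thread drains the queue, attempts the leader
    // move (a ZK write in the real controller) and completes the future.
    void controllerLoop() throws InterruptedException {
        while (true) {
            PreferredReplicaLeaderElectionEvent event = eventQueue.take();
            event.result.complete(electPreferredLeader());
        }
    }

    private short electPreferredLeader() {
        return 0; // 0 = no error in this sketch
    }
}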

If the controller changes or dies in the middle of this process, the client
will get an error and will have to reissue the request.

Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than one
in-flight preferred leader election request, in step (a) above we can set a
flag in the controller to indicate that a preferred leader election is in
progress. The flag will be reset once the controller finishes processing the
request. Until the flag is reset, any new PreferredLeaderElectionRequest will
be answered immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS error.
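
As a sketch (again, names made up for illustration), the flag could just be
an atomic boolean that step (a) sets and the controller clears when it is
done:

import java.util.concurrent.atomic.AtomicBoolean;

class ElectionInProgressGuard {
    private final AtomicBoolean electionInProgress = new AtomicBoolean(false);

    // Step (a): returns false if another election is still being processed,
    // in which case the broker immediately answers the new request with
    // REPLICA_LEADER_ELECTION_IN_PROGRESS.
    boolean tryStartElection() {
        return electionInProgress.compareAndSet(false, true);
    }

    // Called by the controller once it has finished processing the request
    // (successfully or not), so the next request can be accepted.
    void finishElection() {
        electionInProgress.set(false);
    }
}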

Does this sound good to you?

Thanks,

Jun

On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley <t.j.bent...@gmail.com> wrote:

> Hi Jun,
>
> Thanks for your reply, I've got a few comments inline...
>
> On 11 August 2017 at 01:51, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Tom,
> >
> > Thanks for the KIP. Looks good overall. A few minor comments.
> >
> > 1. In most requests with topic partitions, we nest partitions inside
> > topic. So, instead of [topic partition_id], we do [topic, [partition_id]]
> > to save some space.
> >
>
> Of course, now sorted.
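>
> Concretely, the partitions are now nested per topic, roughly like this
> (field names here are just for illustration, not the exact schema text):
>
> import java.util.List;
>
> // Illustrative only: one entry per topic, each carrying that topic's
> // partition ids, so the topic name is sent once per topic rather than
> // once per partition.
> class TopicPartitions {
>     String topic;
>     List<Integer> partitionIds;
> }
> // request body: a List<TopicPartitions> rather than a flat list of
> // (topic, partition_id) pairs.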
>
>
> > 2. The preferred leader election just tries to move the leader to the
> > preferred replica. This may not succeed if the preferred replica is not
> > in-sync yet. It would be useful to return an error code to reflect that.
> >
>
> The algorithm the existing command uses is:
>
> 1. The command writes some JSON to the /admin/preferred_replica_election
> znode (roughly as sketched after this list). The command then returns
> immediately; the election itself happens asynchronously. It only signals an
> error if the znode write failed.
> 2. The controller is subscribed to changes to this znode and delegates to
> the PartitionStateMachine to do the actual election.
> 3. The PreferredReplicaPartitionLeaderSelector is used and throws a
> StateChangeFailedException if the preferred leader is not in the ISR, or is
> not alive.
> 4. This exception will propagate to the ControllerEventManager which will
> log it.
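>
> For step 1, the write looks roughly like this (a sketch using the plain
> ZooKeeper Java client rather than Kafka's ZkUtils, and the JSON layout is
> from memory, so treat it as illustrative):
>
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
>
> public class PreferredElectionZnodeWrite {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
>         // The presence of this znode is what signals an in-progress
>         // election to the controller (and to other command invocations).
>         String json =
>             "{\"version\":1,\"partitions\":[{\"topic\":\"my-topic\",\"partition\":0}]}";
>         zk.create("/admin/preferred_replica_election",
>                   json.getBytes("UTF-8"),
>                   ZooDefs.Ids.OPEN_ACL_UNSAFE,
>                   CreateMode.PERSISTENT);
>         // The command returns at this point; the controller's watcher
>         // picks up the change and performs the election asynchronously.
>         zk.close();
>     }
> }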
>
> In the event of controller failover, the JSON is read from the znode and
> the election proceeds in a similar manner. So the central problem is the
> asynchronicity.
>
> I agree it would be better if the return from the AdminClient were
> synchronous. I'm still learning about how the internals of Kafka work. It's
> not enough for the AdminManager to wait (with a timeout) for the metadata
> cache to show that the leader has been elected, since we also want to know
> the reason if the election is not possible. So would an appropriate
> mechanism to pass this information back from the KafkaController to the
> AdminManager be to use another znode, or is there another way to do it?
> I've not found any precedents for this, so please point them out if they
> exist.
>
>
> > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is
> > single threaded, is that just for timeout? If so, perhaps we can just
> > return the Timeout error code.
> >
>
> Again, this pertains to the current algorithm for the election: once the
> elections have completed, the znode is deleted. Thus the presence of the
> znode indicates an election in progress, which causes the command to just
> give up (the user can try again later).
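>
> In code, that check amounts to something like the following (a sketch with
> the plain ZooKeeper client; the names are mine, not the actual tool's):
>
> import org.apache.zookeeper.ZooKeeper;
>
> class ElectionAlreadyInProgressCheck {
>     // A non-null Stat means the znode exists, i.e. the controller has not
>     // yet finished (and deleted) a previous election request, so the
>     // command gives up and asks the user to retry later.
>     static boolean electionInProgress(ZooKeeper zk) throws Exception {
>         return zk.exists("/admin/preferred_replica_election", false) != null;
>     }
> }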
>
> We still need the znode in order to honour the existing elections in the
> event of a controller failover part-way through, so we need to do something
> when the znode already exists at the time we want to write to it.
>
> I suppose it is possible to append to the JSON in the znode while elections
> are already taking place, with the controller reading from the head of the
> JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a
> recipe for this. But I don't see that pattern being used elsewhere in
> Kafka, so I'm not sure if this is what you want.
>
> It is not clear (to me at least) that the current restriction, which
> prevents requesting new elections while elections are ongoing, is a problem
> in practice. But since both you and Ewen have raised this issue I assume it
> does need to be solved; I just need a bit more of a hint about whether
> there's a preferred way to solve it.
>
>
> > 4. Permission wise, it seems this should require ClusterAction on Cluster
> > resource since it's changing things at the cluster level.
> >
>
> As I said in my reply to Ewen, I can see that point of view (and it's what
> I originally specified), so I don't mind changing it back, but it would be
> good to hear more opinions on this.
>
> Cheers,
>
> Tom
>
