Hi Jun,

Thanks for your reply, I've got a few comments inline...
On 11 August 2017 at 01:51, Jun Rao <j...@confluent.io> wrote:
> Hi, Tom,
>
> Thanks for the KIP. Looks good overall. A few minor comments.
>
> 1. In most requests with topic partitions, we nest partitions inside topic.
> So, instead of [topic partition_id], we do [topic, [partition_id]] to save
> some space.

Of course, now sorted.

> 2. The preferred leader election just tries to move the leader to the
> preferred replica. This may not succeed if the preferred replica is not
> in-sync yet. It would be useful to return an error code to reflect that.

The algorithm the existing command uses is:

1. The command writes some JSON to the /admin/preferred_replica_election
   znode. The command then returns asynchronously. It only signals an error
   if the znode write failed.
2. The controller is subscribed to changes to this znode and delegates to the
   PartitionStateMachine to do the actual election.
3. The PreferredReplicaPartitionLeaderSelector is used and throws a
   StateChangeFailedException if the preferred leader is not in the ISR, or
   is not alive.
4. This exception will propagate to the ControllerEventManager, which will
   log it.

In the event of controller failover, the JSON is read from the znode and the
election proceeds in a similar manner.

So the central problem is the asynchronicity. I agree it would be better if
the return from the AdminClient were synchronous. I'm still learning about
how the internals of Kafka work. It's not enough for the AdminManager to wait
(with a timeout) for the metadata cache to show that the leader has been
elected, since we want to know the reason if the election is not possible. So
would an appropriate mechanism to pass this information back from the
KafkaController to the AdminManager be to use another znode, or is there
another way to do it? (There's a rough sketch of what I mean at the end of
this mail.) I've not found any precedents for this, so please point them out
if they exist.

> 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single
> threaded, is that just for timeout? If so, perhaps we can just return the
> Timeout error code.

Again, this pertains to the current algorithm for the election: when the
elections have been done, the znode is deleted. Thus the presence of the
znode indicates an election in progress, which causes the command to just
give up (the user can try again later). We still need the znode in order to
honour the existing elections in the event of controller failover part-way
through, so we need to do something in the event that the znode exists at the
time we want to write to it.

I suppose it is possible to append to the JSON in the znode while elections
are already taking place, with the controller reading from the head of the
JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a
recipe for this. But I don't see that pattern being used elsewhere in Kafka,
so I'm not sure if this is what you want.

It is not clear (to me at least) that the current restriction preventing
calling new elections while elections are ongoing is a problem in practice.
But since both you and Ewen have raised this issue I assume it does need to
be solved: I just need a bit more of a hint about whether there's a preferred
way to solve it.

> 4. Permission wise, it seems this should require ClusterAction on Cluster
> resource since it's changing things at the cluster level.

As I said in my reply to Ewen, I can see that point of view (and it's what I
originally specified), so I don't mind changing it back, but it would be good
to hear more opinions on this.

Cheers,

Tom
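P.S. To make the question under point 2 a bit more concrete, below is a very
rough sketch of the kind of wait I have in mind on the AdminManager side. It
uses the plain ZooKeeper client API, and the results znode path and its JSON
payload are invented purely for illustration; this is not part of the current
proposal, just an attempt to show the shape of the idea.

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    /**
     * Sketch only: the AdminManager blocks (with a timeout) on a
     * hypothetical "results" znode that the controller would write once the
     * election has either succeeded or failed, so the per-partition error
     * codes can be returned to the AdminClient caller.
     */
    public class ElectionResultWaiter {

        // Invented path; the existing command only uses
        // /admin/preferred_replica_election, which carries no result.
        private static final String RESULTS_PATH =
                "/admin/preferred_replica_election_results";

        private final ZooKeeper zk;

        public ElectionResultWaiter(ZooKeeper zk) {
            this.zk = zk;
        }

        /**
         * Returns the JSON written by the controller (e.g. per-partition
         * error codes), or null if nothing appeared before the timeout, in
         * which case the AdminManager would answer the request with a
         * Timeout error.
         */
        public String awaitResult(long timeout, TimeUnit unit)
                throws KeeperException, InterruptedException {
            CountDownLatch written = new CountDownLatch(1);
            Watcher watcher = (WatchedEvent event) -> {
                if (event.getType() == Watcher.Event.EventType.NodeCreated
                        || event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                    written.countDown();
                }
            };
            // Register the watch; the znode may already exist if the
            // controller finished before we got here.
            if (zk.exists(RESULTS_PATH, watcher) == null) {
                if (!written.await(timeout, unit)) {
                    return null; // timed out waiting for the controller
                }
            }
            byte[] data = zk.getData(RESULTS_PATH, false, null);
            return new String(data, StandardCharsets.UTF_8);
        }
    }

Whether another znode is really the right channel for this (as opposed to
some purely in-process mechanism when the broker happens to be the
controller, or a controller-to-broker response) is exactly the part I'd
appreciate guidance on.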