Hi José,

Thanks for the KIP.
It's indeed a problem that we should handle it well!

Just a minor comment:
In the `PartitionChangeRecord` schema, you set the `isUnclean` field as
type boolean,
but the value (-1, 0, 1) are all int values. Did you mean int32 type or the
value should be updated?

Thank you.
Luke

On Wed, Jan 19, 2022 at 12:04 AM Colin McCabe <cmcc...@apache.org> wrote:

> Hi José,
>
> Thanks for the KIP.
>
> The KIP talks a bit about "recovery," which is a new term (as far as I
> know). If I understand correctly, this is a state that the partition enters
> into after an unclean leader election. I would suggest using a different
> term for this, since we already use the term "log recovery" to express the
> cleanup we do after an unclean broker restart. How about something like
> "leader cleansing"? Whatever term we pick, we should probably use it in all
> the metadata rather than is_unclean, etc.
>
> I think we also need to spell out a few other things about "cleansing"
> state. For example, if the controller needs to change the leader of the
> partition again, before cleansing completes, does that mean the partition
> is still unclean? I would guess yes. We should spell this out in the KIP.
>
> With regard to AlterIsr, it seems like we might end up in a scenario where
> the broker is on an old version that doesn't support the cleansing state,
> and the controller is on a new version that does. In that case, it seems
> like we could get stuck in the cleansing state for an arbitrary amount of
> time. We will not exit it until the broker sends an AlterIsr, but the
> broker may not send one of those ever (for example, if we go down to one
> uncleanly elected replica, and stay there).
>
> The most obvious way to resolve this would be to use the IBP /
> metadata.version to gate using this feature. So it sounds like we should
> add a few lines to the KIP indicating that we're doing that.
>
> I think the compatibility logic needs needs more thought. One example is,
> if we have some brokers in the cluster that support KIP-704 and some that
> don't, do we auto-clear the cleansing state when transitioning leadership
> uncleanly to an old broker? I think you probably have to. So we should
> discuss this in the KIP.
>
> best,
> Colin
>
>
> On Mon, Jan 17, 2022, at 19:27, José Armando García Sancio wrote:
> > Jose wrote:
> >> I'll update the KIP with this information. The leader will return
> >> "NOT_LEADER_OR_FOLLOWER" for any partition that is still recovering
> >> for Fetch, Produce, OffsetsForLeaderEpoch and DeleteRecords requests.
> >> This error type is retriable by the clients.
> >
> > I forgot to include ListOffset. Updated the KIP to include this
> information.
> >
> > Thanks
> > -Jose
>

Reply via email to