Thanks, Jason. Good feedback.

1. I was mostly referring to the fact that the ReplicaManager uses a background thread to send the ZK notification and it really has no visibility as to whether the ZK operation succeeded or not. We'll most likely want to continue using a background thread for batching purposes with the new RPC. Retries make sense as well.
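To make the batching-plus-retry idea concrete, here is a minimal sketch of how a broker-side background thread might queue offline-replica notifications, drain them into one batched request, and re-queue them if the controller doesn't acknowledge. All names (`ReplicaStateNotifier`, `replicaOffline`, etc.) are hypothetical and not from the KIP or the Kafka codebase; this is only an illustration of the semantics being discussed, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

// Hypothetical sketch: batch offline-replica notifications and retry the
// (assumed) AlterReplicaState RPC until the controller acknowledges it.
class ReplicaStateNotifier {
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

    // Called by the replica manager when a replica goes offline
    // (e.g. because of a log dir failure).
    void replicaOffline(String topicPartition) {
        pending.add(topicPartition);
    }

    // Drain everything queued so far into a single batched request payload.
    List<String> drainBatch() {
        List<String> batch = new ArrayList<>();
        String tp;
        while ((tp = pending.poll()) != null) {
            batch.add(tp);
        }
        return batch;
    }

    // Background-thread loop body: send the batch; on failure, re-queue the
    // partitions so the broker keeps retrying while it still leads replicas
    // in offline dirs. `send` stands in for the actual RPC call.
    boolean sendWithRetry(List<String> batch, Predicate<List<String>> send) {
        if (batch.isEmpty()) return true;
        if (send.test(batch)) return true;   // controller acknowledged
        batch.forEach(pending::add);         // retry on next loop iteration
        return false;
    }
}
```

Unlike the fire-and-forget ZK write, a failed send here leaves the partitions queued, so the batch is retried on the next loop iteration.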
2. Yes, I'll change that.

3. Thanks, I neglected to mention this. Indeed I was considering ControlledShutdown when originally thinking about this KIP. A Future Work section is a good idea, I'll add one.

On Tue, May 19, 2020 at 2:58 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi David,
>
> This looks good. I just have a few comments:
>
> 1. I'm not sure it's totally fair to describe the current notification mechanism as "best-effort." At least it guarantees that the controller will eventually see the event. In any case, I think we might want a stronger contract going forward. As long as the broker remains the leader for partitions in offline log directories, it seems like we should retry the AlterReplicaState requests.
>
> 2. Should we consider a new name for `UNKNOWN_REPLICA_EVENT_TYPE`? Maybe `UNKNOWN_REPLICA_STATE`?
>
> 3. Mostly an observation, but there is some overlap with this API and ControlledShutdown. From the controller's perspective, the intent is mostly the same. I guess we could treat a null array in the request as an intent to shut down all replicas if we wanted to try and converge these APIs. One of the differences is that ControlledShutdown is a synchronous API, but I think it would actually have been better as an asynchronous API, since historically we have run into problems with timeouts. Anyway, this is outside the scope of this KIP, but might be worth mentioning as "Future work" somewhere.
>
> Thanks,
> Jason
>
> On Mon, May 18, 2020 at 10:09 AM David Arthur <mum...@gmail.com> wrote:
>
> > I've updated the KIP with the feedback from this discussion
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > I'll send out the vote thread shortly.
> >
> > Thanks again,
> > David
> >
> > On Tue, May 5, 2020 at 10:34 AM Tom Bentley <tbent...@redhat.com> wrote:
> >
> > > Hi Colin,
> > >
> > > Yeah, that makes sense, thanks.
> > > I was thinking, longer term, that there are other benefits to having the log dir information available to the controller. For example, it would allow the possibility for CREATE_TOPIC requests to include the intended log dir for each replica. But that's obviously completely out of scope for this KIP.
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> > > On Mon, May 4, 2020 at 10:11 PM Colin McCabe <cmcc...@apache.org> wrote:
> > >
> > > > Hi Tom,
> > > >
> > > > As you said, the controller doesn't know about log directories, although individual brokers do. So the brokers do currently have to enumerate all the partitions that need to be removed to the controller explicitly. So this KIP isn't changing anything in that regard.
> > > >
> > > > The current flow is:
> > > > 1. ping the ZK back-channel
> > > > 2. the controller sends a full LeaderAndIsrRequest to the broker
> > > > 3. the broker sends a full response containing error codes for all partitions. Partitions on the failed storage have a nonzero error code; the others have 0.
> > > >
> > > > The new flow is:
> > > > 1. the broker sends an RPC with all the failed partitions
> > > >
> > > > So the new flow actually substantially reduces the amount of network traffic, since previously we sent a full LeaderAndIsrRequest containing all of the partitions. Now we just send the partitions in the failed storage directory. That could still be a lot, but it would certainly be only a fraction of what a full LeaderAndIsrRequest would contain.
> > > >
> > > > Sorry if I'm repeating stuff you already figured out, but I just wanted to be more clear about this (I found it confusing too until David explained it to me originally...)
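The new flow Colin describes can be sketched as follows: on a log dir failure, the broker enumerates only the partitions hosted in that directory and groups them by topic for a single request. The class and method names here are assumptions for illustration, not the actual request builder, and the `topic-partition` string format is just a stand-in for real TopicPartition objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of building the new-style request: only the partitions
// from the failed log dir are included, grouped by topic, rather than the
// controller round-tripping a full LeaderAndIsrRequest for all partitions.
class FailedDirRequestBuilder {
    // Map of topic -> partition ids, built from the failed log dir's contents.
    // Input strings are "topic-partition", e.g. "foo-0".
    static Map<String, List<Integer>> groupByTopic(List<String> failedPartitions) {
        Map<String, List<Integer>> byTopic = new TreeMap<>();
        for (String tp : failedPartitions) {
            // Topic names may themselves contain dashes, so split on the last one.
            int dash = tp.lastIndexOf('-');
            String topic = tp.substring(0, dash);
            int partition = Integer.parseInt(tp.substring(dash + 1));
            byTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(partition);
        }
        return byTopic;
    }
}
```

The resulting map mirrors the request's topic/partition nesting, and its size is bounded by the contents of the failed dir rather than the broker's full partition count.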
> > > > best,
> > > > Colin
> > > >
> > > > On Sat, May 2, 2020, at 10:30, Tom Bentley wrote:
> > > > > Hi David,
> > > > >
> > > > > > In the rejecting the alternative of having an RPC for log dir failures you say
> > > > > >
> > > > > > I guess what I really mean here is that I wanted to avoid exposing the notion of a log dir to the controller. I can update the description to reflect this.
> > > > >
> > > > > Ah, I think I see now. While each broker knows about its log dirs, this isn't something that's stored in ZooKeeper or known to the controller.
> > > > >
> > > > > > It's also not completely clear that the cost of having to enumerate all the partitions on a log dir was weighed against the perceived benefit of a more flexible RPC.
> > > > > >
> > > > > > The enumeration isn't strictly required. In the "RPC semantics" section, I mention that if no topics are present in the RPC request, then all topics on the broker are implied. And if a topic is given with no partitions, all partitions for that topic (on the broker) are implied. Does this make sense?
> > > > >
> > > > > So the no-topics-present optimisation wouldn't be available to a broker with >1 log dirs where only some of the log dirs failed. I don't suppose that's a problem though.
> > > > >
> > > > > Thanks again,
> > > > >
> > > > > Tom
> > > > >
> > > > > On Fri, May 1, 2020 at 5:48 PM David Arthur <mum...@gmail.com> wrote:
> > > > >
> > > > > > Jose/Colin/Tom, thanks for the feedback!
> > > > > >
> > > > > > > Partition level errors
> > > > > >
> > > > > > This was an oversight on my part, I meant to include these in the response RPC. I'll update that.
> > > > > > > INVALID_REQUEST
> > > > > >
> > > > > > I'll update this text description, that was a copy/paste leftover.
> > > > > >
> > > > > > > I think we should mention that the controller will keep its current implementation of marking the replicas as offline because of failure in the LeaderAndIsr response.
> > > > > >
> > > > > > Good suggestion, I'll add that.
> > > > > >
> > > > > > > Does EventType need to be an Int32?
> > > > > >
> > > > > > No, it doesn't. I'll update it to Int8. Do we have an example of the enum paradigm in our RPC today? I'm curious if we actually map it to a real Java enum in the AbstractRequest/Response classes.
> > > > > >
> > > > > > > AlterReplicaStates
> > > > > >
> > > > > > Sounds good to me.
> > > > > >
> > > > > > > In the rejecting the alternative of having an RPC for log dir failures you say
> > > > > >
> > > > > > I guess what I really mean here is that I wanted to avoid exposing the notion of a log dir to the controller. I can update the description to reflect this.
> > > > > >
> > > > > > > It's also not completely clear that the cost of having to enumerate all the partitions on a log dir was weighed against the perceived benefit of a more flexible RPC.
> > > > > >
> > > > > > The enumeration isn't strictly required. In the "RPC semantics" section, I mention that if no topics are present in the RPC request, then all topics on the broker are implied. And if a topic is given with no partitions, all partitions for that topic (on the broker) are implied. Does this make sense?
> > > > > >
> > > > > > Thanks again! I'll update the KIP and leave a message here once it's revised.
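For the Int8 enum paradigm being discussed, a common pattern (used, for example, by Kafka's Errors enum, which pairs each constant with a numeric code) looks roughly like the sketch below. The enum name, constants, and code values are all assumptions for illustration; the KIP defines the actual values.

```java
// Hypothetical sketch of mapping an Int8 wire field to a real Java enum:
// each constant carries a byte code, and an UNKNOWN fallback absorbs
// unrecognized values instead of throwing during deserialization.
enum ReplicaStateEventType {
    UNKNOWN((byte) -1),
    LOG_DIR_FAILURE((byte) 1),
    SHUTDOWN((byte) 2);

    private final byte code;

    ReplicaStateEventType(byte code) {
        this.code = code;
    }

    // The Int8 value written on the wire for this event type.
    byte code() {
        return code;
    }

    // Decode the Int8 read from the wire; falling back to UNKNOWN keeps
    // parsing tolerant of codes introduced by newer request versions.
    static ReplicaStateEventType fromCode(byte code) {
        for (ReplicaStateEventType t : values()) {
            if (t.code == code) return t;
        }
        return UNKNOWN;
    }
}
```

Since the code, not the enum ordinal, is what goes on the wire, constants can be reordered or added later without breaking compatibility.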
> > > > > > David
> > > > > >
> > > > > > On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <tbent...@redhat.com> wrote:
> > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > Thanks for the KIP!
> > > > > > >
> > > > > > > In the rejecting the alternative of having an RPC for log dir failures you say:
> > > > > > >
> > > > > > > > It was also rejected to prevent "leaking" the notion of a log dir to the public API.
> > > > > > >
> > > > > > > I'm not quite sure I follow that argument, since we already have RPCs for changing replica log dirs. So in a general sense log dirs already exist in the API. I suspect you were using public API to mean something more specific; could you elaborate?
> > > > > > >
> > > > > > > It's also not completely clear that the cost of having to enumerate all the partitions on a log dir was weighed against the perceived benefit of a more flexible RPC. (I'm sure it was, but it would be good to say so.)
> > > > > > >
> > > > > > > Many thanks,
> > > > > > >
> > > > > > > Tom
> > > > > > >
> > > > > > > On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > Thanks for the KIP!
> > > > > > > >
> > > > > > > > I think the ReplicaStateEventResponse should have a separate error code for each partition. Currently it just has one error code for the whole request/response, if I'm reading this right. I think Jose made a similar point as well. We should plan for scenarios where some replica states can be changed and some can't.
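Colin's per-partition error point can be sketched as follows: the response carries an error code per partition rather than a single top-level code, so the broker can see exactly which state changes failed and retry only those. The class, field names, and error code values below are illustrative assumptions, not the KIP's actual schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a response with partition-level error codes,
// making partial failures visible to the sending broker.
class AlterReplicaStatesResponse {
    static final short NONE = 0;
    static final short ILLUSTRATIVE_ERROR = 1; // stand-in, not a real Kafka code

    private final Map<String, Short> errorsByPartition = new HashMap<>();

    // Record the outcome for one partition ("topic-partition" string).
    void setError(String topicPartition, short errorCode) {
        errorsByPartition.put(topicPartition, errorCode);
    }

    // Partitions whose state change failed; the broker would retry these
    // rather than the whole batch.
    List<String> failedPartitions() {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Short> e : errorsByPartition.entrySet()) {
            if (e.getValue() != NONE) failed.add(e.getKey());
        }
        return failed;
    }
}
```

With a single top-level code, one bad partition would force the broker to retry (or give up on) the entire batch; per-partition codes avoid that.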
> > > > > > > > Does EventType need to be an Int32? For enums, we usually use the smallest reasonable type, which would be Int8 here. We can always change the schema later if needed. UNKNOWN_REPLICA_EVENT_TYPE seems unnecessary since INVALID_REQUEST covers this case.
> > > > > > > >
> > > > > > > > I'd also suggest "AlterReplicaStates[Request,Response]" as a slightly better name for this RPC.
> > > > > > > >
> > > > > > > > cheers,
> > > > > > > > Colin
> > > > > > > >
> > > > > > > > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > > > > > > > Hey everyone,
> > > > > > > > >
> > > > > > > > > I'd like to start the discussion for KIP-589, part of the KIP-500 effort
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > > > > > >
> > > > > > > > > This is a proposal to use a new RPC instead of ZooKeeper for notifying the controller of an offline replica. Please give it a read and let me know your thoughts.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > David
> > > > > >
> > > > > > --
> > > > > > David Arthur
> >
> > --
> > David Arthur

--
David Arthur