Hi Tom,

As you said, the controller doesn't know about log directories, although 
individual brokers do.  So the brokers currently do have to explicitly 
enumerate for the controller all the partitions that need to be removed.  
This KIP isn't changing anything in that regard.

The current flow is:
1. the broker pings the ZK back-channel
2. controller sends a full LeaderAndIsrRequest to the broker
3. the broker sends a full response containing error codes for all partitions.  
Partitions on the failed storage have a nonzero error code; the others have 0.

The new flow is:
1. the broker sends an RPC with all the failed partitions

So the new flow actually substantially reduces the amount of network traffic, 
since previously we sent a full LeaderAndIsrRequest containing all of the 
partitions.  Now we just send the partitions in the failed storage 
directory.  That could still be a lot, but it will certainly be only a 
fraction of what a full LeaderAndIsrRequest would contain.
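To make the difference concrete, here's a toy sketch of what each flow puts 
on the wire (the topic names, partition counts, and dir layout are made up 
for illustration; this isn't the real request classes):

```python
# Suppose a broker hosts 4 log dirs with 100 partitions each, and one
# dir fails.  (All names/numbers here are hypothetical.)
partitions_by_log_dir = {
    f"/data{d}": [f"topic-{d}-{p}" for p in range(100)] for d in range(4)
}
failed_dir = "/data2"

# Old flow: after the ZK ping, the controller sends a LeaderAndIsrRequest
# covering *every* partition on the broker, and the broker answers with a
# per-partition error code -- nonzero only for replicas on the failed dir.
KAFKA_STORAGE_ERROR = 56  # Kafka's storage error code
old_flow_response = {
    p: (KAFKA_STORAGE_ERROR if d == failed_dir else 0)
    for d, parts in partitions_by_log_dir.items()
    for p in parts
}

# New flow: the broker itself sends a single RPC naming only the failed
# partitions -- a fraction of the full partition set.
new_flow_request = sorted(partitions_by_log_dir[failed_dir])

print(len(old_flow_response))  # 400 partitions round-tripped
print(len(new_flow_request))   # 100 partitions, one direction
```

Even in this small example the new flow names a quarter of the partitions, 
once, instead of all of them in both directions.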

Sorry if I'm repeating stuff you already figured out, but I just wanted to be 
clearer about this (I found it confusing too until David explained it to me 
originally...)

best,
Colin


On Sat, May 2, 2020, at 10:30, Tom Bentley wrote:
> Hi David,
> 
> > In rejecting the alternative of having an RPC for log dir failures, you
> > say
> >
> > I guess what I really mean here is that I wanted to avoid exposing the
> > notion of a log dir to the controller. I can update the description to
> > reflect this.
> >
> 
> Ah, I think I see now. While each broker knows about its log dirs this
> isn't something that's stored in zookeeper or known to the controller.
> 
> 
> > > It's also not completely clear that the cost of having to enumerate all
> > > the partitions on a log dir was weighed against the perceived benefit of a
> > > more flexible RPC.
> >
> > The enumeration isn't strictly required. In the "RPC semantics" section, I
> > mention that if no topics are present in the RPC request, then all topics
> > on the broker are implied. And if a topic is given with no partitions, all
> > partitions for that topic (on the broker) are implied. Does this make
> > sense?
> >
> 
> So the no-topics-present optimisation wouldn't be available to a broker
> with >1 log dirs where only some of the log dirs failed. I don't suppose
> that's a problem though.
> 
> Thanks again,
> 
> Tom
> 
> 
> On Fri, May 1, 2020 at 5:48 PM David Arthur <mum...@gmail.com> wrote:
> 
> > Jose/Colin/Tom, thanks for the feedback!
> >
> > > Partition level errors
> >
> > This was an oversight on my part, I meant to include these in the response
> > RPC. I'll update that.
> >
> > > INVALID_REQUEST
> >
> > I'll update that text description; it was a copy/paste leftover.
> >
> > > I think we should mention that the controller will keep its current
> > implementation of marking the replicas as offline because of failure in the
> > LeaderAndIsr response.
> >
> > Good suggestions, I'll add that.
> >
> > > Does EventType need to be an Int32?
> >
> > No, it doesn't. I'll update to Int8. Do we have an example of the enum
> > paradigm in our RPC today? I'm curious if we actually map it to a real Java
> > enum in the AbstractRequest/Response classes.
> >
> > > AlterReplicaStates
> >
> > Sounds good to me.
> >
> > > In rejecting the alternative of having an RPC for log dir failures, you
> > > say
> >
> > I guess what I really mean here is that I wanted to avoid exposing the
> > notion of a log dir to the controller. I can update the description to
> > reflect this.
> >
> > > It's also not completely clear that the cost of having to enumerate all
> > > the partitions on a log dir was weighed against the perceived benefit of a
> > > more flexible RPC.
> >
> > The enumeration isn't strictly required. In the "RPC semantics" section, I
> > mention that if no topics are present in the RPC request, then all topics
> > on the broker are implied. And if a topic is given with no partitions, all
> > partitions for that topic (on the broker) are implied. Does this make
> > sense?
> >
> > Thanks again! I'll update the KIP and leave a message here once it's
> > revised.
> >
> > David
> >
> > On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <tbent...@redhat.com> wrote:
> >
> > > Hi David,
> > >
> > > Thanks for the KIP!
> > >
> > > In rejecting the alternative of having an RPC for log dir failures, you
> > > say:
> > >
> > > > It was also rejected to prevent "leaking" the notion of a log dir to the
> > > > public API.
> > > >
> > >
> > > I'm not quite sure I follow that argument, since we already have RPCs for
> > > changing replica log dirs. So in a general sense log dirs already exist in
> > > the API. I suspect you were using public API to mean something more
> > > specific; could you elaborate?
> > >
> > > It's also not completely clear that the cost of having to enumerate all
> > > the partitions on a log dir was weighed against the perceived benefit of a
> > > more flexible RPC. (I'm sure it was, but it would be good to say so).
> > >
> > > Many thanks,
> > >
> > > Tom
> > >
> > > On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cmcc...@apache.org> wrote:
> > >
> > > > Hi David,
> > > >
> > > > Thanks for the KIP!
> > > >
> > > > I think the ReplicaStateEventResponse should have a separate error code
> > > > for each partition.  Currently it just has one error code for the whole
> > > > request/response, if I'm reading this right.  I think Jose made a similar
> > > > point as well.  We should plan for scenarios where some replica states
> > > > can be changed and some can't.
> > > >
> > > > Does EventType need to be an Int32?  For enums, we usually use the
> > > > smallest reasonable type, which would be Int8 here.  We can always change
> > > > the schema later if needed.  UNKNOWN_REPLICA_EVENT_TYPE seems unnecessary
> > > > since INVALID_REQUEST covers this case.
> > > >
> > > > I'd also suggest "AlterReplicaStates[Request,Response]" as a slightly
> > > > better name for this RPC.
> > > >
> > > > cheers,
> > > > Colin
> > > >
> > > >
> > > > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > > > Hey everyone,
> > > > >
> > > > > I'd like to start the discussion for KIP-589, part of the KIP-500 effort
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > >
> > > > > This is a proposal to use a new RPC instead of ZooKeeper for notifying
> > > > > the controller of an offline replica. Please give it a read and let me
> > > > > know your thoughts.
> > > > >
> > > > > Thanks!
> > > > > David
> > > > >
> > > >
> > > >
> > >
> >
> >
> > --
> > David Arthur
> >
>
