Thanks, Jason. Good feedback.

1. I was mostly referring to the fact that the ReplicaManager uses a background thread to send the ZK notification and it really has no visibility as to whether the ZK operation succeeded or not. We'll most likely want to continue using a background thread for batching purposes with the new RPC. Retries make sense as well.
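To make the batching-plus-retry idea concrete, here is a minimal sketch of how a broker-side background thread might queue offline-replica notifications, drain them into one batched request, and re-queue them if the controller doesn't acknowledge. All names (`ReplicaStateNotifier`, `replicaOffline`, etc.) are hypothetical and not from the KIP or the Kafka codebase; this is only an illustration of the semantics being discussed, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

// Hypothetical sketch: batch offline-replica notifications and retry the
// (assumed) AlterReplicaState RPC until the controller acknowledges it.
class ReplicaStateNotifier {
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

    // Called by the replica manager when a replica goes offline
    // (e.g. because of a log dir failure).
    void replicaOffline(String topicPartition) {
        pending.add(topicPartition);
    }

    // Drain everything queued so far into a single batched request payload.
    List<String> drainBatch() {
        List<String> batch = new ArrayList<>();
        String tp;
        while ((tp = pending.poll()) != null) {
            batch.add(tp);
        }
        return batch;
    }

    // Background-thread loop body: send the batch; on failure, re-queue the
    // partitions so the broker keeps retrying while it still leads replicas
    // in offline dirs. `send` stands in for the actual RPC call.
    boolean sendWithRetry(List<String> batch, Predicate<List<String>> send) {
        if (batch.isEmpty()) return true;
        if (send.test(batch)) return true;   // controller acknowledged
        batch.forEach(pending::add);         // retry on next loop iteration
        return false;
    }
}
```

Unlike the fire-and-forget ZK write, a failed send here leaves the partitions queued, so the batch is retried on the next loop iteration.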
2. Yes, I'll change that.

3. Thanks, I neglected to mention this. Indeed I was considering ControlledShutdown when originally thinking about this KIP. A Future Work section is a good idea, I'll add one.

On Tue, May 19, 2020 at 2:58 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi David,
>
> This looks good. I just have a few comments:
>
> 1. I'm not sure it's totally fair to describe the current notification mechanism as "best-effort." At least it guarantees that the controller will eventually see the event. In any case, I think we might want a stronger contract going forward. As long as the broker remains the leader for partitions in offline log directories, it seems like we should retry the AlterReplicaState requests.
>
> 2. Should we consider a new name for `UNKNOWN_REPLICA_EVENT_TYPE`? Maybe `UNKNOWN_REPLICA_STATE`?
>
> 3. Mostly an observation, but there is some overlap with this API and ControlledShutdown. From the controller's perspective, the intent is mostly the same. I guess we could treat a null array in the request as an intent to shut down all replicas if we wanted to try and converge these APIs. One of the differences is that ControlledShutdown is a synchronous API, but I think it would actually have been better as an asynchronous API, since historically we have run into problems with timeouts. Anyway, this is outside the scope of this KIP, but might be worth mentioning as "Future work" somewhere.
>
> Thanks,
> Jason
>
> On Mon, May 18, 2020 at 10:09 AM David Arthur <mum...@gmail.com> wrote:
>
> > I've updated the KIP with the feedback from this discussion
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > I'll send out the vote thread shortly.
> >
> > Thanks again,
> > David
> >
> > On Tue, May 5, 2020 at 10:34 AM Tom Bentley <tbent...@redhat.com> wrote:
> >
> > > Hi Colin,
> > >
> > > Yeah, that makes sense, thanks.
> > > I was thinking, longer term, that there are other benefits to having the log dir information available to the controller. For example, it would allow the possibility for CREATE_TOPIC requests to include the intended log dir for each replica. But that's obviously completely out of scope for this KIP.
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> > > On Mon, May 4, 2020 at 10:11 PM Colin McCabe <cmcc...@apache.org> wrote:
> > >
> > > > Hi Tom,
> > > >
> > > > As you said, the controller doesn't know about log directories, although individual brokers do. So the brokers do currently have to enumerate all the partitions that need to be removed to the controller explicitly. So this KIP isn't changing anything in that regard.
> > > >
> > > > The current flow is:
> > > > 1. ping the ZK back-channel
> > > > 2. the controller sends a full LeaderAndIsrRequest to the broker
> > > > 3. the broker sends a full response containing error codes for all partitions. Partitions on the failed storage have a nonzero error code; the others have 0.
> > > >
> > > > The new flow is:
> > > > 1. the broker sends an RPC with all the failed partitions
> > > >
> > > > So the new flow actually substantially reduces the amount of network traffic, since previously we sent a full LeaderAndIsrRequest containing all of the partitions. Now we just send the partitions in the failed storage directory. That could still be a lot, but it would certainly be only a fraction of what a full LeaderAndIsrRequest would contain.
> > > >
> > > > Sorry if I'm repeating stuff you already figured out, but I just wanted to be more clear about this (I found it confusing too until David explained it to me originally...)
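The new flow Colin describes can be sketched as follows: on a log dir failure, the broker enumerates only the partitions hosted in that directory and groups them by topic for a single request. The class and method names here are assumptions for illustration, not the actual request builder, and the `topic-partition` string format is just a stand-in for real TopicPartition objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of building the new-style request: only the partitions
// from the failed log dir are included, grouped by topic, rather than the
// controller round-tripping a full LeaderAndIsrRequest for all partitions.
class FailedDirRequestBuilder {
    // Map of topic -> partition ids, built from the failed log dir's contents.
    // Input strings are "topic-partition", e.g. "foo-0".
    static Map<String, List<Integer>> groupByTopic(List<String> failedPartitions) {
        Map<String, List<Integer>> byTopic = new TreeMap<>();
        for (String tp : failedPartitions) {
            // Topic names may themselves contain dashes, so split on the last one.
            int dash = tp.lastIndexOf('-');
            String topic = tp.substring(0, dash);
            int partition = Integer.parseInt(tp.substring(dash + 1));
            byTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(partition);
        }
        return byTopic;
    }
}
```

The resulting map mirrors the request's topic/partition nesting, and its size is bounded by the contents of the failed dir rather than the broker's full partition count.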
> > > > best,
> > > > Colin
> > > >
> > > > On Sat, May 2, 2020, at 10:30, Tom Bentley wrote:
> > > > > Hi David,
> > > > >
> > > > > > In the rejecting the alternative of having an RPC for log dir failures you say
> > > > > >
> > > > > > I guess what I really mean here is that I wanted to avoid exposing the notion of a log dir to the controller. I can update the description to reflect this.
> > > > >
> > > > > Ah, I think I see now. While each broker knows about its log dirs, this isn't something that's stored in ZooKeeper or known to the controller.
> > > > >
> > > > > > It's also not completely clear that the cost of having to enumerate all the partitions on a log dir was weighed against the perceived benefit of a more flexible RPC.
> > > > > >
> > > > > > The enumeration isn't strictly required. In the "RPC semantics" section, I mention that if no topics are present in the RPC request, then all topics on the broker are implied. And if a topic is given with no partitions, all partitions for that topic (on the broker) are implied. Does this make sense?
> > > > >
> > > > > So the no-topics-present optimisation wouldn't be available to a broker with >1 log dirs where only some of the log dirs failed. I don't suppose that's a problem though.
> > > > >
> > > > > Thanks again,
> > > > >
> > > > > Tom
> > > > >
> > > > > On Fri, May 1, 2020 at 5:48 PM David Arthur <mum...@gmail.com> wrote:
> > > > >
> > > > > > Jose/Colin/Tom, thanks for the feedback!
> > > > > >
> > > > > > > Partition level errors
> > > > > >
> > > > > > This was an oversight on my part, I meant to include these in the response RPC. I'll update that.
> > > > > > > INVALID_REQUEST
> > > > > >
> > > > > > I'll update this text description, that was a copy/paste leftover.
> > > > > >
> > > > > > > I think we should mention that the controller will keep its current implementation of marking the replicas as offline because of failure in the LeaderAndIsr response.
> > > > > >
> > > > > > Good suggestion, I'll add that.
> > > > > >
> > > > > > > Does EventType need to be an Int32?
> > > > > >
> > > > > > No, it doesn't. I'll update it to Int8. Do we have an example of the enum paradigm in our RPC today? I'm curious if we actually map it to a real Java enum in the AbstractRequest/Response classes.
> > > > > >
> > > > > > > AlterReplicaStates
> > > > > >
> > > > > > Sounds good to me.
> > > > > >
> > > > > > > In the rejecting the alternative of having an RPC for log dir failures you say
> > > > > >
> > > > > > I guess what I really mean here is that I wanted to avoid exposing the notion of a log dir to the controller. I can update the description to reflect this.
> > > > > >
> > > > > > > It's also not completely clear that the cost of having to enumerate all the partitions on a log dir was weighed against the perceived benefit of a more flexible RPC.
> > > > > >
> > > > > > The enumeration isn't strictly required. In the "RPC semantics" section, I mention that if no topics are present in the RPC request, then all topics on the broker are implied. And if a topic is given with no partitions, all partitions for that topic (on the broker) are implied. Does this make sense?
> > > > > >
> > > > > > Thanks again! I'll update the KIP and leave a message here once it's revised.
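For the Int8 enum paradigm being discussed, a common pattern (used, for example, by Kafka's Errors enum, which pairs each constant with a numeric code) looks roughly like the sketch below. The enum name, constants, and code values are all assumptions for illustration; the KIP defines the actual values.

```java
// Hypothetical sketch of mapping an Int8 wire field to a real Java enum:
// each constant carries a byte code, and an UNKNOWN fallback absorbs
// unrecognized values instead of throwing during deserialization.
enum ReplicaStateEventType {
    UNKNOWN((byte) -1),
    LOG_DIR_FAILURE((byte) 1),
    SHUTDOWN((byte) 2);

    private final byte code;

    ReplicaStateEventType(byte code) {
        this.code = code;
    }

    // The Int8 value written on the wire for this event type.
    byte code() {
        return code;
    }

    // Decode the Int8 read from the wire; falling back to UNKNOWN keeps
    // parsing tolerant of codes introduced by newer request versions.
    static ReplicaStateEventType fromCode(byte code) {
        for (ReplicaStateEventType t : values()) {
            if (t.code == code) return t;
        }
        return UNKNOWN;
    }
}
```

Since the code, not the enum ordinal, is what goes on the wire, constants can be reordered or added later without breaking compatibility.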
> > > > > > David
> > > > > >
> > > > > > On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <tbent...@redhat.com> wrote:
> > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > Thanks for the KIP!
> > > > > > >
> > > > > > > In the rejecting the alternative of having an RPC for log dir failures you say:
> > > > > > >
> > > > > > > > It was also rejected to prevent "leaking" the notion of a log dir to the public API.
> > > > > > >
> > > > > > > I'm not quite sure I follow that argument, since we already have RPCs for changing replica log dirs. So in a general sense log dirs already exist in the API. I suspect you were using public API to mean something more specific; could you elaborate?
> > > > > > >
> > > > > > > It's also not completely clear that the cost of having to enumerate all the partitions on a log dir was weighed against the perceived benefit of a more flexible RPC. (I'm sure it was, but it would be good to say so.)
> > > > > > >
> > > > > > > Many thanks,
> > > > > > >
> > > > > > > Tom
> > > > > > >
> > > > > > > On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > Thanks for the KIP!
> > > > > > > >
> > > > > > > > I think the ReplicaStateEventResponse should have a separate error code for each partition. Currently it just has one error code for the whole request/response, if I'm reading this right. I think Jose made a similar point as well. We should plan for scenarios where some replica states can be changed and some can't.
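Colin's per-partition error point can be sketched as follows: the response carries an error code per partition rather than a single top-level code, so the broker can see exactly which state changes failed and retry only those. The class, field names, and error code values below are illustrative assumptions, not the KIP's actual schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a response with partition-level error codes,
// making partial failures visible to the sending broker.
class AlterReplicaStatesResponse {
    static final short NONE = 0;
    static final short ILLUSTRATIVE_ERROR = 1; // stand-in, not a real Kafka code

    private final Map<String, Short> errorsByPartition = new HashMap<>();

    // Record the outcome for one partition ("topic-partition" string).
    void setError(String topicPartition, short errorCode) {
        errorsByPartition.put(topicPartition, errorCode);
    }

    // Partitions whose state change failed; the broker would retry these
    // rather than the whole batch.
    List<String> failedPartitions() {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Short> e : errorsByPartition.entrySet()) {
            if (e.getValue() != NONE) failed.add(e.getKey());
        }
        return failed;
    }
}
```

With a single top-level code, one bad partition would force the broker to retry (or give up on) the entire batch; per-partition codes avoid that.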
> > > > > > > > Does EventType need to be an Int32? For enums, we usually use the smallest reasonable type, which would be Int8 here. We can always change the schema later if needed. UNKNOWN_REPLICA_EVENT_TYPE seems unnecessary since INVALID_REQUEST covers this case.
> > > > > > > >
> > > > > > > > I'd also suggest "AlterReplicaStates[Request,Response]" as a slightly better name for this RPC.
> > > > > > > >
> > > > > > > > cheers,
> > > > > > > > Colin
> > > > > > > >
> > > > > > > > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > > > > > > > Hey everyone,
> > > > > > > > >
> > > > > > > > > I'd like to start the discussion for KIP-589, part of the KIP-500 effort
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > > > > > >
> > > > > > > > > This is a proposal to use a new RPC instead of ZooKeeper for notifying the controller of an offline replica. Please give it a read and let me know your thoughts.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > David
> > > > > >
> > > > > > --
> > > > > > David Arthur
> >
> > --
> > David Arthur

--
David Arthur