Hey Colin & George,

Thinking about George's points, I was wondering whether it is feasible to submit a big reassignment to the controller, and thus to Zookeeper, since frequent writes there are slow as the quorum has to synchronize. Perhaps this should be the responsibility of KIP-435 <https://issues.apache.org/jira/browse/KIP-435>, but I'd like to note it here as we're changing the current znode layout in this KIP. Ideally I think we should add these writes to Zookeeper in batches and otherwise store them in a replicated internal topic (__partition_reassignments). That would solve the scalability problem, as a failed-over controller could read the pending reassignments back very quickly, and we would also spread the writes to Zookeeper over time. Only the currently active reassignments would be present under /brokers/topics/[topic]/partitions/[partitionId]/state, so the affected partitions would know whether they have to do a reassignment (even in the case of a broker bounce). The controller, on the other hand, could regain its state by reading back the last produced message from this __partition_reassignments topic and reading the Zookeeper state to figure out which batch it is currently working on (assuming it processes a given reassignment sequentially).

I'll think a little more about this to fill in any gaps and perhaps add it to my KIP. That being said, we should probably do some benchmarking first to see whether this bulk read/write is a problem at all, so we avoid premature optimisation. I'm generally not worried about reading this new information back, as the controller reads up the assignment anyway in initializeControllerContext().
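To make the idea a bit more concrete, here is a rough sketch in Java. Everything in it is hypothetical -- the record layout, class names and the batching helper are only illustrations of what I mean, not anything defined in KIP-455 or KIP-435:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical value carried by a record in the proposed __partition_reassignments
    // topic; the key could simply be "<topic>-<partition>". Keeping the original
    // replicas around would also help cancellation/rollback (KIP-236).
    class PartitionReassignmentRecord {
        final String topic;
        final int partition;
        final List<Integer> originalReplicas;
        final List<Integer> targetReplicas;

        PartitionReassignmentRecord(String topic, int partition,
                                    List<Integer> originalReplicas,
                                    List<Integer> targetReplicas) {
            this.topic = topic;
            this.partition = partition;
            this.originalReplicas = originalReplicas;
            this.targetReplicas = targetReplicas;
        }
    }

    class ReassignmentBatcher {
        // Splits the full reassignment into batches so that only the currently
        // active batch has to be written to (and later removed from) Zookeeper,
        // spreading the znode writes over time instead of doing one big write.
        static List<List<PartitionReassignmentRecord>> splitIntoBatches(
                List<PartitionReassignmentRecord> all, int batchSize) {
            List<List<PartitionReassignmentRecord>> batches = new ArrayList<>();
            for (int i = 0; i < all.size(); i += batchSize) {
                batches.add(new ArrayList<>(all.subList(i, Math.min(i + batchSize, all.size()))));
            }
            return batches;
        }
    }

A failed-over controller could then replay __partition_reassignments to rebuild the full plan and compare it against the (small) active batch in Zookeeper to figure out where to resume.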
A question on SubmitPartitionReassignmentsRequest and its connection with KIP-435 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-435>: would the list of topic-partitions have the same ordering on the broker side as on the client side? I think that would be an advantage, as the user would then know in which order the reassignment will be performed. It is especially useful for incrementalization, because the user could figure out which replicas will end up in the same batch (given they know the batch size). I've added a small illustrative sketch as a P.S. at the bottom of this mail, below the quoted thread.

Viktor

On Wed, May 1, 2019 at 8:33 AM George Li <sql_consult...@yahoo.com.invalid> wrote:

> Hi Colin,
>
> Thanks for KIP-455! yes. KIP-236, etc. will depend on it. It is the good
> direction to go for the RPC.
>
> Regarding storing the new reassignments & original replicas at the
> topic/partition level. I have some concerns when controller is failing
> over, and the scalability of scanning the active reassignments from ZK
> topic/partition level nodes. Please see my reply to Jason in the KIP-236
> thread.
>
> Once the decision is made where new reassignment and original replicas is
> stored, I will modify KIP-236 accordingly for how to cancel/rollback the
> reassignments.
>
> Thanks,
> George
>
>
> On Monday, April 15, 2019, 6:07:44 PM PDT, Colin McCabe <
> cmcc...@apache.org> wrote:
>
> Hi all,
>
> We've been having discussions on a few different KIPs (KIP-236, KIP-435,
> etc.) about what the Admin Client replica reassignment API should look
> like. The current API is really hard to extend and maintain, which is a
> big source of problems. I think it makes sense to have a KIP that
> establishes a clean API that we can use and extend going forward, so I
> posted KIP-455. Take a look. :)
>
> best,
> Colin
>
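P.S.: to illustrate the ordering question above with a toy example (the partition names and batch size are made up, nothing here is part of KIP-455): if the broker preserves the client-side ordering and the user knows the incremental batch size, they can predict which partitions will be moved together:

    import java.util.Arrays;
    import java.util.List;

    public class ReassignmentOrderingExample {
        public static void main(String[] args) {
            // Order in which the user submits the partitions in the request.
            List<String> submitted = Arrays.asList("t0-0", "t0-1", "t1-0", "t1-1", "t2-0");
            int batchSize = 2; // incremental batch size, e.g. as proposed by KIP-435

            // If the broker keeps this ordering, the batches are predictable:
            for (int i = 0; i < submitted.size(); i += batchSize) {
                List<String> batch =
                    submitted.subList(i, Math.min(i + batchSize, submitted.size()));
                System.out.println("batch " + (i / batchSize) + ": " + batch);
            }
            // Prints:
            // batch 0: [t0-0, t0-1]
            // batch 1: [t1-0, t1-1]
            // batch 2: [t2-0]
        }
    }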