Hi Colin,

Thanks for explaining all this, it makes sense.

Viktor

On Sun, May 5, 2019 at 8:18 AM Colin McCabe <cmcc...@apache.org> wrote:

> On Thu, May 2, 2019, at 09:35, Viktor Somogyi-Vass wrote:
> > Hey Colin & George,
> >
> > Thinking on George's points I was wondering if it's feasible to submit a
> > big reassignment to the controller and thus Zookeeper as frequent writes
> > are slow as the quorum has to synchronize. Perhaps it should be the
> > responsibility of KIP-435 <https://issues.apache.org/jira/browse/KIP-435>
> > but I'd like to note it here as we're changing the current znode layout
> > in this KIP.
>
> Hi Viktor,
>
> This is similar conceptually to if we lose a broker from the cluster. In
> that case, we have to remove that node from the ISR of all the partitions
> it has, which means updating O(partitions_on_node) znodes. It's also
> similar to completing a reassignment in the existing Kafka version, and
> updating the partition znodes to reflect new nodes joining the ISR for
> various partitions. While you are right that ZK is a low-bandwidth system,
> in general, writing to a few thousand znodes over the course of a second or
> two is OK.
>
> The existing reassignment znode requires the whole plan to fit within a
> single znode. The maximum znode size is 1 megabyte by default, and almost
> nobody reconfigures this. Assuming about 100 bytes per reassignment, we
> can't get many more than about 10,000 partitions in a reassignment today in
> any case. The current scalability bottleneck is much more on the side of
> "can Kafka actually handle a huge amount of extra traffic due to ongoing
> reassignments"?
>
> That does bring up a good point, though-- we may want to have a "maximum
> concurrent reassignments" to avoid a common scenario that happens now,
> where people accidentally submit a plan that's way too big. But this is
> not to protect ZooKeeper-- it is to protect the brokers.
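The single-znode estimate in the reply above can be checked with quick arithmetic. This is only a minimal sketch: the roughly 1 MB limit and the 100 bytes per reassignment are the approximate figures from the message, and the function name is purely illustrative.

```python
# Back-of-the-envelope check of the single-znode capacity estimate.
# Both constants are rough assumptions taken from the discussion,
# not measured values.
ZNODE_MAX_BYTES = 1024 * 1024     # default znode size limit, about 1 MB
BYTES_PER_REASSIGNMENT = 100      # approximate size of one partition entry


def max_partitions_in_plan(znode_max_bytes: int = ZNODE_MAX_BYTES,
                           bytes_per_entry: int = BYTES_PER_REASSIGNMENT) -> int:
    """Return how many partition entries fit in one znode under these assumptions."""
    return znode_max_bytes // bytes_per_entry


if __name__ == "__main__":
    print(max_partitions_in_plan())
```

Under these assumptions the result is on the order of 10,000 partitions, which matches the figure quoted in the message.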
>
> > I think ideally we should add these writes in batches to zookeeper and
> > otherwise store it in a replicated internal topic
> > (__partition_reassignments). That would solve the scalability problem as
> > the failover controller would be able to read it up very quickly and also
> > we would spread the writes in Zookeeper over time. Just the current,
> > actively replicated partitions should be present under
> > /brokers/topics/[topic]/partitions/[partitionId]/state, so those
> > partitions will know if they have to do reassignment (even in case of a
> > broker bounce). The controller on the other hand could regain its state by
> > reading up the last produced message from this __partition_reassignments
> > topic and reading up the Zookeeper state to figure out which batch it's
> > currently doing (supposing it goes sequentially in the given reassignment).
>
> As I wrote in my reply to the other email, this is not needed because
> we're not adding any controller startup overhead beyond what already
> exists. We do have some plans to optimize this, but it's outside the scope
> of this KIP.
>
> > I'll think a little bit more about this to fill out any gaps there are
> > and perhaps add it to my KIP. That being said probably we'll need to make
> > some benchmarking first if this bulk read-write causes a problem at all to
> > avoid premature optimisation. I generally don't really worry about reading
> > up this new information as the controller would read up the assignment
> > anyway in initializeControllerContext().
>
> Right, the controller will read those znodes on startup anyway.
>
> >
> > A question on SubmitPartitionReassignmentsRequest and its connection with
> > KIP-435 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-435>. Would
> > the list of topic-partitions have the same ordering on the client side as
> > well as the broker side?
> > I think it would be an advantage as the user would
> > know in which order the reassignment would be performed. I think it's
> > useful when it comes to incrementalization as they'd be able to figure out
> > what replicas will be in one batch (given they know about the batch size).
>
> The big advantage of doing batching on the controller is that the
> controller has more information about what is going on in the cluster. So
> it can schedule reassignments in a more optimal way. For instance, it can
> schedule reassignments so that the load is distributed evenly across
> nodes. This advantage is lost if we have to adhere to a rigid ordering
> that is set up in advance. We don't know exactly when anything will
> complete in any case. Just because one partition reassignment was started
> before another doesn't mean it will finish before another.
>
> Additionally, there may be multiple clients submitting assignments and
> multiple clients querying them. So I don't think ordering makes sense here.
>
> best,
> Colin
>
> >
> > Viktor
> >
> > On Wed, May 1, 2019 at 8:33 AM George Li <sql_consult...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Colin,
> > >
> > > Thanks for KIP-455! yes. KIP-236, etc. will depend on it. It is the good
> > > direction to go for the RP
> > >
> > > Regarding storing the new reassignments & original replicas at the
> > > topic/partition level. I have some concerns when controller is failing
> > > over, and the scalability of scanning the active reassignments from ZK
> > > topic/partition level nodes. Please see my reply to Jason in the KIP-236
> > > thread.
> > >
> > > Once the decision is made where new reassignment and original replicas is
> > > stored, I will modify KIP-236 accordingly for how to cancel/rollback the
> > > reassignments.
> > >
> > > Thanks,
> > > George
> > >
> > >
> > > On Monday, April 15, 2019, 6:07:44 PM PDT, Colin McCabe <
> > > cmcc...@apache.org> wrote:
> > >
> > > Hi all,
> > >
> > > We've been having discussions on a few different KIPs (KIP-236, KIP-435,
> > > etc.) about what the Admin Client replica reassignment API should look
> > > like. The current API is really hard to extend and maintain, which is a
> > > big source of problems. I think it makes sense to have a KIP that
> > > establishes a clean API that we can use and extend going forward, so I
> > > posted KIP-455. Take a look. :)
> > >
> > > best,
> > > Colin
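
To make the controller-side batching argument from this thread concrete, here is a small hypothetical sketch of a scheduler that honours a "maximum concurrent reassignments" cap and prefers partitions whose target brokers are least busy. Nothing here is Kafka's actual controller logic or part of KIP-455: the plan format, the load measure (number of in-flight reassignments touching a broker), and the name pick_next_batch are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical plan format: (topic, partition) -> target replica broker ids.
TopicPartition = Tuple[str, int]
Plan = Dict[TopicPartition, List[int]]


def pick_next_batch(pending: Plan, in_flight: Plan,
                    max_concurrent: int) -> List[TopicPartition]:
    """Greedily choose pending reassignments, preferring the least-busy brokers."""
    # Count how many in-flight reassignments already touch each broker.
    load = Counter(b for replicas in in_flight.values() for b in replicas)
    # Free slots left under the concurrency cap (never negative).
    slots = max(0, max_concurrent - len(in_flight))
    remaining = dict(pending)
    chosen: List[TopicPartition] = []
    while slots > 0 and remaining:
        # Pick the partition whose busiest target broker is least loaded;
        # ties are broken by topic-partition key so the result is deterministic.
        tp = min(remaining,
                 key=lambda k: (max((load[b] for b in remaining[k]), default=0), k))
        chosen.append(tp)
        # Account for the newly started reassignment before picking the next one.
        load.update(remaining.pop(tp))
        slots -= 1
    return chosen


if __name__ == "__main__":
    pending = {("a", 0): [1, 2], ("a", 1): [1, 3], ("b", 0): [4, 5]}
    in_flight = {("c", 0): [1, 2]}
    print(pick_next_batch(pending, in_flight, max_concurrent=3))
```

Because the choice depends on what is currently in flight, the order in which partitions start is not fixed by the submitted plan, which is the reason given above for not promising clients a rigid ordering.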