On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> Hi Colin,
>
> Thanks for your feedback. Comments below:
>
> > Even if you have a way of blacklisting an entire broker all at once, you
> > still would need to run a leader election for each partition where you
> > want to move the leader off of the blacklisted broker. So the operation is
> > still O(N) in that sense -- you have to do something per partition.
>
> For a failed broker that is swapped with an empty broker: when it comes up,
> it will not have any leadership, and we would like it to remain without
> leadership for a couple of hours or days. So there is no preferred leader
> election needed, which is what incurs the O(N) operation in this case.
> Putting it on the preferred leader blacklist would safeguard this broker
> from serving traffic during that time. Otherwise, if another broker fails
> (and this broker is 1st or 2nd in the assignment), or someone runs a
> preferred leader election, this new "empty" broker can still get
> leaderships.
>
> Also, running a reassignment to change the ordering of the preferred leader
> would not actually switch the leader automatically, e.g. (1,2,3) => (2,3,1),
> unless a preferred leader election is run to switch the current leader from
> 1 to 2. So the operation is at least 2 x O(N), and then, after the broker is
> back to normal, another 2 x O(N) to roll back.
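(For concreteness, the round trip George describes looks roughly like this
with the stock tooling. This is only a sketch: the topic "foo", partition 0,
broker ids 1/2/3, and the ZooKeeper connect string are placeholder
assumptions, and the exact flags vary across Kafka versions.)

  # reassign.json -- reorder the replicas so broker 1 is last:
  #   {"version":1,"partitions":[{"topic":"foo","partition":0,"replicas":[2,3,1]}]}
  bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
    --reassignment-json-file reassign.json --execute

  # The leader stays on broker 1 until a preferred leader election is run:
  #   election.json: {"partitions":[{"topic":"foo","partition":0}]}
  bin/kafka-preferred-replica-election.sh --zookeeper zk:2181 \
    --path-to-json-file election.json

  # Rolling back once the broker is healthy again means repeating both steps
  # with the original replica ordering, i.e. the "2 x O(N)" each way above.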
Hi George,

Hmm. I guess I'm still on the fence about this feature.

In your example, I think we're comparing apples and oranges. You started by
outlining a scenario where "an empty broker... comes up... [without] any
leadership[s]." But then you criticize using reassignment to switch the order
of preferred replicas because it "would not actually switch the leader
automatically." If the empty broker doesn't have any leaderships, there is
nothing to be switched, right?

> > In general, reassignment will get a lot easier and quicker once KIP-455 is
> > implemented. Reassignments that just change the order of preferred
> > replicas for a specific partition should complete pretty much instantly.
> >
> > I think it's simpler and easier just to have one source of truth for what
> > the preferred replica is for a partition, rather than two. So for me,
> > the fact that the replica assignment ordering isn't changed is actually a
> > big disadvantage of this KIP. If you are a new user (or just an
> > existing user that didn't read all of the documentation) and you just look
> > at the replica assignment, you might be confused by why a particular
> > broker wasn't getting any leaderships, even though it appeared like it
> > should. More mechanisms mean more complexity for users and developers
> > most of the time.
>
> I would like to stress the point that running a reassignment to change the
> ordering of the replicas (putting a broker at the end of the partition
> assignment) is unnecessary work, because after some time the broker is
> caught up and can start serving traffic, and then we need to run
> reassignments again to "roll back" to the previous state. As I mentioned in
> KIP-491, this is just tedious work.

In general, using an external rebalancing tool like Cruise Control is a good
idea to keep things balanced without having to deal with manual rebalancing.
We expect more and more people who have a complex or large cluster will start
using tools like this.

However, if you choose to do manual rebalancing, it shouldn't be that bad.
You would save the existing partition ordering before making your changes,
then make your changes (perhaps by running a simple command line tool that
switches the order of the replicas). Then, once you felt like the broker was
ready to serve traffic, you could just re-apply the old ordering which you
had saved.

> I agree this might introduce some complexities for users/developers.
> But if this feature is good, and well documented, it is good for the
> Kafka product/community. Just like KIP-460, which enables unclean leader
> election to override the topic-level/broker-level config
> `unclean.leader.election.enable`.

> > I agree that it would be nice if we could treat some brokers differently
> > for the purposes of placing replicas, selecting leaders, etc. Right now,
> > we don't have any way of implementing that without forking the broker. I
> > would support a new PlacementPolicy class that would close this gap. But
> > I don't think this KIP is flexible enough to fill this role. For example,
> > it can't prevent users from creating new single-replica topics that get
> > put on the "bad" replica. Perhaps we should reopen the discussion about
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
>
> Creating a topic with a single replica is beyond what KIP-491 is trying to
> achieve. The user needs to take responsibility for doing that.
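(To make the save-and-re-apply workflow Colin describes above concrete, here
is a rough sketch with the existing tooling. The file names, broker list, and
ZooKeeper connect string are placeholder assumptions, and the --generate
output formatting varies slightly across versions.)

  # topics.json: {"version":1,"topics":[{"topic":"foo"}]}
  # --generate prints the current assignment followed by a proposed one;
  # keep the "Current partition replica assignment" JSON as the rollback plan.
  bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
    --topics-to-move-json-file topics.json --broker-list "1,2,3" \
    --generate | tee current-and-proposed.txt

  # ...apply a reordered assignment with --execute, wait until the broker
  # has caught up and is ready to serve traffic...

  # Then re-apply the saved ordering to restore the original preferred leaders.
  bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
    --reassignment-json-file saved-current-assignment.json --execute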
> I do see some Samza clients notoriously creating single-replica topics, and
> that got flagged by alerts, because a single broker being down or in
> maintenance will cause offline partitions. With the KIP-491 preferred leader
> "blacklist", a single-replica partition will still have its replica serve as
> leader, because there is no other alternative replica to be chosen as
> leader.
>
> Even with a new PlacementPolicy for topic creation/partition expansion, it
> still needs the blacklist info (e.g. a ZK path/node, or a broker-level/
> topic-level config) to "blacklist" the broker from being preferred leader?
> Would it be the same as what KIP-491 is introducing?

I was thinking about a PlacementPolicy filling the role of preventing people
from creating single-replica partitions on a node that we didn't want to ever
be the leader. I thought that it could also prevent people from designating
those nodes as preferred leaders during topic creation, or Kafka from doing it
during random topic creation. I was assuming that the PlacementPolicy would
determine which nodes were which through static configuration keys. I agree
static configuration keys are somewhat less flexible than dynamic
configuration.

best,
Colin

> Thanks,
> George
>
> On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> <cmcc...@apache.org> wrote:
>
> On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > Hi Colin,
> >
> > Thanks for looking into this KIP. Sorry for the late response, been busy.
> >
> > If a cluster has MANY topic partitions, moving this "blacklist" broker
> > to the end of the replica list is still a rather "big" operation,
> > involving submitting reassignments. The KIP-491 way of blacklisting is
> > much simpler/easier and can be undone easily without changing the replica
> > assignment ordering.
>
> Hi George,
>
> Even if you have a way of blacklisting an entire broker all at once,
> you still would need to run a leader election for each partition where
> you want to move the leader off of the blacklisted broker. So the
> operation is still O(N) in that sense -- you have to do something per
> partition.
>
> In general, reassignment will get a lot easier and quicker once KIP-455
> is implemented. Reassignments that just change the order of preferred
> replicas for a specific partition should complete pretty much instantly.
>
> I think it's simpler and easier just to have one source of truth for
> what the preferred replica is for a partition, rather than two. So for
> me, the fact that the replica assignment ordering isn't changed is
> actually a big disadvantage of this KIP. If you are a new user (or
> just an existing user that didn't read all of the documentation) and
> you just look at the replica assignment, you might be confused by why a
> particular broker wasn't getting any leaderships, even though it
> appeared like it should. More mechanisms mean more complexity for
> users and developers most of the time.
>
> > Major use case for me: a failed broker got swapped with new hardware,
> > and starts up as empty (with the latest offsets of all partitions). The
> > SLA of retention is 1 day, so before this broker has been in-sync for 1
> > day, we would like to blacklist this broker from serving traffic. After
> > 1 day, the blacklist is removed and preferred leader election is run.
> > This way, there is no need to run reassignments before/after. This is
> > the "temporary" use-case.
>
> What if we just add an option to the reassignment tool to generate a
> plan to move all the leaders off of a specific broker? The tool could
> also run a leader election as well.
> That would be a simple way of doing this without adding new mechanisms or
> broker-side configurations, etc.
>
> > There are use-cases where this Preferred Leader "blacklist" can be
> > somewhat permanent, as I explained for the AWS data center instances vs.
> > on-premises data center bare metal machines (heterogeneous hardware):
> > the AWS broker_ids will be blacklisted. So new topics created, or
> > existing topic expansion, would not make them serve traffic even if they
> > could be the preferred leader.
>
> I agree that it would be nice if we could treat some brokers
> differently for the purposes of placing replicas, selecting leaders,
> etc. Right now, we don't have any way of implementing that without
> forking the broker. I would support a new PlacementPolicy class that
> would close this gap. But I don't think this KIP is flexible enough to
> fill this role. For example, it can't prevent users from creating new
> single-replica topics that get put on the "bad" replica. Perhaps we
> should reopen the discussion about
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
>
> regards,
> Colin
>
> > Please let me know if there are more questions.
> >
> > Thanks,
> > George
> >
> > On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > <cmcc...@apache.org> wrote:
> >
> > We still want to give the "blacklisted" broker the leadership if
> > nobody else is available. Therefore, isn't putting a broker on the
> > blacklist pretty much the same as moving it to the last entry in the
> > replicas list and then triggering a preferred leader election?
> >
> > If we want this to be undone after a certain amount of time, or under
> > certain conditions, that seems like something that would be more
> > effectively done by an external system, rather than putting all these
> > policies into Kafka.
> >
> > best,
> > Colin
> >
> > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > Hi Satish,
> > >
> > > Thanks for the reviews and feedback.
> > >
> > > > The following is the requirements this KIP is trying to accomplish:
> > > >
> > > > This can be moved to the "Proposed changes" section.
> > >
> > > Updated the KIP-491.
> > >
> > > > >> The logic to determine the priority/order of which broker should
> > > > >> be preferred leader should be modified. The broker in the preferred
> > > > >> leader blacklist should be moved to the end (lowest priority) when
> > > > >> determining leadership.
> > > >
> > > > I believe there is no change required in the ordering of the preferred
> > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > until other brokers in the list are unavailable.
> > >
> > > Yes, the partition assignment remains the same, both replicas and
> > > ordering. The blacklist logic can be optimized during implementation.
> > >
> > > > >> The blacklist can be at the broker level. However, there might be
> > > > >> use cases where a specific topic should blacklist particular
> > > > >> brokers, which would be at the Topic level Config. For the use cases
> > > > >> of this KIP, it seems that a broker level blacklist would suffice.
> > > > >> Topic level preferred leader blacklist might be future enhancement
> > > > >> work.
> > > >
> > > > I agree that the broker level preferred leader blacklist would be
> > > > sufficient. Do you have any use cases which require a topic level
> > > > preferred blacklist?
> > >
> > > I don't have any concrete use cases for a Topic level preferred leader
> > > blacklist.
> > > One scenario I can think of is when a broker has high CPU usage: try to
> > > identify the big topics (high MsgIn, high BytesIn, etc.) and move their
> > > leaders away from this broker. Before doing an actual reassignment to
> > > change the preferred leaders, put this preferred_leader_blacklist in the
> > > Topic Level config, run preferred leader election, and see whether CPU
> > > decreases for this broker. If yes, then do the reassignments to make the
> > > preferred leader changes "permanent" (the topic may have many
> > > partitions, like 256, quite a few of which have this broker as the
> > > preferred leader). So this Topic Level config is an easy way of running
> > > a trial and checking the result.
> > >
> > > > You can add the below workaround as an item in the rejected
> > > > alternatives section:
> > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > replica for."
> > >
> > > Updated the KIP-491.
> > >
> > > Thanks,
> > > George
> > >
> > > On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > <satish.dugg...@gmail.com> wrote:
> > >
> > > Thanks for the KIP. I have put my comments below.
> > >
> > > This is a nice improvement to avoid cumbersome maintenance.
> > >
> > > >> The following is the requirements this KIP is trying to accomplish:
> > > >> The ability to add and remove the preferred leader deprioritized
> > > >> list/blacklist, e.g. a new ZK path/node or a new dynamic config.
> > >
> > > This can be moved to the "Proposed changes" section.
> > >
> > > >> The logic to determine the priority/order of which broker should be
> > > >> preferred leader should be modified. The broker in the preferred
> > > >> leader blacklist should be moved to the end (lowest priority) when
> > > >> determining leadership.
> > >
> > > I believe there is no change required in the ordering of the preferred
> > > replica list. Brokers in the preferred leader blacklist are skipped
> > > until other brokers in the list are unavailable.
> > >
> > > >> The blacklist can be at the broker level. However, there might be
> > > >> use cases where a specific topic should blacklist particular brokers,
> > > >> which would be at the Topic level Config. For the use cases of this
> > > >> KIP, it seems that a broker level blacklist would suffice. Topic
> > > >> level preferred leader blacklist might be future enhancement work.
> > >
> > > I agree that the broker level preferred leader blacklist would be
> > > sufficient. Do you have any use cases which require a topic level
> > > preferred blacklist?
> > >
> > > You can add the below workaround as an item in the rejected
> > > alternatives section:
> > > "Reassigning all the topic/partitions which the intended broker is a
> > > replica for."
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > <stanis...@confluent.io> wrote:
> > > >
> > > > Hey George,
> > > >
> > > > Thanks for the KIP, it's an interesting idea.
> > > >
> > > > I was wondering whether we could achieve the same thing via the
> > > > kafka-reassign-partitions tool. As you had also said in the JIRA, it
> > > > is true that this is currently very tedious with the tool. My
> > > > thoughts are that we could improve the tool and give it the notion of
> > > > a "blacklisted preferred leader".
> > > > This would have some benefits like:
> > > > - more fine-grained control over the blacklist.
> > > > We may not want to blacklist all the preferred leaders, as that would
> > > > make the blacklisted broker a follower of last resort, which is not
> > > > very useful. In the case of an underpowered AWS machine or a
> > > > controller, you might overshoot and make the broker very underutilized
> > > > if you completely make it leaderless.
> > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > rebalancing tools would also need to know about it and
> > > > manipulate/respect it to achieve a fair balance.
> > > > It seems like both problems are tied to balancing partitions, it's
> > > > just that KIP-491's use case wants to balance them against other
> > > > factors in a more nuanced way. It makes sense to have both be done
> > > > from the same place.
> > > >
> > > > To make note of the motivation section:
> > > > > Avoid bouncing broker in order to lose its leadership
> > > > The recommended way to make a broker lose its leadership is to run a
> > > > reassignment on its partitions.
> > > > > The cross-data center cluster has AWS cloud instances which have
> > > > > less computing power
> > > > We recommend running Kafka on homogeneous machines. It would be cool
> > > > if the system supported more flexibility in that regard, but that is
> > > > more nuanced, and a preferred leader blacklist may not be the best
> > > > first approach to the issue.
> > > >
> > > > Adding a new config which can fundamentally change the way replication
> > > > is done is complex, both for the system (the replication code is
> > > > complex enough) and the user. Users would have another potential
> > > > config that could backfire on them - e.g. if left forgotten.
> > > >
> > > > Could you think of any downsides to implementing this functionality
> > > > (or a variation of it) in the kafka-reassign-partitions.sh tool?
> > > > One downside I can see is that we would not have it handle new
> > > > partitions created after the "blacklist operation". As a first
> > > > iteration I think that may be acceptable.
> > > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > On Fri, Jul 19, 2019 at 3:20 AM George Li
> > > > <sql_consult...@yahoo.com.invalid> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Pinging the list for feedback on this KIP-491 (
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > )
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > > On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > sql_consult...@yahoo.com.INVALID> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I have created KIP-491 (
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > > for putting a broker on the preferred leader blacklist or
> > > > > deprioritized list, so that when determining leadership, it's moved
> > > > > to the lowest priority, for some of the listed use-cases.
> > > > >
> > > > > Please provide your comments/feedback.
> > > > > Thanks,
> > > > > George
> > > > >
> > > > > ----- Forwarded Message -----
> > > > > From: Jose Armando Garcia Sancio (JIRA) <j...@apache.org>
> > > > > To: "sql_consult...@yahoo.com" <sql_consult...@yahoo.com>
> > > > > Sent: Tuesday, July 9, 2019, 01:06:05 PM PDT
> > > > > Subject: [jira] [Commented] (KAFKA-8638) Preferred Leader Blacklist
> > > > > (deprioritized list)
> > > > >
> > > > > [ https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511 ]
> > > > >
> > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > ---------------------------------------------------
> > > > >
> > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > >
> > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > -----------------------------------------------
> > > > > >
> > > > > > Key: KAFKA-8638
> > > > > > URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > Project: Kafka
> > > > > > Issue Type: Improvement
> > > > > > Components: config, controller, core
> > > > > > Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > Reporter: GEORGE LI
> > > > > > Assignee: GEORGE LI
> > > > > > Priority: Major
> > > > > >
> > > > > > Currently, the Kafka preferred leader election will pick the
> > > > > > broker_id in the topic/partition replica assignment in priority
> > > > > > order when the broker is in the ISR. The preferred leader is the
> > > > > > broker id in the first position of the replica list. There are
> > > > > > use-cases where, even if the first broker in the replica assignment
> > > > > > is in the ISR, there is a need for it to be moved to the end of the
> > > > > > ordering (lowest priority) when deciding leadership during
> > > > > > preferred leader election.
> > > > > > Let's use topic/partition replicas (1,2,3) as an example. 1 is the
> > > > > > preferred leader. When preferred leader election is run, it will
> > > > > > pick 1 as the leader if it's in the ISR; if 1 is not online and in
> > > > > > the ISR, then pick 2; if 2 is not in the ISR, then pick 3 as the
> > > > > > leader. There are use cases where, even if 1 is in the ISR, we
> > > > > > would like it to be moved to the end of the ordering (lowest
> > > > > > priority) when deciding leadership during preferred leader
> > > > > > election. Below is a list of use cases:
> > > > > > * If broker_id 1 is a swapped failed host brought up with the last
> > > > > > segments or latest offsets, without historical data (there is
> > > > > > another effort on this), it's better for it to not serve leadership
> > > > > > till it's caught up.
> > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > > computing power than the on-prem bare metal machines. We could put
> > > > > > the AWS broker_ids in the Preferred Leader Blacklist, so on-prem
> > > > > > brokers can be elected leaders, without changing the reassignment
> > > > > > ordering of the replicas.
> > > > > > * If broker_id 1 is constantly losing leadership after some time
> > > > > > ("flapping"), we would want to exclude 1 from being a leader unless
> > > > > > all other brokers of this topic/partition are offline.
> > > > > > The "flapping" effect was seen in the past when 2 or more brokers
> > > > > > were bad: when they lost leadership constantly/quickly, the sets of
> > > > > > partition replicas they belonged to would see leadership constantly
> > > > > > changing. The ultimate solution is to swap out these bad hosts. But
> > > > > > for quick mitigation, we can also put the bad hosts in the
> > > > > > Preferred Leader Blacklist to move the priority of their being
> > > > > > elected as leaders to the lowest.
> > > > > > * If the controller is busy serving an extra load of metadata
> > > > > > requests and other tasks, we would like to move the controller's
> > > > > > leaders to other brokers to lower its CPU load. Currently, bouncing
> > > > > > a broker to make it lose leadership would not work for the
> > > > > > controller, because after the bounce, the controller fails over to
> > > > > > another broker.
> > > > > > * Avoid bouncing a broker in order to lose its leadership: it would
> > > > > > be good if we had a way to specify which broker should be excluded
> > > > > > from serving traffic/leadership (without changing the replica
> > > > > > assignment ordering by reassignments, even though that's quick),
> > > > > > and then run preferred leader election. A bouncing broker will
> > > > > > cause temporary URPs, and sometimes other issues. Also, a bounce of
> > > > > > a broker (e.g. broker_id 1) can temporarily lose all its
> > > > > > leadership, but if another broker (e.g. broker_id 2) fails or gets
> > > > > > bounced, some of its leaderships will likely fail over to broker_id
> > > > > > 1 for a partition with 3 replicas. If broker_id 1 is in the
> > > > > > blacklist, then in such a scenario, even with broker_id 2 offline,
> > > > > > the 3rd broker can take leadership.
> > > > > > The current work-around for the above is to change the
> > > > > > topic/partition's replica assignment to move broker_id 1 from the
> > > > > > first position to the last position and run preferred leader
> > > > > > election, e.g. (1, 2, 3) => (2, 3, 1). This changes the replica
> > > > > > assignments, and we need to keep track of the original one and
> > > > > > restore it if things change (e.g. the controller fails over to
> > > > > > another broker, or the swapped empty broker has caught up). That's
> > > > > > a rather tedious task.
> > > > >
> > > > > --
> > > > > This message was sent by Atlassian JIRA
> > > > > (v7.6.3#76005)