I agree with Colin that the same result should be achievable through proper abstraction in a tool. Even if that might be "4xO(N)" operations, that is still not a lot - it is still O(N).
> Let's say a healthy broker hosting 3000 partitions, and of which 1000 are
> the preferred leaders (leader count is 1000). There is a hardware failure
> (disk/memory, etc.), and kafka process crashed. We swap this host with
> another host but keep the same broker.id, when this new broker coming up,
> it has no historical data, and we manage to have the current last offsets
> of all partitions set in the replication-offset-checkpoint (if we don't set
> them, it could cause crazy ReplicaFetcher pulling of historical data from
> other brokers and cause cluster high latency and other instabilities), so
> when Kafka is brought up, it is quickly catching up as followers in the
> ISR. Note, we have auto.leader.rebalance.enable disabled, so it's not
> serving any traffic as leaders (leader count = 0), even there are 1000
> partitions that this broker is the Preferred Leader.
> We need to make this broker not serving traffic for a few hours or days
> depending on the SLA of the topic retention requirement until after it's
> having enough historical data.

This sounds like a bit of a hack. If that is the concern, why not propose a KIP that addresses the specific issue? Having a blacklist you control still seems like a workaround, given that Kafka itself knows when the topic retention would allow you to switch that replica to a leader.

I really hope we can come up with a solution that avoids complicating the controller and state machine logic further. Could you please list out the main drawbacks of abstracting this away in the reassignments tool (or a new tool)?

On Mon, Sep 9, 2019 at 7:53 AM Colin McCabe <cmcc...@apache.org> wrote:

> On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
> > Hi Colin,
> > Can you give us more details on why you don't want this to be
> > part of the Kafka core. You are proposing KIP-500 which will take away
> > zookeeper and writing this interim tools to change the zookeeper
> > metadata doesn't make sense to me.
>
> Hi Harsha,
>
> The reassignment API described in KIP-455, which will be part of Kafka
> 2.4, doesn't rely on ZooKeeper. This API will stay the same after KIP-500
> is implemented.
>
> > As George pointed out there are
> > several benefits having it in the system itself instead of asking users
> > to hack bunch of json files to deal with outage scenario.
>
> In both cases, the user just has to run a shell command, right? In both
> cases, the user has to remember to undo the command later when they want
> the broker to be treated normally again. And in both cases, the user
> should probably be running an external rebalancing tool to avoid having to
> run these commands manually. :)
>
> best,
> Colin
>
> >
> > Thanks,
> > Harsha
> >
> > On Fri, Sep 6, 2019 at 4:36 PM George Li <sql_consult...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Colin,
> > >
> > > Thanks for the feedback. The "separate set of metadata about blacklists"
> > > in KIP-491 is just the list of broker ids. Usually 1 or 2 or a couple in
> > > the cluster. Should be easier than keeping json files? e.g. what if we
> > > first blacklist broker_id_1, then another broker_id_2 has issues, and we
> > > need to write out another json file to restore later (and in which order)?
> > > Using blacklist, we can just add the broker_id_2 to the existing one. and
> > > remove whatever broker_id returning to good state without worrying how(the
> > > ordering of putting the broker to blacklist) to restore.
> > > > > > For topic level config, the blacklist will be tied to > > > topic/partition(e.g. Configs: > > > topic.preferred.leader.blacklist=0:101,102;1:103 where 0 & 1 is the > > > partition#, 101,102,103 are the blacklist broker_ids), and easier to > > > update/remove, no need for external json files? > > > > > > > > > Thanks, > > > George > > > > > > On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe < > > > cmcc...@apache.org> wrote: > > > > > > One possibility would be writing a new command-line tool that would > > > deprioritize a given replica using the new KIP-455 API. Then it could > > > write out a JSON files containing the old priorities, which could be > > > restored when (or if) we needed to do so. This seems like it might be > > > simpler and easier to maintain than a separate set of metadata about > > > blacklists. > > > > > > best, > > > Colin > > > > > > > > > On Fri, Sep 6, 2019, at 11:58, George Li wrote: > > > > Hi, > > > > > > > > Just want to ping and bubble up the discussion of KIP-491. > > > > > > > > On a large scale of Kafka clusters with thousands of brokers in many > > > > clusters. Frequent hardware failures are common, although the > > > > reassignments to change the preferred leaders is a workaround, it > > > > incurs unnecessary additional work than the proposed preferred leader > > > > blacklist in KIP-491, and hard to scale. > > > > > > > > I am wondering whether others using Kafka in a big scale running into > > > > same problem. > > > > > > > > > > > > Satish, > > > > > > > > Regarding your previous question about whether there is use-case for > > > > TopicLevel preferred leader "blacklist", I thought about one > > > > use-case: to improve rebalance/reassignment, the large partition > will > > > > usually cause performance/stability issues, planning to change the > say > > > > the New Replica will start with Leader's latest offset(this way the > > > > replica is almost instantly in the ISR and reassignment completed), > and > > > > put this partition's NewReplica into Preferred Leader "Blacklist" at > > > > the Topic Level config for that partition. After sometime(retention > > > > time), this new replica has caught up and ready to serve traffic, > > > > update/remove the TopicConfig for this partition's preferred leader > > > > blacklist. > > > > > > > > I will update the KIP-491 later for this use case of Topic Level > config > > > > for Preferred Leader Blacklist. > > > > > > > > > > > > Thanks, > > > > George > > > > > > > > On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li > > > > <sql_consult...@yahoo.com> wrote: > > > > > > > > Hi Colin, > > > > > > > > > In your example, I think we're comparing apples and oranges. You > > > started by outlining a scenario where "an empty broker... comes up... > > > [without] any > leadership[s]." But then you criticize using > reassignment > > > to switch the order of preferred replicas because it "would not > actually > > > switch the leader > automatically." If the empty broker doesn't have > any > > > leaderships, there is nothing to be switched, right? > > > > > > > > Let me explained in details of this particular use case example for > > > > comparing apples to apples. > > > > > > > > Let's say a healthy broker hosting 3000 partitions, and of which 1000 > > > > are the preferred leaders (leader count is 1000). There is a hardware > > > > failure (disk/memory, etc.), and kafka process crashed. 
We swap this > > > > host with another host but keep the same broker.id, when this new > > > > broker coming up, it has no historical data, and we manage to have > the > > > > current last offsets of all partitions set in > > > > the replication-offset-checkpoint (if we don't set them, it could > cause > > > > crazy ReplicaFetcher pulling of historical data from other brokers > and > > > > cause cluster high latency and other instabilities), so when Kafka is > > > > brought up, it is quickly catching up as followers in the ISR. Note, > > > > we have auto.leader.rebalance.enable disabled, so it's not serving > any > > > > traffic as leaders (leader count = 0), even there are 1000 partitions > > > > that this broker is the Preferred Leader. > > > > > > > > We need to make this broker not serving traffic for a few hours or > days > > > > depending on the SLA of the topic retention requirement until after > > > > it's having enough historical data. > > > > > > > > > > > > * The traditional way using the reassignments to move this broker in > > > > that 1000 partitions where it's the preferred leader to the end of > > > > assignment, this is O(N) operation. and from my experience, we can't > > > > submit all 1000 at the same time, otherwise cause higher latencies > even > > > > the reassignment in this case can complete almost instantly. After > a > > > > few hours/days whatever, this broker is ready to serve traffic, we > > > > have to run reassignments again to restore that 1000 partitions > > > > preferred leaders for this broker: O(N) operation. then run > preferred > > > > leader election O(N) again. So total 3 x O(N) operations. The point > > > > is since the new empty broker is expected to be the same as the old > one > > > > in terms of hosting partition/leaders, it would seem unnecessary to > do > > > > reassignments (ordering of replica) during the broker catching up > time. > > > > > > > > > > > > > > > > * The new feature Preferred Leader "Blacklist": just need to put a > > > > dynamic config to indicate that this broker should be considered > leader > > > > (preferred leader election or broker failover or unclean leader > > > > election) to the lowest priority. NO need to run any reassignments. > > > > After a few hours/days, when this broker is ready, remove the dynamic > > > > config, and run preferred leader election and this broker will serve > > > > traffic for that 1000 original partitions it was the preferred > leader. > > > > So total 1 x O(N) operation. > > > > > > > > > > > > If auto.leader.rebalance.enable is enabled, the Preferred Leader > > > > "Blacklist" can be put it before Kafka is started to prevent this > > > > broker serving traffic. In the traditional way of running > > > > reassignments, once the broker is up, > > > > with auto.leader.rebalance.enable , if leadership starts going to > this > > > > new empty broker, it might have to do preferred leader election after > > > > reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1) > > > > reassignment only change the ordering, 1 remains as the current > leader, > > > > and needs prefer leader election to change to 2 after reassignment. > so > > > > potentially one more O(N) operation. > > > > > > > > I hope the above example can show how easy to "blacklist" a broker > > > > serving leadership. For someone managing Production Kafka cluster, > > > > it's important to react fast to certain alerts and mitigate/resolve > > > > some issues. 
As I listed the other use cases in KIP-291, I think this > > > > feature can make the Kafka product more easier to manage/operate. > > > > > > > > > In general, using an external rebalancing tool like Cruise Control > is > > > a good idea to keep things balanced without having deal with manual > > > rebalancing. > We expect more and more people who have a complex or > large > > > cluster will start using tools like this. > > > > > > > > > > However, if you choose to do manual rebalancing, it shouldn't be > that > > > bad. You would save the existing partition ordering before making your > > > changes, then> make your changes (perhaps by running a simple command > line > > > tool that switches the order of the replicas). Then, once you felt > like > > > the broker was ready to> serve traffic, you could just re-apply the old > > > ordering which you had saved. > > > > > > > > > > > > We do have our own rebalancing tool which has its own criteria like > > > > Rack diversity, disk usage, spread partitions/leaders across all > > > > brokers in the cluster per topic, leadership Bytes/BytesIn served per > > > > broker, etc. We can run reassignments. The point is whether it's > > > > really necessary, and if there is more effective, easier, safer way > to > > > > do it. > > > > > > > > take another use case example of taking leadership out of busy > > > > Controller to give it more power to serve metadata requests and other > > > > work. The controller can failover, with the preferred leader > > > > "blacklist", it does not have to run reassignments again when > > > > controller failover, just change the blacklisted broker_id. > > > > > > > > > > > > > I was thinking about a PlacementPolicy filling the role of > preventing > > > people from creating single-replica partitions on a node that we didn't > > > want to > ever be the leader. I thought that it could also prevent > people > > > from designating those nodes as preferred leaders during topic > creation, or > > > Kafka from doing> itduring random topic creation. I was assuming that > the > > > PlacementPolicy would determine which nodes were which through static > > > configuration keys. I agree> static configuration keys are somewhat > less > > > flexible than dynamic configuration. > > > > > > > > > > > > I think single-replica partition might not be a good example. There > > > > should not be any single-replica partition at all. If yes. it's > > > > probably because of trying to save disk space with less replicas. I > > > > think at least minimum 2. The user purposely creating single-replica > > > > partition will take full responsibilities of data loss and > > > > unavailability when a broker fails or under maintenance. > > > > > > > > > > > > I think it would be better to use dynamic instead of static config. > I > > > > also think it would be better to have topic creation Policy enforced > in > > > > Kafka server OR an external service. We have an external/central > > > > service managing topic creation/partition expansion which takes into > > > > account of rack-diversity, replication factor (2, 3 or 4 depending on > > > > cluster/topic type), Policy replicating the topic between kafka > > > > clusters, etc. > > > > > > > > > > > > > > > > Thanks, > > > > George > > > > > > > > > > > > On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe > > > > <cmcc...@apache.org> wrote: > > > > > > > > On Wed, Aug 7, 2019, at 12:48, George Li wrote: > > > > > Hi Colin, > > > > > > > > > > Thanks for your feedbacks. 
Comments below: > > > > > > Even if you have a way of blacklisting an entire broker all at > once, > > > you still would need to run a leader election > for each partition > where > > > you want to move the leader off of the blacklisted broker. So the > > > operation is still O(N) in > that sense-- you have to do something per > > > partition. > > > > > > > > > > For a failed broker and swapped with an empty broker, when it comes > > > up, > > > > > it will not have any leadership, and we would like it to remain not > > > > > having leaderships for a couple of hours or days. So there is no > > > > > preferred leader election needed which incurs O(N) operation in > this > > > > > case. Putting the preferred leader blacklist would safe guard this > > > > > broker serving traffic during that time. otherwise, if another > broker > > > > > fails(if this broker is the 1st, 2nd in the assignment), or someone > > > > > runs preferred leader election, this new "empty" broker can still > get > > > > > leaderships. > > > > > > > > > > Also running reassignment to change the ordering of preferred > leader > > > > > would not actually switch the leader automatically. e.g. (1,2,3) > => > > > > > (2,3,1). unless preferred leader election is run to switch current > > > > > leader from 1 to 2. So the operation is at least 2 x O(N). and > then > > > > > after the broker is back to normal, another 2 x O(N) to rollback. > > > > > > > > Hi George, > > > > > > > > Hmm. I guess I'm still on the fence about this feature. > > > > > > > > In your example, I think we're comparing apples and oranges. You > > > > started by outlining a scenario where "an empty broker... comes up... > > > > [without] any leadership[s]." But then you criticize using > > > > reassignment to switch the order of preferred replicas because it > > > > "would not actually switch the leader automatically." If the empty > > > > broker doesn't have any leaderships, there is nothing to be switched, > > > > right? > > > > > > > > > > > > > > > > > > > > In general, reassignment will get a lot easier and quicker once > > > KIP-455 is implemented. > Reassignments that just change the order of > > > preferred replicas for a specific partition should complete pretty much > > > instantly. > > > > > >> I think it's simpler and easier just to have one source of truth > > > for what the preferred replica is for a partition, rather than two. So > > > for> me, the fact that the replica assignment ordering isn't changed is > > > actually a big disadvantage of this KIP. If you are a new user (or > just> > > > an existing user that didn't read all of the documentation) and you > just > > > look at the replica assignment, you might be confused by why> a > particular > > > broker wasn't getting any leaderships, even though it appeared like it > > > should. More mechanisms mean more complexity> for users and developers > > > most of the time. > > > > > > > > > > > > > > > I would like stress the point that running reassignment to change > the > > > > > ordering of the replica (putting a broker to the end of partition > > > > > assignment) is unnecessary, because after some time the broker is > > > > > caught up, it can start serving traffic and then need to run > > > > > reassignments again to "rollback" to previous states. As I > mentioned > > > in > > > > > KIP-491, this is just tedious work. 
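For concreteness, the reassignment-based demotion being debated here could look roughly like the sketch below, assuming the KIP-455 Admin API that ships with Kafka 2.4. It is only an illustration of the approach, not an existing tool; `DemoteBrokerTool` and `demote` are made-up names.

```java
import java.util.*;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.TopicPartitionInfo;

public class DemoteBrokerTool {
    /**
     * For every partition of the given topics where brokerId is the preferred
     * (first) replica, move it to the end of the replica list while keeping the
     * replica set itself unchanged, then trigger a preferred leader election.
     * Returns the touched partitions so the original ordering can be restored later.
     */
    static Set<TopicPartition> demote(Admin admin, int brokerId, Collection<String> topics) throws Exception {
        Map<TopicPartition, Optional<NewPartitionReassignment>> moves = new HashMap<>();
        for (TopicDescription td : admin.describeTopics(topics).all().get().values()) {
            for (TopicPartitionInfo p : td.partitions()) {
                List<Integer> replicas = p.replicas().stream().map(Node::id).collect(Collectors.toList());
                if (!replicas.isEmpty() && replicas.get(0) == brokerId) {
                    List<Integer> reordered = new ArrayList<>(replicas);
                    reordered.remove(Integer.valueOf(brokerId));
                    reordered.add(brokerId);                       // same replicas, demoted broker last
                    moves.put(new TopicPartition(td.name(), p.partition()),
                              Optional.of(new NewPartitionReassignment(reordered)));
                }
            }
        }
        admin.alterPartitionReassignments(moves).all().get();       // ordering-only change, completes quickly
        // Reordering alone does not move leadership; a preferred leader election is still needed.
        admin.electLeaders(ElectionType.PREFERRED, moves.keySet());
        return moves.keySet();
    }
}
```

Undoing it later means re-applying the saved original ordering and running one more preferred leader election, which is where the 3 x / 4 x O(N) counting in this thread comes from.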
> > > > > > > > In general, using an external rebalancing tool like Cruise Control > is a > > > > good idea to keep things balanced without having deal with manual > > > > rebalancing. We expect more and more people who have a complex or > > > > large cluster will start using tools like this. > > > > > > > > However, if you choose to do manual rebalancing, it shouldn't be that > > > > bad. You would save the existing partition ordering before making > your > > > > changes, then make your changes (perhaps by running a simple command > > > > line tool that switches the order of the replicas). Then, once you > > > > felt like the broker was ready to serve traffic, you could just > > > > re-apply the old ordering which you had saved. > > > > > > > > > > > > > > I agree this might introduce some complexities for > users/developers. > > > > > But if this feature is good, and well documented, it is good for > the > > > > > kafka product/community. Just like KIP-460 enabling unclean leader > > > > > election to override TopicLevel/Broker Level config of > > > > > `unclean.leader.election.enable` > > > > > > > > > > > I agree that it would be nice if we could treat some brokers > > > differently for the purposes of placing replicas, selecting leaders, > etc. > > > > Right now, we don't have any way of implementing that without forking > the > > > broker. I would support a new PlacementPolicy class that> would close > this > > > gap. But I don't think this KIP is flexible enough to fill this > role. For > > > example, it can't prevent users from creating> new single-replica > topics > > > that get put on the "bad" replica. Perhaps we should reopen the > > > discussion> about > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces > > > > > > > > > > Creating topic with single-replica is beyond what KIP-491 is > trying to > > > > > achieve. The user needs to take responsibility of doing that. I do > > > see > > > > > some Samza clients notoriously creating single-replica topics and > that > > > > > got flagged by alerts, because a single broker down/maintenance > will > > > > > cause offline partitions. For KIP-491 preferred leader "blacklist", > > > > > the single-replica will still serve as leaders, because there is no > > > > > other alternative replica to be chosen as leader. > > > > > > > > > > Even with a new PlacementPolicy for topic creation/partition > > > expansion, > > > > > it still needs the blacklist info (e.g. a zk path node, or broker > > > > > level/topic level config) to "blacklist" the broker to be preferred > > > > > leader? Would it be the same as KIP-491 is introducing? > > > > > > > > I was thinking about a PlacementPolicy filling the role of preventing > > > > people from creating single-replica partitions on a node that we > didn't > > > > want to ever be the leader. I thought that it could also prevent > > > > people from designating those nodes as preferred leaders during topic > > > > creation, or Kafka from doing itduring random topic creation. I was > > > > assuming that the PlacementPolicy would determine which nodes were > > > > which through static configuration keys. I agree static > configuration > > > > keys are somewhat less flexible than dynamic configuration. 
> > > > > > > > best, > > > > Colin > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > George > > > > > > > > > > On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe > > > > > <cmcc...@apache.org> wrote: > > > > > > > > > > On Fri, Aug 2, 2019, at 20:02, George Li wrote: > > > > > > Hi Colin, > > > > > > Thanks for looking into this KIP. Sorry for the late response. > been > > > busy. > > > > > > > > > > > > If a cluster has MAMY topic partitions, moving this "blacklist" > > > broker > > > > > > to the end of replica list is still a rather "big" operation, > > > involving > > > > > > submitting reassignments. The KIP-491 way of blacklist is much > > > > > > simpler/easier and can undo easily without changing the replica > > > > > > assignment ordering. > > > > > > > > > > Hi George, > > > > > > > > > > Even if you have a way of blacklisting an entire broker all at > once, > > > > > you still would need to run a leader election for each partition > where > > > > > you want to move the leader off of the blacklisted broker. So the > > > > > operation is still O(N) in that sense-- you have to do something > per > > > > > partition. > > > > > > > > > > In general, reassignment will get a lot easier and quicker once > > > KIP-455 > > > > > is implemented. Reassignments that just change the order of > preferred > > > > > replicas for a specific partition should complete pretty much > > > instantly. > > > > > > > > > > I think it's simpler and easier just to have one source of truth > for > > > > > what the preferred replica is for a partition, rather than two. So > > > for > > > > > me, the fact that the replica assignment ordering isn't changed is > > > > > actually a big disadvantage of this KIP. If you are a new user (or > > > > > just an existing user that didn't read all of the documentation) > and > > > > > you just look at the replica assignment, you might be confused by > why > > > a > > > > > particular broker wasn't getting any leaderships, even though it > > > > > appeared like it should. More mechanisms mean more complexity for > > > > > users and developers most of the time. > > > > > > > > > > > Major use case for me, a failed broker got swapped with new > > > hardware, > > > > > > and starts up as empty (with latest offset of all partitions), > the > > > SLA > > > > > > of retention is 1 day, so before this broker is up to be in-sync > for > > > 1 > > > > > > day, we would like to blacklist this broker from serving traffic. > > > after > > > > > > 1 day, the blacklist is removed and run preferred leader > election. > > > > > > This way, no need to run reassignments before/after. This is the > > > > > > "temporary" use-case. > > > > > > > > > > What if we just add an option to the reassignment tool to generate > a > > > > > plan to move all the leaders off of a specific broker? The tool > could > > > > > also run a leader election as well. That would be a simple way of > > > > > doing this without adding new mechanisms or broker-side > > > configurations, > > > > > etc. > > > > > > > > > > > > > > > > > There are use-cases that this Preferred Leader "blacklist" can be > > > > > > somewhat permanent, as I explained in the AWS data center > instances > > > Vs. > > > > > > on-premises data center bare metal machines (heterogenous > hardware), > > > > > > that the AWS broker_ids will be blacklisted. So new topics > > > created, > > > > > > or existing topic expansion would not make them serve traffic > even > > > they > > > > > > could be the preferred leader. 
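If KIP-491 went the dynamic-config route, flagging a broker would be a single config update with nothing to restore afterwards beyond deleting the entry. A rough sketch, assuming the Admin incrementalAlterConfigs API; the key name `preferred.leader.blacklist` is only a placeholder, since KIP-491 has not settled on one and no such config exists in Kafka today.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class LeaderBlacklistConfig {
    // Placeholder for whatever config key KIP-491 would introduce; not a real Kafka config.
    static final String BLACKLIST_KEY = "preferred.leader.blacklist";

    /** Deprioritize the given broker ids (e.g. "1001,1002") via a cluster-wide dynamic config. */
    static void set(Admin admin, String brokerIds) throws Exception {
        apply(admin, new AlterConfigOp(new ConfigEntry(BLACKLIST_KEY, brokerIds), AlterConfigOp.OpType.SET));
    }

    /** Remove the blacklist once the broker has caught up with enough historical data. */
    static void clear(Admin admin) throws Exception {
        apply(admin, new AlterConfigOp(new ConfigEntry(BLACKLIST_KEY, ""), AlterConfigOp.OpType.DELETE));
    }

    private static void apply(Admin admin, AlterConfigOp op) throws Exception {
        // "" targets the cluster-wide default broker config resource.
        ConfigResource cluster = new ConfigResource(ConfigResource.Type.BROKER, "");
        Map<ConfigResource, Collection<AlterConfigOp>> ops =
                Collections.singletonMap(cluster, Collections.singletonList(op));
        admin.incrementalAlterConfigs(ops).all().get();
    }
}
```

This makes the demotion itself O(1); per George's accounting above, only the preferred leader election after removing the entry remains O(N). The trade-off Colin raises is that it adds a second source of truth for leadership next to the replica assignment ordering.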
> > > > > > > > > > I agree that it would be nice if we could treat some brokers > > > > > differently for the purposes of placing replicas, selecting > leaders, > > > > > etc. Right now, we don't have any way of implementing that without > > > > > forking the broker. I would support a new PlacementPolicy class > that > > > > > would close this gap. But I don't think this KIP is flexible > enough > > > to > > > > > fill this role. For example, it can't prevent users from creating > new > > > > > single-replica topics that get put on the "bad" replica. Perhaps > we > > > > > should reopen the discussion about > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces > > > > > > > > > > regards, > > > > > Colin > > > > > > > > > > > > > > > > > Please let me know there are more question. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > George > > > > > > > > > > > > On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe > > > > > > <cmcc...@apache.org> wrote: > > > > > > > > > > > > We still want to give the "blacklisted" broker the leadership if > > > > > > nobody else is available. Therefore, isn't putting a broker on > the > > > > > > blacklist pretty much the same as moving it to the last entry in > the > > > > > > replicas list and then triggering a preferred leader election? > > > > > > > > > > > > If we want this to be undone after a certain amount of time, or > > > under > > > > > > certain conditions, that seems like something that would be more > > > > > > effectively done by an external system, rather than putting all > > > these > > > > > > policies into Kafka. > > > > > > > > > > > > best, > > > > > > Colin > > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019, at 18:23, George Li wrote: > > > > > > > Hi Satish, > > > > > > > Thanks for the reviews and feedbacks. > > > > > > > > > > > > > > > > The following is the requirements this KIP is trying to > > > accomplish: > > > > > > > > This can be moved to the"Proposed changes" section. > > > > > > > > > > > > > > Updated the KIP-491. > > > > > > > > > > > > > > > >>The logic to determine the priority/order of which broker > > > should be > > > > > > > > preferred leader should be modified. The broker in the > > > preferred leader > > > > > > > > blacklist should be moved to the end (lowest priority) when > > > > > > > > determining leadership. > > > > > > > > > > > > > > > > I believe there is no change required in the ordering of the > > > preferred > > > > > > > > replica list. Brokers in the preferred leader blacklist are > > > skipped > > > > > > > > until other brokers int he list are unavailable. > > > > > > > > > > > > > > Yes. partition assignment remained the same, replica & > ordering. > > > The > > > > > > > blacklist logic can be optimized during implementation. > > > > > > > > > > > > > > > >>The blacklist can be at the broker level. However, there > might > > > be use cases > > > > > > > > where a specific topic should blacklist particular brokers, > which > > > > > > > > would be at the > > > > > > > > Topic level Config. For this use cases of this KIP, it seems > > > that broker level > > > > > > > > blacklist would suffice. Topic level preferred leader > blacklist > > > might > > > > > > > > be future enhancement work. > > > > > > > > > > > > > > > > I agree that the broker level preferred leader blacklist > would be > > > > > > > > sufficient. Do you have any use cases which require topic > level > > > > > > > > preferred blacklist? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't have any concrete use cases for Topic level preferred > > > leader > > > > > > > blacklist. One scenarios I can think of is when a broker has > high > > > CPU > > > > > > > usage, trying to identify the big topics (High MsgIn, High > > > BytesIn, > > > > > > > etc), then try to move the leaders away from this broker, > before > > > doing > > > > > > > an actual reassignment to change its preferred leader, try to > put > > > this > > > > > > > preferred_leader_blacklist in the Topic Level config, and run > > > preferred > > > > > > > leader election, and see whether CPU decreases for this broker, > > > if > > > > > > > yes, then do the reassignments to change the preferred leaders > to > > > be > > > > > > > "permanent" (the topic may have many partitions like 256 that > has > > > quite > > > > > > > a few of them having this broker as preferred leader). So this > > > Topic > > > > > > > Level config is an easy way of doing trial and check the > result. > > > > > > > > > > > > > > > > > > > > > > You can add the below workaround as an item in the rejected > > > alternatives section > > > > > > > > "Reassigning all the topic/partitions which the intended > broker > > > is a > > > > > > > > replica for." > > > > > > > > > > > > > > Updated the KIP-491. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > George > > > > > > > > > > > > > > On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana > > > > > > > <satish.dugg...@gmail.com> wrote: > > > > > > > > > > > > > > Thanks for the KIP. I have put my comments below. > > > > > > > > > > > > > > This is a nice improvement to avoid cumbersome maintenance. > > > > > > > > > > > > > > >> The following is the requirements this KIP is trying to > > > accomplish: > > > > > > > The ability to add and remove the preferred leader > deprioritized > > > > > > > list/blacklist. e.g. new ZK path/node or new dynamic config. > > > > > > > > > > > > > > This can be moved to the"Proposed changes" section. > > > > > > > > > > > > > > >>The logic to determine the priority/order of which broker > should > > > be > > > > > > > preferred leader should be modified. The broker in the > preferred > > > leader > > > > > > > blacklist should be moved to the end (lowest priority) when > > > > > > > determining leadership. > > > > > > > > > > > > > > I believe there is no change required in the ordering of the > > > preferred > > > > > > > replica list. Brokers in the preferred leader blacklist are > skipped > > > > > > > until other brokers int he list are unavailable. > > > > > > > > > > > > > > >>The blacklist can be at the broker level. However, there > might > > > be use cases > > > > > > > where a specific topic should blacklist particular brokers, > which > > > > > > > would be at the > > > > > > > Topic level Config. For this use cases of this KIP, it seems > that > > > broker level > > > > > > > blacklist would suffice. Topic level preferred leader > blacklist > > > might > > > > > > > be future enhancement work. > > > > > > > > > > > > > > I agree that the broker level preferred leader blacklist would > be > > > > > > > sufficient. Do you have any use cases which require topic level > > > > > > > preferred blacklist? > > > > > > > > > > > > > > You can add the below workaround as an item in the rejected > > > alternatives section > > > > > > > "Reassigning all the topic/partitions which the intended > broker is > > > a > > > > > > > replica for." 
> > > > > > > > > > > > > > Thanks, > > > > > > > Satish. > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski > > > > > > > <stanis...@confluent.io> wrote: > > > > > > > > > > > > > > > > Hey George, > > > > > > > > > > > > > > > > Thanks for the KIP, it's an interesting idea. > > > > > > > > > > > > > > > > I was wondering whether we could achieve the same thing via > the > > > > > > > > kafka-reassign-partitions tool. As you had also said in the > > > JIRA, it is > > > > > > > > true that this is currently very tedious with the tool. My > > > thoughts are > > > > > > > > that we could improve the tool and give it the notion of a > > > "blacklisted > > > > > > > > preferred leader". > > > > > > > > This would have some benefits like: > > > > > > > > - more fine-grained control over the blacklist. we may not > want > > > to > > > > > > > > blacklist all the preferred leaders, as that would make the > > > blacklisted > > > > > > > > broker a follower of last resort which is not very useful. In > > > the cases of > > > > > > > > an underpowered AWS machine or a controller, you might > overshoot > > > and make > > > > > > > > the broker very underutilized if you completely make it > > > leaderless. > > > > > > > > - is not permanent. If we are to have a blacklist leaders > config, > > > > > > > > rebalancing tools would also need to know about it and > > > manipulate/respect > > > > > > > > it to achieve a fair balance. > > > > > > > > It seems like both problems are tied to balancing partitions, > > > it's just > > > > > > > > that KIP-491's use case wants to balance them against other > > > factors in a > > > > > > > > more nuanced way. It makes sense to have both be done from > the > > > same place > > > > > > > > > > > > > > > > To make note of the motivation section: > > > > > > > > > Avoid bouncing broker in order to lose its leadership > > > > > > > > The recommended way to make a broker lose its leadership is > to > > > run a > > > > > > > > reassignment on its partitions > > > > > > > > > The cross-data center cluster has AWS cloud instances which > > > have less > > > > > > > > computing power > > > > > > > > We recommend running Kafka on homogeneous machines. It would > be > > > cool if the > > > > > > > > system supported more flexibility in that regard but that is > > > more nuanced > > > > > > > > and a preferred leader blacklist may not be the best first > > > approach to the > > > > > > > > issue > > > > > > > > > > > > > > > > Adding a new config which can fundamentally change the way > > > replication is > > > > > > > > done is complex, both for the system (the replication code is > > > complex > > > > > > > > enough) and the user. Users would have another potential > config > > > that could > > > > > > > > backfire on them - e.g if left forgotten. > > > > > > > > > > > > > > > > Could you think of any downsides to implementing this > > > functionality (or a > > > > > > > > variation of it) in the kafka-reassign-partitions.sh tool? > > > > > > > > One downside I can see is that we would not have it handle > new > > > partitions > > > > > > > > created after the "blacklist operation". 
As a first > iteration I > > > think that > > > > > > > > may be acceptable > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Stanislav > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li < > > > sql_consult...@yahoo.com.invalid> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > Pinging the list for the feedbacks of this KIP-491 ( > > > > > > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982 > > > > > > > > > ) > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > George > > > > > > > > > > > > > > > > > > On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li < > > > > > > > > > sql_consult...@yahoo.com.INVALID> wrote: > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > I have created KIP-491 ( > > > > > > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982 > > > ) > > > > > > > > > for putting a broker to the preferred leader blacklist or > > > deprioritized > > > > > > > > > list so when determining leadership, it's moved to the > lowest > > > priority for > > > > > > > > > some of the listed use-cases. > > > > > > > > > > > > > > > > > > Please provide your comments/feedbacks. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > George > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Forwarded Message ----- From: Jose Armando Garcia > > > Sancio (JIRA) < > > > > > > > > > j...@apache.org>To: "sql_consult...@yahoo.com" < > > > sql_consult...@yahoo.com>Sent: > > > > > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] > > > [Commented] > > > > > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized > list) > > > > > > > > > > > > > > > > > > [ > > > > > > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511 > > > > > > > > > ] > > > > > > > > > > > > > > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638: > > > > > > > > > --------------------------------------------------- > > > > > > > > > > > > > > > > > > Thanks for feedback and clear use cases [~sql_consulting]. > > > > > > > > > > > > > > > > > > > Preferred Leader Blacklist (deprioritized list) > > > > > > > > > > ----------------------------------------------- > > > > > > > > > > > > > > > > > > > > Key: KAFKA-8638 > > > > > > > > > > URL: > > > https://issues.apache.org/jira/browse/KAFKA-8638 > > > > > > > > > > Project: Kafka > > > > > > > > > > Issue Type: Improvement > > > > > > > > > > Components: config, controller, core > > > > > > > > > > Affects Versions: 1.1.1, 2.3.0, 2.2.1 > > > > > > > > > > Reporter: GEORGE LI > > > > > > > > > > Assignee: GEORGE LI > > > > > > > > > > Priority: Major > > > > > > > > > > > > > > > > > > > > Currently, the kafka preferred leader election will pick > the > > > broker_id > > > > > > > > > in the topic/partition replica assignments in a priority > order > > > when the > > > > > > > > > broker is in ISR. The preferred leader is the broker id in > the > > > first > > > > > > > > > position of replica. There are use-cases that, even the > first > > > broker in the > > > > > > > > > replica assignment is in ISR, there is a need for it to be > > > moved to the end > > > > > > > > > of ordering (lowest priority) when deciding leadership > during > > > preferred > > > > > > > > > leader election. 
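In other words, only the priority order used when choosing a leader changes; the replica assignment itself stays untouched. A minimal illustration of that selection rule (hypothetical helper, not the controller's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class PreferredLeaderSelection {
    /**
     * Illustrative only: pick a leader following assignment order, but treat brokers
     * on the deprioritized list as last resort. A blacklisted broker is chosen only
     * when no other live, in-sync replica exists.
     */
    static Optional<Integer> selectLeader(List<Integer> assignment,
                                          Set<Integer> isr,
                                          Set<Integer> liveBrokers,
                                          Set<Integer> blacklist) {
        List<Integer> candidates = new ArrayList<>();
        List<Integer> deprioritized = new ArrayList<>();
        for (Integer broker : assignment) {
            (blacklist.contains(broker) ? deprioritized : candidates).add(broker);
        }
        candidates.addAll(deprioritized);          // blacklisted brokers drop to lowest priority
        return candidates.stream()
                         .filter(b -> liveBrokers.contains(b) && isr.contains(b))
                         .findFirst();
    }
}
```

With assignment (1,2,3) and broker 1 blacklisted, this picks 2 when it is live and in ISR, yet still falls back to 1 if neither 2 nor 3 is available, matching the example below.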
> > > > > > > > > > Let’s use topic/partition replica (1,2,3) as an example. > 1 > > > is the > > > > > > > > > preferred leader. When preferred leadership is run, it > will > > > pick 1 as the > > > > > > > > > leader if it's ISR, if 1 is not online and in ISR, then > pick > > > 2, if 2 is not > > > > > > > > > in ISR, then pick 3 as the leader. There are use cases > that, > > > even 1 is in > > > > > > > > > ISR, we would like it to be moved to the end of ordering > > > (lowest priority) > > > > > > > > > when deciding leadership during preferred leader election. > > > Below is a list > > > > > > > > > of use cases: > > > > > > > > > > * (If broker_id 1 is a swapped failed host and brought up > > > with last > > > > > > > > > segments or latest offset without historical data (There is > > > another effort > > > > > > > > > on this), it's better for it to not serve leadership till > it's > > > caught-up. > > > > > > > > > > * The cross-data center cluster has AWS instances which > have > > > less > > > > > > > > > computing power than the on-prem bare metal machines. We > > > could put the AWS > > > > > > > > > broker_ids in Preferred Leader Blacklist, so on-prem > brokers > > > can be elected > > > > > > > > > leaders, without changing the reassignments ordering of the > > > replicas. > > > > > > > > > > * If the broker_id 1 is constantly losing leadership > after > > > some time: > > > > > > > > > "Flapping". we would want to exclude 1 to be a leader > unless > > > all other > > > > > > > > > brokers of this topic/partition are offline. The > “Flapping” > > > effect was > > > > > > > > > seen in the past when 2 or more brokers were bad, when they > > > lost leadership > > > > > > > > > constantly/quickly, the sets of partition replicas they > belong > > > to will see > > > > > > > > > leadership constantly changing. The ultimate solution is > to > > > swap these bad > > > > > > > > > hosts. But for quick mitigation, we can also put the bad > > > hosts in the > > > > > > > > > Preferred Leader Blacklist to move the priority of its > being > > > elected as > > > > > > > > > leaders to the lowest. > > > > > > > > > > * If the controller is busy serving an extra load of > > > metadata requests > > > > > > > > > and other tasks. we would like to put the controller's > leaders > > > to other > > > > > > > > > brokers to lower its CPU load. currently bouncing to lose > > > leadership would > > > > > > > > > not work for Controller, because after the bounce, the > > > controller fails > > > > > > > > > over to another broker. > > > > > > > > > > * Avoid bouncing broker in order to lose its leadership: > it > > > would be > > > > > > > > > good if we have a way to specify which broker should be > > > excluded from > > > > > > > > > serving traffic/leadership (without changing the replica > > > assignment > > > > > > > > > ordering by reassignments, even though that's quick), and > run > > > preferred > > > > > > > > > leader election. A bouncing broker will cause temporary > URP, > > > and sometimes > > > > > > > > > other issues. Also a bouncing of broker (e.g. broker_id 1) > > > can temporarily > > > > > > > > > lose all its leadership, but if another broker (e.g. > broker_id > > > 2) fails or > > > > > > > > > gets bounced, some of its leaderships will likely failover > to > > > broker_id 1 > > > > > > > > > on a replica with 3 brokers. 
If broker_id 1 is in the > > > blacklist, then in > > > > > > > > > such a scenario even broker_id 2 offline, the 3rd broker > can > > > take > > > > > > > > > leadership. > > > > > > > > > > The current work-around of the above is to change the > > > topic/partition's > > > > > > > > > replica reassignments to move the broker_id 1 from the > first > > > position to > > > > > > > > > the last position and run preferred leader election. e.g. > (1, > > > 2, 3) => (2, > > > > > > > > > 3, 1). This changes the replica reassignments, and we need > to > > > keep track of > > > > > > > > > the original one and restore if things change (e.g. > controller > > > fails over > > > > > > > > > to another broker, the swapped empty broker caught up). > That’s > > > a rather > > > > > > > > > tedious task. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > This message was sent by Atlassian JIRA > > > > > > > > > (v7.6.3#76005) > > > -- Best, Stanislav