Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Colin McCabe Fri, 06 Sep 2019 14:20:50 -0700

One possibility would be writing a new command-line tool that would 
deprioritize a given replica using the new KIP-455 API.  Then it could write 
out a JSON file containing the old priorities, which could be restored when 
(or if) we needed to do so.  This seems like it might be simpler and easier to 
maintain than a separate set of metadata about blacklists.
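
To make that concrete, here is a minimal sketch in Python of the bookkeeping 
such a tool could do.  Everything in it is an assumption made for 
illustration: the function names, the "topic-partition" keys, and the plain 
dict standing in for the cluster's replica assignments.  It does not call the 
KIP-455 API; submitting the new ordering and running a leader election are 
left out.

```python
import json


def deprioritize(assignments, broker_id):
    """Move broker_id to the end of every replica list that contains it.

    assignments maps a hypothetical "topic-partition" key to its ordered
    replica list. Returns (new_assignments, saved), where saved holds the
    original ordering of each partition that was changed.
    """
    new, saved = {}, {}
    for tp, replicas in assignments.items():
        if broker_id in replicas and replicas[-1] != broker_id:
            saved[tp] = list(replicas)
            new[tp] = [r for r in replicas if r != broker_id] + [broker_id]
        else:
            new[tp] = list(replicas)
    return new, saved


def restore(assignments, saved):
    """Re-apply the saved orderings on top of the current assignments.

    Naive on purpose: it does not check whether a partition's replica set
    changed while the broker was deprioritized.
    """
    out = {tp: list(r) for tp, r in assignments.items()}
    out.update({tp: list(r) for tp, r in saved.items()})
    return out


current = {"events-0": [1, 2, 3], "events-1": [2, 1, 3], "events-2": [2, 3, 1]}
changed, saved = deprioritize(current, 1)

# The JSON text is what the tool would write to a file for later use.
backup = json.dumps(saved, sort_keys=True)
restored = restore(changed, json.loads(backup))
```

Only the partitions whose ordering actually changed end up in the saved 
file, so restoring touches nothing else.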


best,
Colin


On Fri, Sep 6, 2019, at 11:58, George Li wrote:
>  Hi, 
> 
> Just want to ping and bubble up the discussion of KIP-491. 
> 
> In a large Kafka deployment, with thousands of brokers across many 
> clusters, hardware failures are frequent.  Using reassignments to 
> change the preferred leaders is a workaround, but it incurs more work 
> than the preferred leader blacklist proposed in KIP-491, and it is 
> hard to scale. 
> 
> I am wondering whether others running Kafka at large scale are hitting 
> the same problem. 
> 
> 
> Satish,  
> 
> Regarding your previous question about whether there is a use case for 
> a topic-level preferred leader "blacklist", I thought of one: improving 
> rebalance/reassignment.  Large partitions usually cause 
> performance/stability issues during reassignment, so we are planning a 
> change in which the new replica starts at the leader's latest offset 
> (this way the replica is in the ISR almost instantly and the 
> reassignment completes), and this partition's new replica is put into 
> the preferred leader "blacklist" in the topic-level config.  After some 
> time (the retention time), once the new replica has caught up and is 
> ready to serve traffic, we update/remove the topic config for this 
> partition's preferred leader blacklist. 
> 
> I will update the KIP-491 later for this use case of Topic Level config 
> for Preferred Leader Blacklist.
> 
> 
> Thanks,
> George
>  
>     On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li 
> <[email protected]> wrote:  
>  
>   Hi Colin,
> 
> > In your example, I think we're comparing apples and oranges.  You started 
> > by outlining a scenario where "an empty broker... comes up... [without] any 
> > leadership[s]."  But then you criticize using reassignment to switch the 
> > order of preferred replicas because it "would not actually switch the 
> > leader automatically."  If the empty broker doesn't have any leaderships, 
> > there is nothing to be switched, right?
> 
> Let me explain this particular use case in detail, to compare apples 
> to apples. 
> 
> Let's say a healthy broker hosts 3000 partitions, and it is the 
> preferred leader for 1000 of them (leader count is 1000).  There is a 
> hardware failure (disk, memory, etc.), and the Kafka process crashes.  
> We swap this host with another host but keep the same broker.id.  When 
> this new broker comes up, it has no historical data, and we arrange for 
> the current last offsets of all partitions to be set in the 
> replication-offset-checkpoint file (if we don't set them, the 
> ReplicaFetcher could pull huge amounts of historical data from other 
> brokers, causing high cluster latency and other instabilities).  So 
> when Kafka is brought up, it quickly catches up as a follower in the 
> ISR.  Note that we have auto.leader.rebalance.enable disabled, so it is 
> not serving any traffic as leader (leader count = 0), even though it is 
> the preferred leader for 1000 partitions. 
> 
> We need to keep this broker from serving traffic for a few hours or 
> days, depending on the topic's retention SLA, until it has enough 
> historical data. 
> 
> 
> * The traditional way is to use reassignments to move this broker to 
> the end of the assignment in the 1000 partitions where it is the 
> preferred leader.  This is an O(N) operation, and in my experience we 
> can't submit all 1000 at the same time, otherwise it causes higher 
> latencies, even though each reassignment in this case can complete 
> almost instantly.  After a few hours or days, when this broker is ready 
> to serve traffic, we have to run reassignments again to restore it as 
> the preferred leader of those 1000 partitions: another O(N) operation.  
> Then we run preferred leader election: O(N) again.  So the total is 
> 3 x O(N) operations.  The point is that since the new empty broker is 
> expected to be the same as the old one in terms of hosted 
> partitions/leaders, it seems unnecessary to do reassignments 
> (reordering of replicas) while the broker is catching up. 
> 
> 
> 
> * The new feature, the Preferred Leader "Blacklist": we just need to 
> set a dynamic config indicating that this broker should be given the 
> lowest priority when choosing a leader (in preferred leader election, 
> broker failover, or unclean leader election).  There is NO need to run 
> any reassignments.  After a few hours or days, when this broker is 
> ready, we remove the dynamic config and run preferred leader election, 
> and this broker serves traffic for the 1000 partitions where it was 
> originally the preferred leader.  So the total is 1 x O(N) operation. 
> 
> 
> If auto.leader.rebalance.enable is enabled, the Preferred Leader 
> "Blacklist" can be put in place before Kafka is started, to prevent 
> this broker from serving traffic.  With the traditional way of running 
> reassignments, once the broker is up with 
> auto.leader.rebalance.enable on, leadership may start moving to this 
> new empty broker, and we might have to run preferred leader election 
> after the reassignments to remove its leaderships.  E.g. a (1,2,3) => 
> (2,3,1) reassignment only changes the ordering; 1 remains the current 
> leader, and a preferred leader election is needed after the 
> reassignment to change the leader to 2.  So that is potentially one 
> more O(N) operation. 
> 
> I hope the above example shows how easy it is to "blacklist" a broker 
> from serving leadership.  For someone managing a production Kafka 
> cluster, it's important to react fast to certain alerts and 
> mitigate/resolve issues.  As with the other use cases I listed in 
> KIP-491, I think this feature can make Kafka easier to manage/operate. 
> 
> > In general, using an external rebalancing tool like Cruise Control is a 
> > good idea to keep things balanced without having to deal with manual 
> > rebalancing.  We expect more and more people who have a complex or large 
> > cluster will start using tools like this.
> > 
> > However, if you choose to do manual rebalancing, it shouldn't be that bad.  
> > You would save the existing partition ordering before making your changes, 
> > then make your changes (perhaps by running a simple command line tool that 
> > switches the order of the replicas).  Then, once you felt like the broker 
> > was ready to serve traffic, you could just re-apply the old ordering which 
> > you had saved.
> 
> 
> We do have our own rebalancing tool, which has its own criteria such as 
> rack diversity, disk usage, spreading partitions/leaders across all 
> brokers in the cluster per topic, leadership Bytes/BytesIn served per 
> broker, etc.  We can run reassignments.  The point is whether that is 
> really necessary, and whether there is a more effective, easier, safer 
> way to do it. 
> 
> Take another use case: moving leadership off a busy controller to give 
> it more capacity to serve metadata requests and other work.  The 
> controller can fail over; with the preferred leader "blacklist", we do 
> not have to run reassignments again when the controller fails over, we 
> just change the blacklisted broker_id. 
> 
> 
> > I was thinking about a PlacementPolicy filling the role of preventing 
> > people from creating single-replica partitions on a node that we didn't 
> > want to ever be the leader.  I thought that it could also prevent people 
> > from designating those nodes as preferred leaders during topic creation, or 
> > Kafka from doing it during random topic creation.  I was assuming that the 
> > PlacementPolicy would determine which nodes were which through static 
> > configuration keys.  I agree static configuration keys are somewhat less 
> > flexible than dynamic configuration.
> 
> 
> I think single-replica partitions might not be a good example.  There 
> should not be any single-replica partitions at all.  If there are, it's 
> probably because someone is trying to save disk space with fewer 
> replicas.  I think the minimum should be 2.  A user who purposely 
> creates a single-replica partition takes full responsibility for data 
> loss and unavailability when a broker fails or is under maintenance. 
> 
> 
> I think it would be better to use dynamic instead of static config.  I 
> also think it would be better to have the topic creation policy 
> enforced in the Kafka server OR in an external service.  We have an 
> external/central service managing topic creation/partition expansion, 
> which takes into account rack diversity, replication factor (2, 3 or 4 
> depending on cluster/topic type), the policy for replicating the topic 
> between Kafka clusters, etc. 
> 
> 
> 
> Thanks,
> George
> 
> 
>     On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe 
> <[email protected]> wrote:  
>  
>  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> >  Hi Colin,
> > 
> > Thanks for your feedback.  Comments below:
> > > Even if you have a way of blacklisting an entire broker all at once, you 
> > > still would need to run a leader election for each partition where you 
> > > want to move the leader off of the blacklisted broker.  So the operation 
> > > is still O(N) in that sense-- you have to do something per partition.
> > 
> > For a failed broker that is swapped with an empty broker, when it comes 
> > up it will not have any leadership, and we would like it to stay 
> > without leaderships for a couple of hours or days.  So no preferred 
> > leader election, which is an O(N) operation, is needed in this case.  
> > Putting it on the preferred leader blacklist would safeguard this 
> > broker from serving traffic during that time.  Otherwise, if another 
> > broker fails (and this broker is 1st or 2nd in the assignment), or 
> > someone runs preferred leader election, this new "empty" broker can 
> > still get leaderships. 
> > 
> > Also, running a reassignment to change the ordering of the preferred 
> > leader would not actually switch the leader automatically, e.g. 
> > (1,2,3) => (2,3,1), unless preferred leader election is run to switch 
> > the current leader from 1 to 2.  So the operation is at least 
> > 2 x O(N), and then after the broker is back to normal, another 
> > 2 x O(N) to roll back. 
> 
> Hi George,
> 
> Hmm.  I guess I'm still on the fence about this feature.
> 
> In your example, I think we're comparing apples and oranges.  You 
> started by outlining a scenario where "an empty broker... comes up... 
> [without] any leadership[s]."  But then you criticize using 
> reassignment to switch the order of preferred replicas because it 
> "would not actually switch the leader automatically."  If the empty 
> broker doesn't have any leaderships, there is nothing to be switched, 
> right?
> 
> > 
> > 
> > > In general, reassignment will get a lot easier and quicker once KIP-455 
> > > is implemented.  Reassignments that just change the order of preferred 
> > > replicas for a specific partition should complete pretty much instantly.
> > > 
> > > I think it's simpler and easier just to have one source of truth for 
> > > what the preferred replica is for a partition, rather than two.  So for 
> > > me, the fact that the replica assignment ordering isn't changed is 
> > > actually a big disadvantage of this KIP.  If you are a new user (or 
> > > just an existing user that didn't read all of the documentation) and 
> > > you just look at the replica assignment, you might be confused by why a 
> > > particular broker wasn't getting any leaderships, even though it 
> > > appeared like it should.  More mechanisms mean more complexity for 
> > > users and developers most of the time.
> > 
> > 
> > I would like to stress the point that running reassignments to change 
> > the ordering of the replicas (putting a broker at the end of the 
> > partition assignment) is unnecessary, because after some time the 
> > broker has caught up and can start serving traffic, and we then need 
> > to run reassignments again to "roll back" to the previous state.  As I 
> > mentioned in KIP-491, this is just tedious work. 
> 
> In general, using an external rebalancing tool like Cruise Control is a 
> good idea to keep things balanced without having to deal with manual 
> rebalancing.  We expect more and more people who have a complex or 
> large cluster will start using tools like this.
> 
> However, if you choose to do manual rebalancing, it shouldn't be that 
> bad.  You would save the existing partition ordering before making your 
> changes, then make your changes (perhaps by running a simple command 
> line tool that switches the order of the replicas).  Then, once you 
> felt like the broker was ready to serve traffic, you could just 
> re-apply the old ordering which you had saved.
> 
> > 
> > I agree this might introduce some complexities for users/developers. 
> > But if this feature is good, and well documented, it is good for the 
> > kafka product/community.  Just like KIP-460 enabling unclean leader 
> > election to override TopicLevel/Broker Level config of 
> > `unclean.leader.election.enable`
> > 
> > > I agree that it would be nice if we could treat some brokers differently 
> > > for the purposes of placing replicas, selecting leaders, etc.  Right 
> > > now, we don't have any way of implementing that without forking the 
> > > broker.  I would support a new PlacementPolicy class that would close 
> > > this gap.  But I don't think this KIP is flexible enough to fill this 
> > > role.  For example, it can't prevent users from creating new 
> > > single-replica topics that get put on the "bad" replica.  Perhaps we 
> > > should reopen the discussion about 
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > 
> > Creating single-replica topics is beyond what KIP-491 is trying to 
> > address.  The user needs to take responsibility for doing that.  I do 
> > see some Samza clients notoriously creating single-replica topics, and 
> > that gets flagged by alerts, because a single broker going down or 
> > into maintenance will cause offline partitions.  With the KIP-491 
> > preferred leader "blacklist", a single replica will still serve as 
> > leader, because there is no alternative replica to be chosen as leader. 
> > 
> > Even with a new PlacementPolicy for topic creation/partition expansion, 
> > wouldn't it still need the blacklist info (e.g. a ZK path/node, or a 
> > broker-level/topic-level config) to "blacklist" a broker from being 
> > preferred leader?  Would that be the same as what KIP-491 is 
> > introducing? 
> 
> I was thinking about a PlacementPolicy filling the role of preventing 
> people from creating single-replica partitions on a node that we didn't 
> want to ever be the leader.  I thought that it could also prevent 
> people from designating those nodes as preferred leaders during topic 
> creation, or Kafka from doing it during random topic creation.  I was 
> assuming that the PlacementPolicy would determine which nodes were 
> which through static configuration keys.  I agree static configuration 
> keys are somewhat less flexible than dynamic configuration.
> 
> best,
> Colin
> 
> 
> > 
> > 
> > Thanks,
> > George
> > 
> >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe 
> > <[email protected]> wrote:  
> >  
> >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > >  Hi Colin,
> > > Thanks for looking into this KIP.  Sorry for the late response; I've 
> > > been busy. 
> > > 
> > > If a cluster has MANY topic partitions, moving this "blacklisted" 
> > > broker to the end of the replica list is still a rather "big" 
> > > operation, involving submitting reassignments.  The KIP-491 way of 
> > > blacklisting is much simpler/easier and can be undone easily, without 
> > > changing the replica assignment ordering. 
> > 
> > Hi George,
> > 
> > Even if you have a way of blacklisting an entire broker all at once, 
> > you still would need to run a leader election for each partition where 
> > you want to move the leader off of the blacklisted broker.  So the 
> > operation is still O(N) in that sense-- you have to do something per 
> > partition.
> > 
> > In general, reassignment will get a lot easier and quicker once KIP-455 
> > is implemented.  Reassignments that just change the order of preferred 
> > replicas for a specific partition should complete pretty much instantly.
> > 
> > I think it's simpler and easier just to have one source of truth for 
> > what the preferred replica is for a partition, rather than two.  So for 
> > me, the fact that the replica assignment ordering isn't changed is 
> > actually a big disadvantage of this KIP.  If you are a new user (or 
> > just an existing user that didn't read all of the documentation) and 
> > you just look at the replica assignment, you might be confused by why a 
> > particular broker wasn't getting any leaderships, even  though it 
> > appeared like it should.  More mechanisms mean more complexity for 
> > users and developers most of the time.
> > 
> > > The major use case for me: a failed broker gets swapped with new 
> > > hardware and starts up empty (with the latest offsets of all 
> > > partitions).  The retention SLA is 1 day, so until this broker has 
> > > been up and in sync for 1 day, we would like to blacklist it from 
> > > serving traffic.  After 1 day, the blacklist is removed and we run 
> > > preferred leader election.  This way, there is no need to run 
> > > reassignments before/after.  This is the "temporary" use case.
> > 
> > What if we just add an option to the reassignment tool to generate a 
> > plan to move all the leaders off of a specific broker?  The tool could 
> > run a leader election as well.  That would be a simple way of 
> > doing this without adding new mechanisms or broker-side configurations, 
> > etc.
> > 
> > > 
> > > There are use cases where this Preferred Leader "blacklist" can be 
> > > somewhat permanent, as I explained with AWS data center instances vs. 
> > > on-premises data center bare metal machines (heterogeneous hardware), 
> > > where the AWS broker_ids would be blacklisted.  New topics created, 
> > > or existing topics expanded, would then not make them serve traffic 
> > > even if they could be the preferred leader. 
> > 
> > I agree that it would be nice if we could treat some brokers 
> > differently for the purposes of placing replicas, selecting leaders, 
> > etc.  Right now, we don't have any way of implementing that without 
> > forking the broker.  I would support a new PlacementPolicy class that 
> > would close this gap.  But I don't think this KIP is flexible enough to 
> > fill this role.  For example, it can't prevent users from creating new 
> > single-replica topics that get put on the "bad" replica.  Perhaps we 
> > should reopen the discussion about 
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > 
> > regards,
> > Colin
> > 
> > > 
> > > Please let me know if there are more questions. 
> > > 
> > > 
> > > Thanks,
> > > George
> > > 
> > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe 
> > > <[email protected]> wrote:  
> > >  
> > >  We still want to give the "blacklisted" broker the leadership if 
> > > nobody else is available.  Therefore, isn't putting a broker on the 
> > > blacklist pretty much the same as moving it to the last entry in the 
> > > replicas list and then triggering a preferred leader election?
> > > 
> > > If we want this to be undone after a certain amount of time, or under 
> > > certain conditions, that seems like something that would be more 
> > > effectively done by an external system, rather than putting all these 
> > > policies into Kafka.
> > > 
> > > best,
> > > Colin
> > > 
> > > 
> > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > >  Hi Satish,
> > > > Thanks for the review and feedback.
> > > > 
> > > > > > The following is the requirements this KIP is trying to accomplish:
> > > > > This can be moved to the "Proposed changes" section.
> > > > 
> > > > Updated the KIP-491. 
> > > > 
> > > > > >>The logic to determine the priority/order of which broker should be
> > > > > preferred leader should be modified.  The broker in the preferred 
> > > > > leader
> > > > > blacklist should be moved to the end (lowest priority) when
> > > > > determining leadership.
> > > > >
> > > > > I believe there is no change required in the ordering of the preferred
> > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > until other brokers in the list are unavailable.
> > > > 
> > > > Yes, the partition assignment remains the same, both replicas and 
> > > > ordering.  The blacklist logic can be optimized during implementation. 
> > > > 
> > > > > >>The blacklist can be at the broker level. However, there might be 
> > > > > >>use cases
> > > > > where a specific topic should blacklist particular brokers, which
> > > > > would be at the
> > > > > Topic level Config. For this use cases of this KIP, it seems that 
> > > > > broker level
> > > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > > be future enhancement work.
> > > > > 
> > > > > I agree that the broker level preferred leader blacklist would be
> > > > > sufficient. Do you have any use cases which require topic level
> > > > > preferred blacklist?
> > > > 
> > > > 
> > > > 
> > > > I don't have any concrete use cases for a topic-level preferred leader 
> > > > blacklist.  One scenario I can think of is when a broker has high CPU 
> > > > usage: we try to identify the big topics (high MsgIn, high BytesIn, 
> > > > etc.) and move their leaders away from this broker.  Before doing 
> > > > an actual reassignment to change the preferred leader, we put this 
> > > > preferred_leader_blacklist in the topic-level config, run preferred 
> > > > leader election, and see whether CPU decreases for this broker.  If 
> > > > yes, we then do the reassignments to make the preferred leader change 
> > > > "permanent" (the topic may have many partitions, e.g. 256, quite a 
> > > > few of which have this broker as preferred leader).  So this 
> > > > topic-level config is an easy way of doing a trial and checking the 
> > > > result. 
> > > > 
> > > > 
> > > > > You can add the below workaround as an item in the rejected 
> > > > > alternatives section
> > > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > > replica for."
> > > > 
> > > > Updated the KIP-491. 
> > > > 
> > > > 
> > > > 
> > > > Thanks, 
> > > > George
> > > > 
> > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> > > > <[email protected]> wrote:  
> > > >  
> > > >  Thanks for the KIP. I have put my comments below.
> > > > 
> > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > 
> > > > >> The following is the requirements this KIP is trying to accomplish:
> > > >   The ability to add and remove the preferred leader deprioritized
> > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > 
> > > > This can be moved to the "Proposed changes" section.
> > > > 
> > > > >>The logic to determine the priority/order of which broker should be
> > > > preferred leader should be modified.  The broker in the preferred leader
> > > > blacklist should be moved to the end (lowest priority) when
> > > > determining leadership.
> > > > 
> > > > I believe there is no change required in the ordering of the preferred
> > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > until other brokers in the list are unavailable.
> > > > 
> > > > >>The blacklist can be at the broker level. However, there might be use 
> > > > >>cases
> > > > where a specific topic should blacklist particular brokers, which
> > > > would be at the
> > > > Topic level Config. For this use cases of this KIP, it seems that 
> > > > broker level
> > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > be future enhancement work.
> > > > 
> > > > I agree that the broker level preferred leader blacklist would be
> > > > sufficient. Do you have any use cases which require topic level
> > > > preferred blacklist?
> > > > 
> > > > You can add the below workaround as an item in the rejected 
> > > > alternatives section
> > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > replica for."
> > > > 
> > > > Thanks,
> > > > Satish.
> > > > 
> > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > <[email protected]> wrote:
> > > > >
> > > > > Hey George,
> > > > >
> > > > > Thanks for the KIP, it's an interesting idea.
> > > > >
> > > > > I was wondering whether we could achieve the same thing via the
> > > > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it 
> > > > > is
> > > > > true that this is currently very tedious with the tool. My thoughts 
> > > > > are
> > > > > that we could improve the tool and give it the notion of a 
> > > > > "blacklisted
> > > > > preferred leader".
> > > > > This would have some benefits like:
> > > > > - more fine-grained control over the blacklist. we may not want to
> > > > > blacklist all the preferred leaders, as that would make the 
> > > > > blacklisted
> > > > > broker a follower of last resort which is not very useful. In the 
> > > > > cases of
> > > > > an underpowered AWS machine or a controller, you might overshoot and 
> > > > > make
> > > > > the broker very underutilized if you completely make it leaderless.
> > > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > > rebalancing tools would also need to know about it and 
> > > > > manipulate/respect
> > > > > it to achieve a fair balance.
> > > > > It seems like both problems are tied to balancing partitions, it's 
> > > > > just
> > > > > that KIP-491's use case wants to balance them against other factors 
> > > > > in a
> > > > > more nuanced way. It makes sense to have both be done from the same 
> > > > > place
> > > > >
> > > > > To make note of the motivation section:
> > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > The recommended way to make a broker lose its leadership is to run a
> > > > > reassignment on its partitions
> > > > > > The cross-data center cluster has AWS cloud instances which have 
> > > > > > less
> > > > > computing power
> > > > > We recommend running Kafka on homogeneous machines. It would be cool 
> > > > > if the
> > > > > system supported more flexibility in that regard but that is more 
> > > > > nuanced
> > > > > and a preferred leader blacklist may not be the best first approach 
> > > > > to the
> > > > > issue
> > > > >
> > > > > Adding a new config which can fundamentally change the way 
> > > > > replication is
> > > > > done is complex, both for the system (the replication code is complex
> > > > > enough) and the user. Users would have another potential config that 
> > > > > could
> > > > > backfire on them - e.g if left forgotten.
> > > > >
> > > > > Could you think of any downsides to implementing this functionality 
> > > > > (or a
> > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > One downside I can see is that we would not have it handle new 
> > > > > partitions
> > > > > created after the "blacklist operation". As a first iteration I think 
> > > > > that
> > > > > may be acceptable
> > > > >
> > > > > Thanks,
> > > > > Stanislav
> > > > >
> > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li 
> > > > > <[email protected]>
> > > > > wrote:
> > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > Pinging the list for feedback on KIP-491  (
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > )
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > I have created KIP-491 (
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > > > for putting a broker to the preferred leader blacklist or 
> > > > > > deprioritized
> > > > > > list so when determining leadership,  it's moved to the lowest 
> > > > > > priority for
> > > > > > some of the listed use-cases.
> > > > > >
> > > > > > Please provide your comments/feedback.
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >
> > > > > >
> > > > > >  ----- Forwarded Message -----
> > > > > > From: Jose Armando Garcia Sancio (JIRA) <[email protected]>
> > > > > > To: "[email protected]" <[email protected]>
> > > > > > Sent: Tuesday, July 9, 2019, 01:06:05 PM PDT
> > > > > > Subject: [jira] [Commented] (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > >
> > > > > >    [
> > > > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > ]
> > > > > >
> > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > ---------------------------------------------------
> > > > > >
> > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > >
> > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > -----------------------------------------------
> > > > > > >
> > > > > > >                Key: KAFKA-8638
> > > > > > >                URL: 
> > > > > > >https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > >            Project: Kafka
> > > > > > >          Issue Type: Improvement
> > > > > > >          Components: config, controller, core
> > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > >            Reporter: GEORGE LI
> > > > > > >            Assignee: GEORGE LI
> > > > > > >            Priority: Major
> > > > > > >
> > > > > > > Currently, the kafka preferred leader election will pick the 
> > > > > > > broker_id
> > > > > > in the topic/partition replica assignments in a priority order when 
> > > > > > the
> > > > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > > > position of replica. There are use-cases that, even the first 
> > > > > > broker in the
> > > > > > replica assignment is in ISR, there is a need for it to be moved to 
> > > > > > the end
> > > > > > of ordering (lowest priority) when deciding leadership during  
> > > > > > preferred
> > > > > > leader election.
> > > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > > > preferred leader.  When preferred leadership is run, it will pick 1 
> > > > > > as the
> > > > > > leader if it's in ISR, if 1 is not online and in ISR, then pick 2, if 
> > > > > > 2 is not
> > > > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 
> > > > > > is in
> > > > > > ISR, we would like it to be moved to the end of ordering (lowest 
> > > > > > priority)
> > > > > > when deciding leadership during preferred leader election.  Below 
> > > > > > is a list
> > > > > > of use cases:
> > > > > > > * If broker_id 1 is a swapped failed host and brought up with 
> > > > > > > last
> > > > > > segments or latest offset without historical data (There is another 
> > > > > > effort
> > > > > > on this), it's better for it to not serve leadership till it's 
> > > > > > caught-up.
> > > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > > computing power than the on-prem bare metal machines.  We could put 
> > > > > > the AWS
> > > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be 
> > > > > > elected
> > > > > > leaders, without changing the reassignments ordering of the 
> > > > > > replicas.
> > > > > > > * If the broker_id 1 is constantly losing leadership after some 
> > > > > > > time:
> > > > > > "Flapping". we would want to exclude 1 to be a leader unless all 
> > > > > > other
> > > > > > brokers of this topic/partition are offline.  The “Flapping” effect 
> > > > > > was
> > > > > > seen in the past when 2 or more brokers were bad, when they lost 
> > > > > > leadership
> > > > > > constantly/quickly, the sets of partition replicas they belong to 
> > > > > > will see
> > > > > > leadership constantly changing.  The ultimate solution is to swap 
> > > > > > these bad
> > > > > > hosts.  But for quick mitigation, we can also put the bad hosts in 
> > > > > > the
> > > > > > Preferred Leader Blacklist to move the priority of its being 
> > > > > > elected as
> > > > > > leaders to the lowest.
> > > > > > > *  If the controller is busy serving an extra load of metadata 
> > > > > > > requests
> > > > > > and other tasks. we would like to put the controller's leaders to 
> > > > > > other
> > > > > > brokers to lower its CPU load. currently bouncing to lose 
> > > > > > leadership would
> > > > > > not work for Controller, because after the bounce, the controller 
> > > > > > fails
> > > > > > over to another broker.
> > > > > > > * Avoid bouncing broker in order to lose its leadership: it would 
> > > > > > > be
> > > > > > good if we have a way to specify which broker should be excluded 
> > > > > > from
> > > > > > serving traffic/leadership (without changing the replica assignment
> > > > > > ordering by reassignments, even though that's quick), and run 
> > > > > > preferred
> > > > > > leader election.  A bouncing broker will cause temporary URP, and 
> > > > > > sometimes
> > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can 
> > > > > > temporarily
> > > > > > lose all its leadership, but if another broker (e.g. broker_id 2) 
> > > > > > fails or
> > > > > > gets bounced, some of its leaderships will likely failover to 
> > > > > > broker_id 1
> > > > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, 
> > > > > > then in
> > > > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > > > leadership.
> > > > > > > The current work-around of the above is to change the 
> > > > > > > topic/partition's
> > > > > > replica reassignments to move the broker_id 1 from the first 
> > > > > > position to
> > > > > > the last position and run preferred leader election. e.g. (1, 2, 3) 
> > > > > > => (2,
> > > > > > 3, 1). This changes the replica reassignments, and we need to keep 
> > > > > > track of
> > > > > > the original one and restore if things change (e.g. controller 
> > > > > > fails over
> > > > > > to another broker, the swapped empty broker caught up). That’s a 
> > > > > > rather
> > > > > > tedious task.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > This message was sent by Atlassian JIRA
> > > > > > (v7.6.3#76005)
