I agree with Colin that the same result should be achievable through proper
abstraction in a tool. Even if that amounts to "4 x O(N)" operations, that
is still not a lot of work - the overall complexity is still O(N).
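
For the record, here is roughly what those passes look like when scripted
against the stock tools. The topic names, partition numbers and broker ids
below are made up, and I am assuming the 2.4 tooling (the --bootstrap-server
option that KIP-455 adds to the reassignment tool, and the
kafka-leader-election.sh tool from KIP-460); older releases would go through
--zookeeper and kafka-preferred-replica-election.sh instead:

# Pass 1: demote broker 101 by moving it to the end of each replica list,
# one entry per partition where 101 is currently the preferred leader.
cat > demote-101.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"payments","partition":0,"replicas":[102,103,101]},
  {"topic":"payments","partition":7,"replicas":[104,105,101]}
]}
EOF
bin/kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
  --reassignment-json-file demote-101.json --execute

# Pass 2, hours or days later: re-apply the original ordering that was saved
# before pass 1 (restore-101.json would list "replicas":[101,102,103] again).
bin/kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
  --reassignment-json-file restore-101.json --execute

# Pass 3: run a preferred leader election so the restored ordering takes
# effect.
bin/kafka-leader-election.sh --bootstrap-server broker1:9092 \
  --election-type preferred --all-topic-partitions

Since these reassignments only change the replica ordering and move no data,
each pass should complete almost instantly once KIP-455 is in, as Colin
pointed out below.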

> Let's say a healthy broker is hosting 3000 partitions, of which 1000 it is
> the preferred leader for (leader count is 1000). There is a hardware
> failure (disk/memory, etc.), and the kafka process crashed. We swap this
> host with another host but keep the same broker.id. When this new broker
> comes up, it has no historical data, and we manage to have the current last
> offsets of all partitions set in the replication-offset-checkpoint (if we
> don't set them, it could cause crazy ReplicaFetcher pulling of historical
> data from other brokers and cause high cluster latency and other
> instabilities), so when Kafka is brought up, it quickly catches up as a
> follower in the ISR.  Note, we have auto.leader.rebalance.enable disabled,
> so it's not serving any traffic as a leader (leader count = 0), even though
> there are 1000 partitions for which this broker is the preferred leader.
> We need to keep this broker from serving leader traffic for a few hours or
> days, depending on the SLA of the topic retention requirement, until it
> has enough historical data.


This sounds like a bit of a hack. If that is the concern, why not propose a
KIP that addresses the specific issue? Having a blacklist you control still
seems like a workaround, given that Kafka itself knows when the topic
retention would allow you to switch that replica back to being a leader.

I really hope we can come up with a solution that avoids complicating the
controller and state machine logic further.
Could you please list the main drawbacks of abstracting this away in the
reassignments tool (or a new tool)?
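
To make that question concrete, what I am picturing is a thin wrapper along
these lines. The tool name and flags below are entirely hypothetical - it
would simply drive the KIP-455 reassignment API plus a preferred leader
election, and keep the saved ordering for you:

# Hypothetical wrapper, not an existing tool.
# Snapshot the current replica ordering, then submit reassignments that move
# broker 101 to the end of every replica list where it is preferred leader:
kafka-leader-demotion.sh --bootstrap-server broker1:9092 \
  --demote-broker 101 --save-plan /tmp/broker-101.plan

# Later, when the broker has caught up: re-apply the saved ordering and
# trigger a preferred leader election for the affected partitions:
kafka-leader-demotion.sh --bootstrap-server broker1:9092 \
  --restore-plan /tmp/broker-101.plan --elect

That would keep the replica assignment itself as the single source of truth,
which I believe was Colin's point about not adding a second mechanism.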

On Mon, Sep 9, 2019 at 7:53 AM Colin McCabe <cmcc...@apache.org> wrote:

> On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
> > Hi Colin,
> >           Can you give us more details on why you don't want this to be
> > part of the Kafka core. You are proposing KIP-500 which will take away
> > zookeeper and writing this interim tools to change the zookeeper
> > metadata doesn't make sense to me.
>
> Hi Harsha,
>
> The reassignment API described in KIP-455, which will be part of Kafka
> 2.4, doesn't rely on ZooKeeper.  This API will stay the same after KIP-500
> is implemented.
>
> > As George pointed out there are
> > several benefits having it in the system itself instead of asking users
> > to hack bunch of json files to deal with outage scenario.
>
> In both cases, the user just has to run a shell command, right?  In both
> cases, the user has to remember to undo the command later when they want
> the broker to be treated normally again.  And in both cases, the user
> should probably be running an external rebalancing tool to avoid having to
> run these commands manually. :)
>
> best,
> Colin
>
> >
> > Thanks,
> > Harsha
> >
> > On Fri, Sep 6, 2019 at 4:36 PM George Li <sql_consult...@yahoo.com
> .invalid>
> > wrote:
> >
> > >  Hi Colin,
> > >
> > > Thanks for the feedback.  The "separate set of metadata about
> blacklists"
> > > in KIP-491 is just the list of broker ids. Usually 1 or 2 or a couple
> in
> > > the cluster.  Should be easier than keeping json files?  e.g. what if
> we
> > > first blacklist broker_id_1, then another broker_id_2 has issues, and
> we
> > > need to write out another json file to restore later (and in which
> order)?
> > >  Using blacklist, we can just add the broker_id_2 to the existing one.
> and
> > > remove whatever broker_id returning to good state without worrying
> how(the
> > > ordering of putting the broker to blacklist) to restore.
> > >
> > > For topic level config,  the blacklist will be tied to
> > > topic/partition(e.g.  Configs:
> > > topic.preferred.leader.blacklist=0:101,102;1:103    where 0 & 1 is the
> > > partition#, 101,102,103 are the blacklist broker_ids), and easier to
> > > update/remove, no need for external json files?
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >     On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe <
> > > cmcc...@apache.org> wrote:
> > >
> > >  One possibility would be writing a new command-line tool that would
> > > deprioritize a given replica using the new KIP-455 API.  Then it could
> > > write out a JSON files containing the old priorities, which could be
> > > restored when (or if) we needed to do so.  This seems like it might be
> > > simpler and easier to maintain than a separate set of metadata about
> > > blacklists.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> > > >  Hi,
> > > >
> > > > Just want to ping and bubble up the discussion of KIP-491.
> > > >
> > > > On a large scale of Kafka clusters with thousands of brokers in many
> > > > clusters.  Frequent hardware failures are common, although the
> > > > reassignments to change the preferred leaders is a workaround, it
> > > > incurs unnecessary additional work than the proposed preferred leader
> > > > blacklist in KIP-491, and hard to scale.
> > > >
> > > > I am wondering whether others using Kafka in a big scale running into
> > > > same problem.
> > > >
> > > >
> > > > Satish,
> > > >
> > > > Regarding your previous question about whether there is use-case for
> > > > TopicLevel preferred leader "blacklist",  I thought about one
> > > > use-case:  to improve rebalance/reassignment, the large partition
> will
> > > > usually cause performance/stability issues, planning to change the
> say
> > > > the New Replica will start with Leader's latest offset(this way the
> > > > replica is almost instantly in the ISR and reassignment completed),
> and
> > > > put this partition's NewReplica into Preferred Leader "Blacklist" at
> > > > the Topic Level config for that partition. After sometime(retention
> > > > time), this new replica has caught up and ready to serve traffic,
> > > > update/remove the TopicConfig for this partition's preferred leader
> > > > blacklist.
> > > >
> > > > I will update the KIP-491 later for this use case of Topic Level
> config
> > > > for Preferred Leader Blacklist.
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li
> > > > <sql_consult...@yahoo.com> wrote:
> > > >
> > > >  Hi Colin,
> > > >
> > > > > In your example, I think we're comparing apples and oranges.  You
> > > started by outlining a scenario where "an empty broker... comes up...
> > > [without] any > leadership[s]."  But then you criticize using
> reassignment
> > > to switch the order of preferred replicas because it "would not
> actually
> > > switch the leader > automatically."  If the empty broker doesn't have
> any
> > > leaderships, there is nothing to be switched, right?
> > > >
> > > > Let me explained in details of this particular use case example for
> > > > comparing apples to apples.
> > > >
> > > > Let's say a healthy broker hosting 3000 partitions, and of which 1000
> > > > are the preferred leaders (leader count is 1000). There is a hardware
> > > > failure (disk/memory, etc.), and kafka process crashed. We swap this
> > > > host with another host but keep the same broker.id, when this new
> > > > broker coming up, it has no historical data, and we manage to have
> the
> > > > current last offsets of all partitions set in
> > > > the replication-offset-checkpoint (if we don't set them, it could
> cause
> > > > crazy ReplicaFetcher pulling of historical data from other brokers
> and
> > > > cause cluster high latency and other instabilities), so when Kafka is
> > > > brought up, it is quickly catching up as followers in the ISR.  Note,
> > > > we have auto.leader.rebalance.enable  disabled, so it's not serving
> any
> > > > traffic as leaders (leader count = 0), even there are 1000 partitions
> > > > that this broker is the Preferred Leader.
> > > >
> > > > We need to make this broker not serving traffic for a few hours or
> days
> > > > depending on the SLA of the topic retention requirement until after
> > > > it's having enough historical data.
> > > >
> > > >
> > > > * The traditional way using the reassignments to move this broker in
> > > > that 1000 partitions where it's the preferred leader to the end of
> > > > assignment, this is O(N) operation. and from my experience, we can't
> > > > submit all 1000 at the same time, otherwise cause higher latencies
> even
> > > > the reassignment in this case can complete almost instantly.  After
> a
> > > > few hours/days whatever, this broker is ready to serve traffic,  we
> > > > have to run reassignments again to restore that 1000 partitions
> > > > preferred leaders for this broker: O(N) operation.  then run
> preferred
> > > > leader election O(N) again.  So total 3 x O(N) operations.  The point
> > > > is since the new empty broker is expected to be the same as the old
> one
> > > > in terms of hosting partition/leaders, it would seem unnecessary to
> do
> > > > reassignments (ordering of replica) during the broker catching up
> time.
> > > >
> > > >
> > > >
> > > > * The new feature Preferred Leader "Blacklist":  just need to put a
> > > > dynamic config to indicate that this broker should be considered
> leader
> > > > (preferred leader election or broker failover or unclean leader
> > > > election) to the lowest priority. NO need to run any reassignments.
> > > > After a few hours/days, when this broker is ready, remove the dynamic
> > > > config, and run preferred leader election and this broker will serve
> > > > traffic for that 1000 original partitions it was the preferred
> leader.
> > > > So total  1 x O(N) operation.
> > > >
> > > >
> > > > If auto.leader.rebalance.enable  is enabled,  the Preferred Leader
> > > > "Blacklist" can be put it before Kafka is started to prevent this
> > > > broker serving traffic.  In the traditional way of running
> > > > reassignments, once the broker is up,
> > > > with auto.leader.rebalance.enable  , if leadership starts going to
> this
> > > > new empty broker, it might have to do preferred leader election after
> > > > reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1)
> > > > reassignment only change the ordering, 1 remains as the current
> leader,
> > > > and needs prefer leader election to change to 2 after reassignment.
> so
> > > > potentially one more O(N) operation.
> > > >
> > > > I hope the above example can show how easy to "blacklist" a broker
> > > > serving leadership.  For someone managing Production Kafka cluster,
> > > > it's important to react fast to certain alerts and mitigate/resolve
> > > > some issues. As I listed the other use cases in KIP-291, I think this
> > > > feature can make the Kafka product more easier to manage/operate.
> > > >
> > > > > In general, using an external rebalancing tool like Cruise Control
> is
> > > a good idea to keep things balanced without having deal with manual
> > > rebalancing.  > We expect more and more people who have a complex or
> large
> > > cluster will start using tools like this.
> > > > >
> > > > > However, if you choose to do manual rebalancing, it shouldn't be
> that
> > > bad.  You would save the existing partition ordering before making your
> > > changes, then> make your changes (perhaps by running a simple command
> line
> > > tool that switches the order of the replicas).  Then, once you felt
> like
> > > the broker was ready to> serve traffic, you could just re-apply the old
> > > ordering which you had saved.
> > > >
> > > >
> > > > We do have our own rebalancing tool which has its own criteria like
> > > > Rack diversity,  disk usage,  spread partitions/leaders across all
> > > > brokers in the cluster per topic, leadership Bytes/BytesIn served per
> > > > broker, etc.  We can run reassignments. The point is whether it's
> > > > really necessary, and if there is more effective, easier, safer way
> to
> > > > do it.
> > > >
> > > > take another use case example of taking leadership out of busy
> > > > Controller to give it more power to serve metadata requests and other
> > > > work. The controller can failover, with the preferred leader
> > > > "blacklist",  it does not have to run reassignments again when
> > > > controller failover, just change the blacklisted broker_id.
> > > >
> > > >
> > > > > I was thinking about a PlacementPolicy filling the role of
> preventing
> > > people from creating single-replica partitions on a node that we didn't
> > > want to > ever be the leader.  I thought that it could also prevent
> people
> > > from designating those nodes as preferred leaders during topic
> creation, or
> > > Kafka from doing> itduring random topic creation.  I was assuming that
> the
> > > PlacementPolicy would determine which nodes were which through static
> > > configuration keys.  I agree> static configuration keys are somewhat
> less
> > > flexible than dynamic configuration.
> > > >
> > > >
> > > > I think single-replica partition might not be a good example.  There
> > > > should not be any single-replica partition at all. If yes. it's
> > > > probably because of trying to save disk space with less replicas.  I
> > > > think at least minimum 2. The user purposely creating single-replica
> > > > partition will take full responsibilities of data loss and
> > > > unavailability when a broker fails or under maintenance.
> > > >
> > > >
> > > > I think it would be better to use dynamic instead of static config.
> I
> > > > also think it would be better to have topic creation Policy enforced
> in
> > > > Kafka server OR an external service. We have an external/central
> > > > service managing topic creation/partition expansion which takes into
> > > > account of rack-diversity, replication factor (2, 3 or 4 depending on
> > > > cluster/topic type), Policy replicating the topic between kafka
> > > > clusters, etc.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >
> > > >    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe
> > > > <cmcc...@apache.org> wrote:
> > > >
> > > >  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> > > > >  Hi Colin,
> > > > >
> > > > > Thanks for your feedbacks.  Comments below:
> > > > > > Even if you have a way of blacklisting an entire broker all at
> once,
> > > you still would need to run a leader election > for each partition
> where
> > > you want to move the leader off of the blacklisted broker.  So the
> > > operation is still O(N) in > that sense-- you have to do something per
> > > partition.
> > > > >
> > > > > For a failed broker and swapped with an empty broker, when it comes
> > > up,
> > > > > it will not have any leadership, and we would like it to remain not
> > > > > having leaderships for a couple of hours or days. So there is no
> > > > > preferred leader election needed which incurs O(N) operation in
> this
> > > > > case.  Putting the preferred leader blacklist would safe guard this
> > > > > broker serving traffic during that time. otherwise, if another
> broker
> > > > > fails(if this broker is the 1st, 2nd in the assignment), or someone
> > > > > runs preferred leader election, this new "empty" broker can still
> get
> > > > > leaderships.
> > > > >
> > > > > Also running reassignment to change the ordering of preferred
> leader
> > > > > would not actually switch the leader automatically.  e.g.  (1,2,3)
> =>
> > > > > (2,3,1). unless preferred leader election is run to switch current
> > > > > leader from 1 to 2.  So the operation is at least 2 x O(N).  and
> then
> > > > > after the broker is back to normal, another 2 x O(N) to rollback.
> > > >
> > > > Hi George,
> > > >
> > > > Hmm.  I guess I'm still on the fence about this feature.
> > > >
> > > > In your example, I think we're comparing apples and oranges.  You
> > > > started by outlining a scenario where "an empty broker... comes up...
> > > > [without] any leadership[s]."  But then you criticize using
> > > > reassignment to switch the order of preferred replicas because it
> > > > "would not actually switch the leader automatically."  If the empty
> > > > broker doesn't have any leaderships, there is nothing to be switched,
> > > > right?
> > > >
> > > > >
> > > > >
> > > > > > In general, reassignment will get a lot easier and quicker once
> > > KIP-455 is implemented.  > Reassignments that just change the order of
> > > preferred replicas for a specific partition should complete pretty much
> > > instantly.
> > > > > >> I think it's simpler and easier just to have one source of truth
> > > for what the preferred replica is for a partition, rather than two.  So
> > > for> me, the fact that the replica assignment ordering isn't changed is
> > > actually a big disadvantage of this KIP.  If you are a new user (or
> just>
> > > an existing user that didn't read all of the documentation) and you
> just
> > > look at the replica assignment, you might be confused by why> a
> particular
> > > broker wasn't getting any leaderships, even  though it appeared like it
> > > should.  More mechanisms mean more complexity> for users and developers
> > > most of the time.
> > > > >
> > > > >
> > > > > I would like stress the point that running reassignment to change
> the
> > > > > ordering of the replica (putting a broker to the end of partition
> > > > > assignment) is unnecessary, because after some time the broker is
> > > > > caught up, it can start serving traffic and then need to run
> > > > > reassignments again to "rollback" to previous states. As I
> mentioned
> > > in
> > > > > KIP-491, this is just tedious work.
> > > >
> > > > In general, using an external rebalancing tool like Cruise Control
> is a
> > > > good idea to keep things balanced without having deal with manual
> > > > rebalancing.  We expect more and more people who have a complex or
> > > > large cluster will start using tools like this.
> > > >
> > > > However, if you choose to do manual rebalancing, it shouldn't be that
> > > > bad.  You would save the existing partition ordering before making
> your
> > > > changes, then make your changes (perhaps by running a simple command
> > > > line tool that switches the order of the replicas).  Then, once you
> > > > felt like the broker was ready to serve traffic, you could just
> > > > re-apply the old ordering which you had saved.
> > > >
> > > > >
> > > > > I agree this might introduce some complexities for
> users/developers.
> > > > > But if this feature is good, and well documented, it is good for
> the
> > > > > kafka product/community.  Just like KIP-460 enabling unclean leader
> > > > > election to override TopicLevel/Broker Level config of
> > > > > `unclean.leader.election.enable`
> > > > >
> > > > > > I agree that it would be nice if we could treat some brokers
> > > differently for the purposes of placing replicas, selecting leaders,
> etc. >
> > > Right now, we don't have any way of implementing that without forking
> the
> > > broker.  I would support a new PlacementPolicy class that> would close
> this
> > > gap.  But I don't think this KIP is flexible enough to fill this
> role.  For
> > > example, it can't prevent users from creating> new single-replica
> topics
> > > that get put on the "bad" replica.  Perhaps we should reopen the
> > > discussion> about
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > > > >
> > > > > Creating topic with single-replica is beyond what KIP-491 is
> trying to
> > > > > achieve.  The user needs to take responsibility of doing that. I do
> > > see
> > > > > some Samza clients notoriously creating single-replica topics and
> that
> > > > > got flagged by alerts, because a single broker down/maintenance
> will
> > > > > cause offline partitions. For KIP-491 preferred leader "blacklist",
> > > > > the single-replica will still serve as leaders, because there is no
> > > > > other alternative replica to be chosen as leader.
> > > > >
> > > > > Even with a new PlacementPolicy for topic creation/partition
> > > expansion,
> > > > > it still needs the blacklist info (e.g. a zk path node, or broker
> > > > > level/topic level config) to "blacklist" the broker to be preferred
> > > > > leader? Would it be the same as KIP-491 is introducing?
> > > >
> > > > I was thinking about a PlacementPolicy filling the role of preventing
> > > > people from creating single-replica partitions on a node that we
> didn't
> > > > want to ever be the leader.  I thought that it could also prevent
> > > > people from designating those nodes as preferred leaders during topic
> > > > creation, or Kafka from doing itduring random topic creation.  I was
> > > > assuming that the PlacementPolicy would determine which nodes were
> > > > which through static configuration keys.  I agree static
> configuration
> > > > keys are somewhat less flexible than dynamic configuration.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> > > > > <cmcc...@apache.org> wrote:
> > > > >
> > > > >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > > > > >  Hi Colin,
> > > > > > Thanks for looking into this KIP.  Sorry for the late response.
> been
> > > busy.
> > > > > >
> > > > > > If a cluster has MAMY topic partitions, moving this "blacklist"
> > > broker
> > > > > > to the end of replica list is still a rather "big" operation,
> > > involving
> > > > > > submitting reassignments.  The KIP-491 way of blacklist is much
> > > > > > simpler/easier and can undo easily without changing the replica
> > > > > > assignment ordering.
> > > > >
> > > > > Hi George,
> > > > >
> > > > > Even if you have a way of blacklisting an entire broker all at
> once,
> > > > > you still would need to run a leader election for each partition
> where
> > > > > you want to move the leader off of the blacklisted broker.  So the
> > > > > operation is still O(N) in that sense-- you have to do something
> per
> > > > > partition.
> > > > >
> > > > > In general, reassignment will get a lot easier and quicker once
> > > KIP-455
> > > > > is implemented.  Reassignments that just change the order of
> preferred
> > > > > replicas for a specific partition should complete pretty much
> > > instantly.
> > > > >
> > > > > I think it's simpler and easier just to have one source of truth
> for
> > > > > what the preferred replica is for a partition, rather than two.  So
> > > for
> > > > > me, the fact that the replica assignment ordering isn't changed is
> > > > > actually a big disadvantage of this KIP.  If you are a new user (or
> > > > > just an existing user that didn't read all of the documentation)
> and
> > > > > you just look at the replica assignment, you might be confused by
> why
> > > a
> > > > > particular broker wasn't getting any leaderships, even  though it
> > > > > appeared like it should.  More mechanisms mean more complexity for
> > > > > users and developers most of the time.
> > > > >
> > > > > > Major use case for me, a failed broker got swapped with new
> > > hardware,
> > > > > > and starts up as empty (with latest offset of all partitions),
> the
> > > SLA
> > > > > > of retention is 1 day, so before this broker is up to be in-sync
> for
> > > 1
> > > > > > day, we would like to blacklist this broker from serving traffic.
> > > after
> > > > > > 1 day, the blacklist is removed and run preferred leader
> election.
> > > > > > This way, no need to run reassignments before/after.  This is the
> > > > > > "temporary" use-case.
> > > > >
> > > > > What if we just add an option to the reassignment tool to generate
> a
> > > > > plan to move all the leaders off of a specific broker?  The tool
> could
> > > > > also run a leader election as well.  That would be a simple way of
> > > > > doing this without adding new mechanisms or broker-side
> > > configurations,
> > > > > etc.
> > > > >
> > > > > >
> > > > > > There are use-cases that this Preferred Leader "blacklist" can be
> > > > > > somewhat permanent, as I explained in the AWS data center
> instances
> > > Vs.
> > > > > > on-premises data center bare metal machines (heterogenous
> hardware),
> > > > > > that the AWS broker_ids will be blacklisted.  So new topics
> > > created,
> > > > > > or existing topic expansion would not make them serve traffic
> even
> > > they
> > > > > > could be the preferred leader.
> > > > >
> > > > > I agree that it would be nice if we could treat some brokers
> > > > > differently for the purposes of placing replicas, selecting
> leaders,
> > > > > etc.  Right now, we don't have any way of implementing that without
> > > > > forking the broker.  I would support a new PlacementPolicy class
> that
> > > > > would close this gap.  But I don't think this KIP is flexible
> enough
> > > to
> > > > > fill this role.  For example, it can't prevent users from creating
> new
> > > > > single-replica topics that get put on the "bad" replica.  Perhaps
> we
> > > > > should reopen the discussion about
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > > > >
> > > > > regards,
> > > > > Colin
> > > > >
> > > > > >
> > > > > > Please let me know there are more question.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > > > > > <cmcc...@apache.org> wrote:
> > > > > >
> > > > > >  We still want to give the "blacklisted" broker the leadership if
> > > > > > nobody else is available.  Therefore, isn't putting a broker on
> the
> > > > > > blacklist pretty much the same as moving it to the last entry in
> the
> > > > > > replicas list and then triggering a preferred leader election?
> > > > > >
> > > > > > If we want this to be undone after a certain amount of time, or
> > > under
> > > > > > certain conditions, that seems like something that would be more
> > > > > > effectively done by an external system, rather than putting all
> > > these
> > > > > > policies into Kafka.
> > > > > >
> > > > > > best,
> > > > > > Colin
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > > > > >  Hi Satish,
> > > > > > > Thanks for the reviews and feedbacks.
> > > > > > >
> > > > > > > > > The following is the requirements this KIP is trying to
> > > accomplish:
> > > > > > > > This can be moved to the"Proposed changes" section.
> > > > > > >
> > > > > > > Updated the KIP-491.
> > > > > > >
> > > > > > > > >>The logic to determine the priority/order of which broker
> > > should be
> > > > > > > > preferred leader should be modified.  The broker in the
> > > preferred leader
> > > > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > > > determining leadership.
> > > > > > > >
> > > > > > > > I believe there is no change required in the ordering of the
> > > preferred
> > > > > > > > replica list. Brokers in the preferred leader blacklist are
> > > skipped
> > > > > > > > until other brokers int he list are unavailable.
> > > > > > >
> > > > > > > Yes. partition assignment remained the same, replica &
> ordering.
> > > The
> > > > > > > blacklist logic can be optimized during implementation.
> > > > > > >
> > > > > > > > >>The blacklist can be at the broker level. However, there
> might
> > > be use cases
> > > > > > > > where a specific topic should blacklist particular brokers,
> which
> > > > > > > > would be at the
> > > > > > > > Topic level Config. For this use cases of this KIP, it seems
> > > that broker level
> > > > > > > > blacklist would suffice.  Topic level preferred leader
> blacklist
> > > might
> > > > > > > > be future enhancement work.
> > > > > > > >
> > > > > > > > I agree that the broker level preferred leader blacklist
> would be
> > > > > > > > sufficient. Do you have any use cases which require topic
> level
> > > > > > > > preferred blacklist?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I don't have any concrete use cases for Topic level preferred
> > > leader
> > > > > > > blacklist.  One scenarios I can think of is when a broker has
> high
> > > CPU
> > > > > > > usage, trying to identify the big topics (High MsgIn, High
> > > BytesIn,
> > > > > > > etc), then try to move the leaders away from this broker,
> before
> > > doing
> > > > > > > an actual reassignment to change its preferred leader,  try to
> put
> > > this
> > > > > > > preferred_leader_blacklist in the Topic Level config, and run
> > > preferred
> > > > > > > leader election, and see whether CPU decreases for this broker,
> > > if
> > > > > > > yes, then do the reassignments to change the preferred leaders
> to
> > > be
> > > > > > > "permanent" (the topic may have many partitions like 256 that
> has
> > > quite
> > > > > > > a few of them having this broker as preferred leader).  So this
> > > Topic
> > > > > > > Level config is an easy way of doing trial and check the
> result.
> > > > > > >
> > > > > > >
> > > > > > > > You can add the below workaround as an item in the rejected
> > > alternatives section
> > > > > > > > "Reassigning all the topic/partitions which the intended
> broker
> > > is a
> > > > > > > > replica for."
> > > > > > >
> > > > > > > Updated the KIP-491.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > George
> > > > > > >
> > > > > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > > > > > <satish.dugg...@gmail.com> wrote:
> > > > > > >
> > > > > > >  Thanks for the KIP. I have put my comments below.
> > > > > > >
> > > > > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > > > >
> > > > > > > >> The following is the requirements this KIP is trying to
> > > accomplish:
> > > > > > >   The ability to add and remove the preferred leader
> deprioritized
> > > > > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > > > >
> > > > > > > This can be moved to the"Proposed changes" section.
> > > > > > >
> > > > > > > >>The logic to determine the priority/order of which broker
> should
> > > be
> > > > > > > preferred leader should be modified.  The broker in the
> preferred
> > > leader
> > > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > > determining leadership.
> > > > > > >
> > > > > > > I believe there is no change required in the ordering of the
> > > preferred
> > > > > > > replica list. Brokers in the preferred leader blacklist are
> skipped
> > > > > > > until other brokers int he list are unavailable.
> > > > > > >
> > > > > > > >>The blacklist can be at the broker level. However, there
> might
> > > be use cases
> > > > > > > where a specific topic should blacklist particular brokers,
> which
> > > > > > > would be at the
> > > > > > > Topic level Config. For this use cases of this KIP, it seems
> that
> > > broker level
> > > > > > > blacklist would suffice.  Topic level preferred leader
> blacklist
> > > might
> > > > > > > be future enhancement work.
> > > > > > >
> > > > > > > I agree that the broker level preferred leader blacklist would
> be
> > > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > > preferred blacklist?
> > > > > > >
> > > > > > > You can add the below workaround as an item in the rejected
> > > alternatives section
> > > > > > > "Reassigning all the topic/partitions which the intended
> broker is
> > > a
> > > > > > > replica for."
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Satish.
> > > > > > >
> > > > > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > > > > <stanis...@confluent.io> wrote:
> > > > > > > >
> > > > > > > > Hey George,
> > > > > > > >
> > > > > > > > Thanks for the KIP, it's an interesting idea.
> > > > > > > >
> > > > > > > > I was wondering whether we could achieve the same thing via
> the
> > > > > > > > kafka-reassign-partitions tool. As you had also said in the
> > > JIRA,  it is
> > > > > > > > true that this is currently very tedious with the tool. My
> > > thoughts are
> > > > > > > > that we could improve the tool and give it the notion of a
> > > "blacklisted
> > > > > > > > preferred leader".
> > > > > > > > This would have some benefits like:
> > > > > > > > - more fine-grained control over the blacklist. we may not
> want
> > > to
> > > > > > > > blacklist all the preferred leaders, as that would make the
> > > blacklisted
> > > > > > > > broker a follower of last resort which is not very useful. In
> > > the cases of
> > > > > > > > an underpowered AWS machine or a controller, you might
> overshoot
> > > and make
> > > > > > > > the broker very underutilized if you completely make it
> > > leaderless.
> > > > > > > > - is not permanent. If we are to have a blacklist leaders
> config,
> > > > > > > > rebalancing tools would also need to know about it and
> > > manipulate/respect
> > > > > > > > it to achieve a fair balance.
> > > > > > > > It seems like both problems are tied to balancing partitions,
> > > it's just
> > > > > > > > that KIP-491's use case wants to balance them against other
> > > factors in a
> > > > > > > > more nuanced way. It makes sense to have both be done from
> the
> > > same place
> > > > > > > >
> > > > > > > > To make note of the motivation section:
> > > > > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > > > > The recommended way to make a broker lose its leadership is
> to
> > > run a
> > > > > > > > reassignment on its partitions
> > > > > > > > > The cross-data center cluster has AWS cloud instances which
> > > have less
> > > > > > > > computing power
> > > > > > > > We recommend running Kafka on homogeneous machines. It would
> be
> > > cool if the
> > > > > > > > system supported more flexibility in that regard but that is
> > > more nuanced
> > > > > > > > and a preferred leader blacklist may not be the best first
> > > approach to the
> > > > > > > > issue
> > > > > > > >
> > > > > > > > Adding a new config which can fundamentally change the way
> > > replication is
> > > > > > > > done is complex, both for the system (the replication code is
> > > complex
> > > > > > > > enough) and the user. Users would have another potential
> config
> > > that could
> > > > > > > > backfire on them - e.g if left forgotten.
> > > > > > > >
> > > > > > > > Could you think of any downsides to implementing this
> > > functionality (or a
> > > > > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > > > > One downside I can see is that we would not have it handle
> new
> > > partitions
> > > > > > > > created after the "blacklist operation". As a first
> iteration I
> > > think that
> > > > > > > > may be acceptable
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Stanislav
> > > > > > > >
> > > > > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <
> > > sql_consult...@yahoo.com.invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >  Hi,
> > > > > > > > >
> > > > > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > > > > )
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > George
> > > > > > > > >
> > > > > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > > > > sql_consult...@yahoo.com.INVALID> wrote:
> > > > > > > > >
> > > > > > > > >  Hi,
> > > > > > > > >
> > > > > > > > > I have created KIP-491 (
> > > > > > > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > )
> > > > > > > > > for putting a broker to the preferred leader blacklist or
> > > deprioritized
> > > > > > > > > list so when determining leadership,  it's moved to the
> lowest
> > > priority for
> > > > > > > > > some of the listed use-cases.
> > > > > > > > >
> > > > > > > > > Please provide your comments/feedbacks.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > George
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia
> > > Sancio (JIRA) <
> > > > > > > > > j...@apache.org>To: "sql_consult...@yahoo.com" <
> > > sql_consult...@yahoo.com>Sent:
> > > > > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira]
> > > [Commented]
> > > > > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized
> list)
> > > > > > > > >
> > > > > > > > >    [
> > > > > > > > >
> > >
> https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > > > > ]
> > > > > > > > >
> > > > > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > > > > ---------------------------------------------------
> > > > > > > > >
> > > > > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > > > > >
> > > > > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > > > > -----------------------------------------------
> > > > > > > > > >
> > > > > > > > > >                Key: KAFKA-8638
> > > > > > > > > >                URL:
> > > https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > > > > >            Project: Kafka
> > > > > > > > > >          Issue Type: Improvement
> > > > > > > > > >          Components: config, controller, core
> > > > > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > > > > >            Reporter: GEORGE LI
> > > > > > > > > >            Assignee: GEORGE LI
> > > > > > > > > >            Priority: Major
> > > > > > > > > >
> > > > > > > > > > Currently, the kafka preferred leader election will pick
> the
> > > broker_id
> > > > > > > > > in the topic/partition replica assignments in a priority
> order
> > > when the
> > > > > > > > > broker is in ISR. The preferred leader is the broker id in
> the
> > > first
> > > > > > > > > position of replica. There are use-cases that, even the
> first
> > > broker in the
> > > > > > > > > replica assignment is in ISR, there is a need for it to be
> > > moved to the end
> > > > > > > > > of ordering (lowest priority) when deciding leadership
> during
> > > preferred
> > > > > > > > > leader election.
> > > > > > > > > > Let’s use topic/partition replica (1,2,3) as an example.
> 1
> > > is the
> > > > > > > > > preferred leader.  When preferred leadership is run, it
> will
> > > pick 1 as the
> > > > > > > > > leader if it's ISR, if 1 is not online and in ISR, then
> pick
> > > 2, if 2 is not
> > > > > > > > > in ISR, then pick 3 as the leader. There are use cases
> that,
> > > even 1 is in
> > > > > > > > > ISR, we would like it to be moved to the end of ordering
> > > (lowest priority)
> > > > > > > > > when deciding leadership during preferred leader election.
> > > Below is a list
> > > > > > > > > of use cases:
> > > > > > > > > > * (If broker_id 1 is a swapped failed host and brought up
> > > with last
> > > > > > > > > segments or latest offset without historical data (There is
> > > another effort
> > > > > > > > > on this), it's better for it to not serve leadership till
> it's
> > > caught-up.
> > > > > > > > > > * The cross-data center cluster has AWS instances which
> have
> > > less
> > > > > > > > > computing power than the on-prem bare metal machines.  We
> > > could put the AWS
> > > > > > > > > broker_ids in Preferred Leader Blacklist, so on-prem
> brokers
> > > can be elected
> > > > > > > > > leaders, without changing the reassignments ordering of the
> > > replicas.
> > > > > > > > > > * If the broker_id 1 is constantly losing leadership
> after
> > > some time:
> > > > > > > > > "Flapping". we would want to exclude 1 to be a leader
> unless
> > > all other
> > > > > > > > > brokers of this topic/partition are offline.  The
> “Flapping”
> > > effect was
> > > > > > > > > seen in the past when 2 or more brokers were bad, when they
> > > lost leadership
> > > > > > > > > constantly/quickly, the sets of partition replicas they
> belong
> > > to will see
> > > > > > > > > leadership constantly changing.  The ultimate solution is
> to
> > > swap these bad
> > > > > > > > > hosts.  But for quick mitigation, we can also put the bad
> > > hosts in the
> > > > > > > > > Preferred Leader Blacklist to move the priority of its
> being
> > > elected as
> > > > > > > > > leaders to the lowest.
> > > > > > > > > > *  If the controller is busy serving an extra load of
> > > metadata requests
> > > > > > > > > and other tasks. we would like to put the controller's
> leaders
> > > to other
> > > > > > > > > brokers to lower its CPU load. currently bouncing to lose
> > > leadership would
> > > > > > > > > not work for Controller, because after the bounce, the
> > > controller fails
> > > > > > > > > over to another broker.
> > > > > > > > > > * Avoid bouncing broker in order to lose its leadership:
> it
> > > would be
> > > > > > > > > good if we have a way to specify which broker should be
> > > excluded from
> > > > > > > > > serving traffic/leadership (without changing the replica
> > > assignment
> > > > > > > > > ordering by reassignments, even though that's quick), and
> run
> > > preferred
> > > > > > > > > leader election.  A bouncing broker will cause temporary
> URP,
> > > and sometimes
> > > > > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1)
> > > can temporarily
> > > > > > > > > lose all its leadership, but if another broker (e.g.
> broker_id
> > > 2) fails or
> > > > > > > > > gets bounced, some of its leaderships will likely failover
> to
> > > broker_id 1
> > > > > > > > > on a replica with 3 brokers.  If broker_id 1 is in the
> > > blacklist, then in
> > > > > > > > > such a scenario even broker_id 2 offline,  the 3rd broker
> can
> > > take
> > > > > > > > > leadership.
> > > > > > > > > > The current work-around of the above is to change the
> > > topic/partition's
> > > > > > > > > replica reassignments to move the broker_id 1 from the
> first
> > > position to
> > > > > > > > > the last position and run preferred leader election. e.g.
> (1,
> > > 2, 3) => (2,
> > > > > > > > > 3, 1). This changes the replica reassignments, and we need
> to
> > > keep track of
> > > > > > > > > the original one and restore if things change (e.g.
> controller
> > > fails over
> > > > > > > > > to another broker, the swapped empty broker caught up).
> That’s
> > > a rather
> > > > > > > > > tedious task.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > This message was sent by Atlassian JIRA
> > > > > > > > > (v7.6.3#76005)
> >
>


-- 
Best,
Stanislav
