Sorry about the format... It seems the Apache mailing list does not support tables. Please refer to the following sheet:
https://docs.google.com/spreadsheets/d/1eKlQC1Ty74qVrIDH9vk7S6hEs_eJAyOufFUZaIU-k28/edit?usp=sharing

Thanks,

Jiangjie (Becket) Qin

On Tue, Sep 1, 2015 at 1:01 PM, Aditya Auradkar <aaurad...@linkedin.com> wrote:

Hey -

So what is remaining here? Becket, did you send a follow-up email regarding the data volume? For some reason, I don't receive emails sent by you.

Aditya

On Mon, Aug 31, 2015 at 9:52 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

Thanks Becket - I think your follow-up email to the thread states our concerns and positions clearly. I would really like to make sure we are very clear and firm on the practical aspects of the changes they are pushing for. We should continue to stay on top of this and make sure they don't push changes to the consumer in a hurry for 0.8.3 without due diligence.

Joel

On Sun, Aug 30, 2015 at 12:00 AM, Jiangjie Qin <j...@linkedin.com> wrote:

Hi Joel,

I was trying to calculate the number but found it might be better to run some actual tests, for the following reasons:
1. Cross-colo network bandwidth usage is not clear to me yet.
2. The coordinators of the four mirror maker consumer groups may or may not reside on the same brokers, so the impact would be different.
3. I am not sure about the network bandwidth usage of normal traffic on the brokers, i.e., how much bandwidth is actually available for consumer rebalance. Currently the outbound traffic on brokers is not balanced.

The above factors can impact the actual rebalance time significantly. So I just put a quick reply on the mail thread with some actual numbers we have and wanted to update with the test results later. But I agree we should do it soon.

Thanks,

Jiangjie (Becket) Qin

On Sat, Aug 29, 2015 at 9:40 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

No - I thought we agreed we would calculate the exact bandwidth (bytes) required for a rebalance, its duration, and thus whether it makes sense or not. I.e., that we would come up with exact numbers for the scenarios of interest in the email that Becket sent.

On Saturday, August 29, 2015, Kartik Paramasivam <kparamasi...@linkedin.com> wrote:

Isn't that what Becket is also saying?

On Aug 28, 2015, at 10:12 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

I thought we were going to run the numbers ourselves and tell them if we are okay with it or not?

---------- Forwarded message ----------
From: Jiangjie Qin <j...@linkedin.com.invalid>
Date: Friday, August 28, 2015
Subject: [DISCUSS] Client-side Assignment for New Consumer
To: dev@kafka.apache.org

Hi Neha,

Following are some numbers we have in the pipeline. It would be very helpful to see how it goes with the proposed protocol. We will try to do some tests with the current patch as well. Please also let us know if you want further information.

32 brokers, 1Gbps NIC
547 topics
27 chars average topic name length
2-3 consumers for each topic

Four 26-node mirror maker instances (four different consumer groups). Each node has 4 consumers, so each mirror maker instance has 104 consumers.
We are actually using selective copy, so we have a big whitelist for each mirror maker, copying about 100 topics (we expect it to grow to a couple of hundred).
The mirror makers are co-located with the target cluster, so the consumer traffic goes through the WAN.

We have 5 to 6 wildcard consumers consuming from all the topics.

The topic creation frequency is not high now, roughly about 1 per day.

The scenarios we are interested in are:
1. The time for one round of rebalance.
2. The time for a rolling bounce of a mirror maker.
3. For wildcard topics, does metadata sync-up cause problems?

Thanks,

Jiangjie (Becket) Qin
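As a back-of-envelope check on the numbers above: if the coordinator echoes every member's subscription back to every member, the metadata volume per rebalance grows with the square of the group size. A minimal sketch, assuming a few bytes of per-topic overhead for length and partition-count fields (real request framing adds more):

// Rough estimate of rebalance metadata volume for one 104-consumer
// mirror-maker group, using the numbers from this thread. The per-topic
// overhead is an assumption for illustration only.
public class RebalanceMetadataEstimate {
    public static void main(String[] args) {
        int members = 104;            // 26 nodes * 4 consumers per node
        int topicsPerWhitelist = 100; // explicit whitelist size mentioned above
        int avgTopicNameLen = 27;     // average topic name length
        int perTopicOverhead = 6;     // assumed: 2-byte length + 4-byte partition count

        long subscriptionBytes =
            (long) topicsPerWhitelist * (avgTopicNameLen + perTopicOverhead);
        // The coordinator returns every member's subscription to every member:
        long perRebalanceBytes = (long) members * members * subscriptionBytes;

        System.out.printf("Per-member subscription: ~%.1f KB%n",
            subscriptionBytes / 1024.0);
        System.out.printf("Per rebalance: ~%.1f MB%n",
            perRebalanceBytes / (1024.0 * 1024.0));
    }
}

Under these assumptions, a single rebalance of one group moves roughly 34 MB through the coordinator, which is why the whitelist size matters so much in this discussion.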
On Fri, Aug 28, 2015 at 1:24 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

Another use-case I was thinking of was something like rack-aware assignment of partitions to clients. This would require some additional topic metadata to be propagated to and from the coordinator, and you would need some way to resolve conflicts for such strategies. I think that could be addressed by attaching a generation id to the metadata and using that (i.e., picking the highest) in order to resolve conflicts without another round of join-group requests.

Likewise, without delete/recreate, partition counts are a sort of generation id since they are non-decreasing. If we need to account for delete/recreate, that could perhaps be addressed by an explicit (per-topic) generation id attached to each topic in the metadata blob. Does that make sense? I think that covers my concerns wrt the split-brain issues.

I'm still a bit wary of the n^2*m sized rebroadcast of all the metadata - mainly because for various reasons at LinkedIn, we are actually using large explicit whitelists (and not wildcards) in several of our mirroring pipelines. At this point I feel that is a reasonable cost to pay for having all the logic in one place (i.e., client side), but I would like to think a bit more on that.

Joel

On Fri, Aug 28, 2015 at 1:02 PM, Onur Karaman <okara...@linkedin.com.invalid> wrote:

From what I understand, the "largest number of partitions" trick is based on the assumption that topics can only expand their partitions. What happens when a topic gets deleted and recreated? This breaks that assumption.

On Fri, Aug 28, 2015 at 6:33 AM, Neha Narkhede <n...@confluent.io> wrote:

Thanks for re-reviewing, Joel.

On Fri, Aug 28, 2015 at 2:51 AM -0700, "Joel Koshy" <jjkosh...@gmail.com> wrote:

> I think we think this proposal addresses 100% of the split brain issues
> ever seen in the ZK-based protocol, but I think you think there are still
> issues. Can you explain what you're thinking of and when you think it
> would happen? I want to make sure you aren't assuming
> client-side=>split-brain since I think that is totally not the case.

Yes, I had concluded that client-side assignment would still result in split-brain wrt partition counts, but I overlooked a key sentence in the wiki - i.e., that the assignment algorithm for consumers can just use the largest number of partitions for each topic reported by any of the consumers. I.e., I assumed that consumers would just fail the rebalance if the partition counts were inconsistent, but that is not the case since this conflict can be easily resolved as described, without further join-group requests. Sorry about that. There is still the issue of the coordinator having to send back n*m worth of metadata, but that was not my biggest concern. I'll look over it again and reply back tomorrow.

Joel
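The resolution rule discussed above (take the largest reported partition count, or, to handle Onur's delete/recreate case, the highest per-topic generation id) is compact enough to sketch. TopicView and its fields are hypothetical names for illustration, not part of the actual proposal:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Resolve disagreements between members' metadata views without another
// join-group round: per topic, prefer the view with the highest generation
// id, falling back to the largest partition count within a generation.
public class MetadataConflictResolver {
    static class TopicView {
        final int generationId;    // would bump on topic delete/recreate
        final int partitionCount;  // non-decreasing within one generation
        TopicView(int generationId, int partitionCount) {
            this.generationId = generationId;
            this.partitionCount = partitionCount;
        }
    }

    static Map<String, TopicView> resolve(List<Map<String, TopicView>> memberViews) {
        Map<String, TopicView> resolved = new HashMap<>();
        for (Map<String, TopicView> view : memberViews)
            for (Map.Entry<String, TopicView> e : view.entrySet())
                resolved.merge(e.getKey(), e.getValue(), (a, b) -> {
                    if (a.generationId != b.generationId)
                        return a.generationId > b.generationId ? a : b;
                    return a.partitionCount >= b.partitionCount ? a : b;
                });
        return resolved;
    }
}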
On Thu, Aug 27, 2015 at 2:55 PM, Jay Kreps wrote:

Hey Joel,

I really don't think we should do both. There are pros and cons, but we should make a decision and work on operationalizing one approach. Much of really making something like this work is getting all the bugs out, getting monitoring in place, getting rigorous system tests in place. Trying to do those things twice with the same resources will just mean we do them half as well. I also think this buys nothing from the user's point of view--they want co-ordination that works correctly; the debate we are having is purely a "how should we build that" debate. So this is really not the kind of thing we'd want to make pluggable, and if we did, that would just complicate life for the user.

I think we think this proposal addresses 100% of the split brain issues ever seen in the ZK-based protocol, but I think you think there are still issues. Can you explain what you're thinking of and when you think it would happen? I want to make sure you aren't assuming client-side=>split-brain since I think that is totally not the case.

With respect to "herd issues" I actually think all the proposals address this by scaling the co-ordinator out to all nodes and making the co-ordination vastly cheaper. No proposal, of course, gets rid of the fact that all clients rejoin at once when there is a membership change, but that is kind of fundamental to the problem.

-Jay

On Thu, Aug 27, 2015 at 2:02 PM, Joel Koshy wrote:

I actually feel this set of tests (whatever they may be) is somewhat irrelevant here.
My main concern with the current client-side proposal (i.e., without Becket's follow-up suggestions) is that it makes a significant compromise to the original charter of the new consumer - i.e., reduce/eliminate herd and split brain problems in both group management and partition assignment. I understand the need for client-side partition assignment in some use cases (which we are also interested in), but I also think we should make every effort to keep full server-side coordination for the remaining (majority) of use cases, especially if it does not complicate the protocol. The proposed changes do not complicate the protocol IMO - i.e., there is no further modification to the request/response formats beyond the current client-side proposal. It only involves a trivial reinterpretation of the content of the protocol metadata field.

Joel

On Wed, Aug 26, 2015 at 9:33 PM, Neha Narkhede wrote:

Hey Becket,

> In that case, the broker side partition assignment would be ideal because
> it avoids issues like metadata inconsistency / split brain / exploding
> subscription set propagation.

As per our previous discussions regarding each of those concerns (referring to this email thread, KIP calls and JIRA comments), we are going to run a set of tests using the LinkedIn deployment numbers that we will wait for you to share. The purpose is to see if those concerns are really valid or not. I'd prefer to see that before making any more changes that will complicate the protocol.

On Wed, Aug 26, 2015 at 4:57 PM, Jiangjie Qin wrote:

Hi folks,

After further discussion at LinkedIn, we found that while having a more general group management protocol is very useful, the vast majority of clients will not use a customized partition assignment strategy. In that case, broker-side partition assignment would be ideal because it avoids issues like metadata inconsistency / split brain / exploding subscription set propagation.

So we have the following proposal that satisfies the majority of clients' needs without changing the currently proposed binary protocol, i.e., continue to support broker-side assignment if the assignment strategy is recognized by the coordinator.
1. Keep the binary protocol as currently proposed.

2. Change the way we interpret ProtocolMetadata:

2.1 On the consumer side, change partition.assignment.strategy to partition.assignor.class. Implement something like the following PartitionAssignor interface (a sketch of a concrete assignor follows after this message):

public interface PartitionAssignor {
    List<String> protocolTypes();
    byte[] protocolMetadata();
    // Return the topic-partitions that are assigned to this consumer.
    List<TopicPartition> assignPartitions(String protocolType, byte[] responseProtocolMetadata);
}

public abstract class AbstractPartitionAssignor implements PartitionAssignor {
    protected final KafkaConsumer consumer;

    AbstractPartitionAssignor(KafkaConsumer consumer) {
        this.consumer = consumer;
    }
}

2.2 The ProtocolMetadata in the JoinGroupRequest will be partitionAssignor.protocolMetadata(). When partition.assignor.class is "range" or "roundrobin", the ProtocolMetadata in the JoinGroupRequest will be a JSON subscription set. ("range" and "roundrobin" will be reserved words; we can also consider reserving some prefix such as "broker-" to be clearer.)

2.3 On the broker side, when the ProtocolType is "range" or "roundrobin", the coordinator will parse the ProtocolMetadata in the JoinGroupRequest and assign the partitions for consumers. In the JoinGroupResponse, the ProtocolMetadata will be the global assignment of partitions.

2.4 On the client side, after receiving the JoinGroupResponse, partitionAssignor.assignPartitions() will be invoked to return the actual assignment. If the assignor is RangeAssignor or RoundRobinAssignor, it will parse the assignment from the ProtocolMetadata returned by the coordinator.

This approach has a few merits:
1. Does not change the proposed binary protocol, which is still general.
2. The majority of consumers will not suffer from inconsistent metadata / split brain / exploding subscription set propagation. This is specifically to deal with the issue that the current proposal caters to a 20% use-case while adversely impacting the more common 80% of use-cases.
3. Easy to implement. The only thing needed is to implement a partitioner class. For most users, the default range and roundrobin partitioners are good enough.

Thoughts?

Thanks,

Jiangjie (Becket) Qin
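For illustration, a minimal sketch of a concrete assignor under the interface proposed above, for the broker-side "roundrobin" case described in points 2.2-2.4. The comma-separated wire format and the "topic:partition" response encoding are assumptions made only to keep the sketch short; the proposal leaves ProtocolMetadata as opaque bytes:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.kafka.common.TopicPartition;

public class SimpleRoundRobinAssignor implements PartitionAssignor {
    private final List<String> subscribedTopics;

    public SimpleRoundRobinAssignor(List<String> subscribedTopics) {
        this.subscribedTopics = subscribedTopics;
    }

    @Override
    public List<String> protocolTypes() {
        return Collections.singletonList("roundrobin");
    }

    @Override
    public byte[] protocolMetadata() {
        // Per point 2.2, this would really be a JSON subscription set; a bare
        // comma-separated topic list keeps the sketch short.
        return String.join(",", subscribedTopics).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public List<TopicPartition> assignPartitions(String protocolType, byte[] responseProtocolMetadata) {
        // Per point 2.4, the coordinator already computed the assignment for
        // "roundrobin"; the client only parses it back. Here we assume a
        // comma-separated list of "topic:partition" pairs.
        List<TopicPartition> assigned = new ArrayList<>();
        String response = new String(responseProtocolMetadata, StandardCharsets.UTF_8);
        if (response.isEmpty())
            return assigned;
        for (String entry : response.split(",")) {
            String[] parts = entry.split(":");
            assigned.add(new TopicPartition(parts[0], Integer.parseInt(parts[1])));
        }
        return assigned;
    }
}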
On Tue, Aug 18, 2015 at 2:51 PM, Jason Gustafson wrote:

Follow-up from the KIP call:

1. Onur brought up the question of whether this protocol provides enough coordination capabilities to be generally useful in practice (is that accurate, Onur?). If it doesn't, then each use case would probably need a dependence on zookeeper anyway, and we haven't really gained anything. The group membership provided by this protocol is a useful primitive for coordination, but it's limited in the sense that everything shared among the group has to be communicated at the time the group is created. If any shared data changes, then the only way the group can ensure agreement is to force a rebalance. This is expensive since all members must stall while the rebalancing takes place. As we have also seen, there is a practical limit on the amount of metadata that can be sent through this protocol when groups get a little larger. This protocol is therefore not suitable for cases which require frequent communication or which require a large amount of data to be communicated. For the use cases listed on the wiki, neither of these appears to be an issue, but there may be other limitations which would limit reuse of the protocol. Perhaps it would be sufficient to sketch how these cases might work?

2. We talked a little bit about the issue of metadata churn. Becket brought up the interesting point that not only do we depend on topic metadata changing relatively infrequently, but we also expect timely agreement among the brokers on what that metadata is. To resolve this, we can have the consumers fetch metadata from the coordinator. We still depend on topic metadata not changing frequently, but this should resolve any disagreement among the brokers themselves. In fact, since we expect that disagreement is relatively rare, we can have the consumers fetch from the coordinator only when a disagreement occurs (a sketch of this fallback appears further below). The nice thing about this proposal is that it doesn't affect the join-group semantics, so the coordinator would remain oblivious to the metadata used by the group for agreement. Also, if metadata churn becomes an issue, it might be possible to have the coordinator provide a snapshot for the group to ensure that a generation would be able to reach agreement (this would probably require adding groupId/generation to the metadata request).
3. We talked briefly about support for multiple protocols in the join group request in order to allow changing the assignment strategy without downtime. I think it's a little doubtful that this would get much use in practice, but I agree it's a nice option to have on the table. An alternative, for the sake of argument, is to have each member provide only one version of the protocol, and to let the coordinator choose the protocol with the largest number of supporters. All members which can't support the selected protocol would be kicked out of the group. The drawback in a rolling upgrade is that the total capacity of the group would be momentarily halved. It would also be a little tricky to handle the case of retrying when a consumer is kicked out of the group. We wouldn't want it to be able to effect a rebalance, for example, if it would just be kicked out again. That would probably complicate the group management logic on the coordinator.

Thanks,
Jason

On Tue, Aug 18, 2015 at 11:16 AM, Jiangjie Qin wrote:

Jun,

Yes, I agree. If the metadata can be synced quickly there should not be an issue. It just occurred to me that there is a proposal to allow consuming from followers in ISR; that could potentially cause more frequent metadata changes for consumers. Would that be an issue?

Thanks,

Jiangjie (Becket) Qin
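A minimal sketch of the fallback Jason describes in point 2 above: members use their cached metadata in the common case and go to the coordinator only when the group fails to agree. All method names here are hypothetical stand-ins for client internals:

import java.util.Arrays;
import java.util.List;

// Use cached metadata when the group already agrees on it; otherwise fall
// back to an authoritative fetch from the coordinator so every member
// converges on the same snapshot for the generation.
abstract class MetadataAgreement {
    abstract byte[] localMetadataHash(List<String> topics);       // from cached metadata
    abstract byte[] groupAgreedHash();                            // learned via join-group
    abstract byte[] coordinatorMetadataHash(List<String> topics); // authoritative fetch

    byte[] ensureAgreedMetadata(List<String> topics) {
        byte[] local = localMetadataHash(topics);
        if (Arrays.equals(local, groupAgreedHash()))
            return local; // common case: brokers agree, no extra round trip
        // rare case: disagreement; converge on the coordinator's snapshot
        return coordinatorMetadataHash(topics);
    }
}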
On Tue, Aug 18, 2015 at 10:22 AM, Jason Gustafson <ja...@confluent.io> wrote:

Hi Jun,

Answers below:

> 1. When there are multiple common protocols in the JoinGroupRequest, which
> one would the coordinator pick?

I was intending to use the list to indicate preference. If all group members support protocols ["A", "B"] in that order, then we will choose "A." If some support ["B", "A"], then we would either choose based on respective counts or just randomly. (A rough sketch of this selection appears below.) The main use case of supporting the list is for rolling upgrades when a change is made to the assignment strategy. In that case, the new assignment strategy would be listed first in the upgraded client. I think it's debatable whether this feature would get much use in practice, so we might consider dropping it.

> 2. If the protocols don't agree, the group construction fails. What exactly
> does it mean? Do we send an error in every JoinGroupResponse and remove all
> members in the group in the coordinator?

Yes, that is right. It would be handled similarly to inconsistent assignment strategies in the current protocol. The coordinator returns an error in each join group response, and the client propagates the error to the user.

> 3. Consumer embedded protocol: The proposal has two different formats of
> subscription depending on whether wildcards are used or not. This seems a
> bit complicated. Would it be better to always use the metadata hash? The
> clients know the subscribed topics already. This way, the client code
> behaves the same whether wildcards are used or not.

Yeah, I think this is possible (Neha also suggested it). I haven't updated the wiki yet, but the patch I started working on uses only the metadata hash. In the case that an explicit topic list is provided, the hash just covers the metadata for those topics.

Thanks,
Jason
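A rough sketch of the coordinator-side selection described in answer 1: keep only the protocols supported by every member, then score the candidates by how highly members rank them. The scoring rule is one plausible reading of "choose based on respective counts", not a settled design:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProtocolSelector {
    // Each inner list is one member's protocols in preference order.
    // Returns null when no common protocol exists (group construction fails
    // and every member gets an error in its JoinGroupResponse).
    public static String select(List<List<String>> memberPreferences) {
        if (memberPreferences.isEmpty())
            return null;
        List<String> candidates = new ArrayList<>(memberPreferences.get(0));
        for (List<String> prefs : memberPreferences)
            candidates.retainAll(prefs); // drop protocols not supported by all
        if (candidates.isEmpty())
            return null;
        // Score each protocol: earlier positions in a member's list score higher.
        Map<String, Integer> score = new HashMap<>();
        for (List<String> prefs : memberPreferences)
            for (int i = 0; i < prefs.size(); i++)
                score.merge(prefs.get(i), prefs.size() - i, Integer::sum);
        return candidates.stream()
            .max((a, b) -> Integer.compare(score.get(a), score.get(b)))
            .get();
    }
}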
On Tue, Aug 18, 2015 at 10:06 AM, Jun Rao wrote:

Jason,

Thanks for the writeup. A few comments below.

1. When there are multiple common protocols in the JoinGroupRequest, which one would the coordinator pick?

2. If the protocols don't agree, the group construction fails. What exactly does it mean? Do we send an error in every JoinGroupResponse and remove all members in the group in the coordinator?

3. Consumer embedded protocol: The proposal has two different formats of subscription depending on whether wildcards are used or not. This seems a bit complicated. Would it be better to always use the metadata hash? The clients know the subscribed topics already. This way, the client code behaves the same whether wildcards are used or not.

Jiangjie,

With respect to rebalance churn due to topics being created/deleted: with the new consumer, the rebalance can probably settle within 200ms when there is a topic change. So, as long as we are not changing topics more than 5 times per second, there shouldn't be constant churn, right?

Thanks,

Jun
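The "metadata hash" discussed in point 3 above could be computed over a canonical, sorted encoding of (topic, partition count) pairs, so that any two members with the same view produce the same digest regardless of how they arrived at their subscription. A minimal sketch; the encoding and the choice of MD5 are assumptions for illustration only:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

public class SubscriptionMetadataHash {
    public static byte[] hash(Map<String, Integer> partitionCounts) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            // TreeMap sorts by topic name so the encoding is canonical.
            for (Map.Entry<String, Integer> e : new TreeMap<>(partitionCounts).entrySet()) {
                digest.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                digest.update(ByteBuffer.allocate(4).putInt(e.getValue()).array());
            }
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }
}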
On Tue, Aug 11, 2015 at 1:19 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hi Kafka Devs,

One of the nagging issues in the current design of the new consumer has been the need to support a variety of assignment strategies. We've encountered this in particular in the design of copycat and the processing framework (KIP-28). From what I understand, Samza also has a number of use cases with custom assignment needs. The new consumer protocol supports new assignment strategies by hooking them into the broker. For many environments, this is a major pain and in some cases, a non-starter. It also challenges the validation that the coordinator can provide. For example, some assignment strategies call for partitions to be assigned multiple times, which means that the coordinator can only check that partitions have been assigned at least once.

To solve these issues, we'd like to propose moving assignment to the client. I've written a wiki which outlines some protocol changes to achieve this:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Client-side+Assignment+Proposal

To summarize briefly, instead of the coordinator assigning the partitions itself, all subscriptions are forwarded to each member of the group which then decides independently which partitions it should consume.
The protocol provides a mechanism for the coordinator to validate that all consumers use the same assignment strategy, but it does not ensure that the resulting assignment is "correct." This provides a powerful capability for users to control the full data flow on the client side. They control how data is written to partitions through the Partitioner interface and they control how data is consumed through the assignment strategy, all without touching the server.

Of course nothing comes for free. In particular, this change removes the ability of the coordinator to validate that commits are made by consumers who were assigned the respective partition. This might not be too bad since we retain the ability to validate the generation id, but it is a potential concern. We have considered alternative protocols which add a second round-trip to the protocol in order to give the coordinator the ability to confirm the assignment. As mentioned above, the coordinator is somewhat limited in what it can actually validate, but this would return its ability to validate commits. The tradeoff is that it increases the protocol's complexity which means more ways for the protocol to fail and consequently more edge cases in the code.
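A minimal sketch of the generation check mentioned above, which the coordinator retains even with client-side assignment: it can reject an offset commit carrying a stale generation id or an unknown member id; what it can no longer verify is that the member actually owned the partition it is committing for. The names are illustrative:

import java.util.Set;

public class GenerationCheck {
    // A commit from an older generation raced with a rebalance; accepting it
    // could let a "zombie" consumer overwrite offsets for partitions it no
    // longer owns, so the coordinator rejects it.
    public static boolean isCommitValid(int currentGeneration, Set<String> currentMembers,
                                        int commitGeneration, String memberId) {
        return commitGeneration == currentGeneration && currentMembers.contains(memberId);
    }
}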
That alternative protocol also misses an opportunity to generalize the group membership protocol for additional use cases. In fact, after you've gone to the trouble of moving assignment to the client, the main thing that is left in this protocol is basically a general group management capability. This is exactly what is needed for a few cases that are currently under discussion (e.g., copycat or the single-writer producer). We've taken this a step further in the proposal and attempted to envision what that general protocol might look like and how it could be used both by the consumer and for some of these other cases.

Anyway, since time is running out on the new consumer, we have perhaps one last chance to consider a significant change in the protocol like this, so have a look at the wiki and share your thoughts. I've no doubt that some ideas seem clearer in my mind than they do on paper, so ask questions if there is any confusion.

Thanks!
Jason