Thanks Jun! The time works for me.
On Thu, 5 Apr 2018 at 4:34 AM Jun Rao <j...@confluent.io> wrote:

> Hi, Jan, Dong, John, Guozhang,

> Perhaps it will be useful to have a KIP meeting to discuss this together as a group. Would Apr. 9 (Monday) at 9:00am PDT work? If so, I will send out an invite to the mailing list.

> Thanks,

> Jun

> On Wed, Apr 4, 2018 at 1:25 AM, Jan Filipiak <jan.filip...@trivago.com> wrote:

>> I want to quickly step in here again because the discussion is going places again.

>> The last part of the discussion is just a pain to read and has completely diverged from what I suggested, without the reasons being made clear to me. I don't know why this happens.... here are my comments anyway.

>> @Guozhang: That Streams is working on automatically creating co-partition-usable topics is great for Streams, but it has literally nothing to do with the KIP, because we want to grow the input topic. Everyone can reshuffle relatively easily, but that is not what we need to do; we need to grow the topic in question. After Streams has automatically reshuffled, the input topic still has the same size and it didn't help a bit. I fail to see why this is relevant. What am I missing here?

>> @Dong: I am still of the position that the current proposal takes us in the wrong direction, especially by introducing PartitionKeyRebalanceListener. From this point we can never move to proper stateful handling without completely deprecating this creature from hell again. Linear hashing is not the optimising step we have to do here. An interface under which a topic is always the same topic, even after it has grown or shrunk, is what matters. So from my POV I have major concerns that this KIP is beneficial in its current state.

>> What is it that makes everyone so addicted to the idea of linear hashing? It is not attractive at all to me. And with stateful consumers it is still a complete mess. Why not stick with the Kappa architecture???

>> On 03.04.2018 17:38, Dong Lin wrote:

>>> Hey John,

>>> Thanks much for your comments!!

>>> I have yet to go through the emails of John/Jun/Guozhang in detail. But let me present my idea for how to minimize the delay of state loading for the stream use-case.

>>> For ease of understanding, let's assume that the initial partition number of the input topics and the change log topic are both 10, and the initial number of stream processors is also 10. If we only increase the partition number of the input topics to 15 without changing the number of stream processors, the current KIP already guarantees in-order delivery, and no state needs to be moved between consumers for the stream use-case. Next, let's say we want to increase the number of processors to expand the processing capacity for the stream use-case. This requires us to move state between processors, which will take time. Our goal is to minimize the impact (i.e. delay) on processing while we increase the number of processors.

>>> Note that a stream processor generally includes both a consumer and a producer. In addition to consuming from the input topic, the consumer may also need to consume from the change log topic on startup for recovery, and the producer may produce state to the change log topic.

>>> The solution will include the following steps:

>>> 1) Increase the partition number of the input topic from 10 to 15. Since messages with the same key will still go to the same consumer before and after the partition expansion, this step can be done without having to move state between processors.
>>> 2) Increase the partition number of the change log topic from 10 to 15. Note that this step can also be done without impacting the existing workflow. After we increase the partition number of the change log topic, the key space may split and some keys will be produced to the newly-added partitions. But the same key will still go to the same processor (i.e. consumer) before and after the expansion. Thus this step can also be done without having to move state between processors.

>>> 3) Now, let's add 5 new consumers whose groupId is different from the existing processors' groupId, so these new consumers will not impact the existing workflow. Each of these new consumers should consume two partitions from the earliest offset, where these two partitions are the same partitions that would be consumed if the consumers had the same groupId as the existing processors. For example, the first of the five consumers will consume partition 0 and partition 10. The purpose of these consumers is to rebuild the state (e.g. RocksDB) for the processors in advance. Also note that, by design of the current KIP, each consumer will consume the existing partitions of the change log topic up to the offset before the partition expansion. Then they will only need to consume the state in the new partitions of the change log topic.

>>> 4) After the consumers have caught up in step 3), we should stop these consumers and add 5 new processors to the stream processing job. These 5 new processors should run in the same location as the previous 5 consumers to re-use the state (e.g. RocksDB). And these processors' consumers should consume partitions of the change log topic from the committed offsets of the previous 5 consumers so that no state is missed.

>>> One important trick to note here is that the mapping from partition to consumer should also use linear hashing, and we need to remember the initial number of processors in the job, 10 in this example, and use this number in the linear hashing algorithm. This is pretty much the same as how we use linear hashing to map keys to partitions. In this case, we get an identity map from partition -> processor, for both the input topic and the change log topic. For example, processor 12 will consume partition 12 of the input topic and produce state to partition 12 of the change log topic.

>>> There are a few important properties of this solution to note:

>>> - We can increase the number of partitions of the input topic and the change log topic in any order, asynchronously.
>>> - The expansion of the processors of a given job in step 4) only requires step 3) for the same job. It does not require coordination across different jobs for steps 3) and 4). Thus different jobs can independently expand their capacity without waiting for each other.
>>> - The logic for 1) and 2) is already supported in the current KIP. The logic for 3) and 4) appears to be independent of the core Kafka logic and can be implemented separately outside core Kafka. Thus the current KIP is probably sufficient after we agree on the efficiency and the correctness of the solution. We can have a separate KIP for Kafka Streams to support 3) and 4).

>>> Cheers,
>>> Dong
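To make the linear hashing that Dong's steps lean on concrete, here is a minimal sketch of the lookup. The helper name, the use of Kafka's Utils.murmur2, and the initialPartitions parameter are illustrative assumptions, not part of KIP-253 itself; applying the same function to processor assignment (with the initial processor count as the "initial" value) yields the identity partition-to-processor map Dong describes.

```java
import org.apache.kafka.common.utils.Utils;

public final class LinearHashing {

    // Illustrative only: map a key to a partition under linear hashing, given the
    // partition count the topic started with and its current (expanded) count.
    public static int linearHashPartition(byte[] keyBytes,
                                          int initialPartitions,
                                          int currentPartitions) {
        int hash = Utils.toPositive(Utils.murmur2(keyBytes));

        // Find the current doubling round: low <= currentPartitions < 2 * low.
        int low = initialPartitions;
        while (low * 2 <= currentPartitions) {
            low *= 2;
        }

        // Hash into the expanded space first; if that partition does not exist yet
        // in this round, fall back to the not-yet-split partition.
        int candidate = hash % (low * 2);
        return candidate < currentPartitions ? candidate : hash % low;
    }
}
```

With initialPartitions = 10 and currentPartitions = 15, the key space of partition 0 is split across partitions 0 and 10, which is why the first warm-up consumer in step 3) reads exactly those two partitions, while keys in partitions 5-9 do not move at all.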
>>> On Mon, Apr 2, 2018 at 3:25 PM, Guozhang Wang <wangg...@gmail.com> wrote:

>>>> Hey guys, just sharing my two cents here (I promise it will be shorter than John's article :).

>>>> 0. Just to quickly recap, the main discussion point now is how to support "key partitioning preservation" (John's #4 in topic characteristics above) beyond the "single-key ordering preservation" that KIP-253 was originally proposed to maintain (John's #6 above).

>>>> 1. In the Streams project, we are actively working on improving the elastic scalability of the library. One of the key features is to decouple the input topics from the parallelism model of Streams: i.e. not enforcing the topic to be partitioned by the key, not enforcing topics that are joined to be co-partitioned, and not tying the number of parallel tasks to the number of input topic partitions. This can be achieved by re-shuffling the input topics to ensure key-partitioning / co-partitioning on the internal topics. Note that the re-shuffling task is purely stateless and hence does not require "key partitioning preservation" (see the sketch right after this message). Operationally it is similar to the "create a new topic with the new number of partitions, pipe the data to the new topic and cut consumers over from the old topic" idea, except that users can optionally let Streams handle it rather than doing it manually themselves. There are a few more details in that regard, but I will skip them since they are not directly related to this discussion.

>>>> 2. Assuming that 1) above is done, the only topics involved in scaling events are input topics. For these topics, the only producers / consumers are controlled by Streams clients themselves, and hence achieving "key partitioning preservation" is simpler than in non-Streams scenarios: consumers know the partitioning scheme that producers are using, so for their stateful operations it is doable to split the local state stores accordingly or execute backfilling on their own. Of course, if we decide to do server-side backfilling, it can still help Streams to rely directly on that functionality.

>>>> 3. As John mentioned, another option inside Streams is to over-partition all internal topics; then, with 1), Streams would not rely on KIP-253 at all. But personally I'd like to avoid that if possible, to reduce the Kafka-side footprint: say we over-partition each input topic up to 1k partitions; with a reasonably sized stateful topology this can still contribute tens of thousands of partitions to the topic-partition capacity of a single cluster.

>>>> 4. Summing up 1/2/3, I think we should focus more on non-Streams users writing their stateful computations with local state, and think about whether / how we could enable "key partitioning preservation" for them easily, rather than think heavily about the Streams library. People may have different opinions on how common such a usage pattern is (I think Jun might be suggesting that for DIY apps people are more likely to use remote state, so it is not a problem for them). My opinion is that for non-Streams users such a usage pattern could still be common (think: if you are piping data from Kafka to an external data store which has single-writer requirements for each shard, then even though it is not a stateful computational application, it may still require "key partitioning preservation"), so I prefer to have backfilling in our KIP rather than only exposing the API for expansion and requiring consumers to have pre-knowledge of the producer's partitioning scheme.

>>>> Guozhang
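For readers who want a concrete picture of the stateless re-shuffle in Guozhang's point 1, the following is a rough sketch and not the Streams-internal mechanism; the topic names, group id, and serializers are made-up placeholders. The copier keeps no local state, so it does not care how the source topic is partitioned; the producer's default partitioner re-partitions the internal topic by key.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StatelessReshuffle {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "reshuffle-copier");   // hypothetical group
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("source-topic"));   // hypothetical name
            while (true) {
                for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofMillis(500))) {
                    // No local state is read or written here, so "key partitioning preservation"
                    // is not needed for this step; the internal topic ends up keyed.
                    producer.send(new ProducerRecord<>("internal-keyed-topic", rec.key(), rec.value()));
                }
            }
        }
    }
}
```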
>>>> On Thu, Mar 29, 2018 at 2:33 PM, John Roesler <j...@confluent.io> wrote:

>>>>> Hey Dong,

>>>>> Congrats on becoming a committer!!!

>>>>> Since I just sent a novel-length email, I'll try and keep this one brief ;)

>>>>> Regarding producer coordination, I'll grant that in that case, producers may coordinate among themselves to produce into the same topic or to produce co-partitioned topics. Nothing in KStreams or the Kafka ecosystem in general requires such coordination for correctness, or in fact for any optional features, though, so I would not say that we require producer coordination of partition logic. If producers currently coordinate, it's completely optional and their own choice.

>>>>> Regarding the portability of partition algorithms, my observation is that systems requiring independent implementations of the same algorithm with 100% correctness are a large source of risk and also a burden on those who have to maintain them. If people could flawlessly implement algorithms in actual software, the world would be a wonderful place indeed! For a system as important and widespread as Kafka, I would recommend limiting such requirements as aggressively as possible.

>>>>> I'd agree that we can always revisit decisions like allowing arbitrary partition functions, but of course, we shouldn't do that in a vacuum. That feels like the kind of thing we'd need to proactively seek guidance from the users list about. I do think that the general approach of saying "if you use a custom partitioner, you cannot do partition expansion" is very reasonable (but I don't think we need to go that far with the current proposal). It's similar to my statement in my email to Jun that in principle KStreams doesn't *need* backfill; we only need it if we want to employ partition expansion.

>>>>> I reckon that the main motivation for backfill is to support KStreams use cases and also any other use cases involving stateful consumers.

>>>>> Thanks for your response, and congrats again!
>>>>> -John
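For reference in this portability discussion, the "formal spec" for the default case is small. For keyed records, the Java client's default partitioner boils down to roughly the sketch below, and a producer in any other language that wants to stay co-partitioned with it has to reproduce murmur2 bit for bit, which is exactly John's cross-language concern.

```java
import org.apache.kafka.common.utils.Utils;

// Sketch of the default keyed-record partition assignment used by the Java producer.
static int defaultPartition(byte[] keyBytes, int numPartitions) {
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}
```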
>>>>> On Wed, Mar 28, 2018 at 1:34 AM, Dong Lin <lindon...@gmail.com> wrote:

>>>>>> Hey John,

>>>>>> Great! Thanks for all the comments. It seems that we agree that the current KIP is in good shape for core Kafka. IMO, what we have been discussing in the recent email exchanges is mostly about the second step, i.e. how to address the problem for the stream use-case (or stateful processing in general).

>>>>>> I will comment inline.

>>>>>> On Tue, Mar 27, 2018 at 4:38 PM, John Roesler <j...@confluent.io> wrote:

>>>>>>> Thanks for the response, Dong.

>>>>>>> Here are my answers to your questions:

>>>>>>>> - "Asking producers and consumers, or even two different producers, to share code like the partition function is a pretty huge ask. What if they are using different languages?". It seems that today we already require different producers to use the same hash function -- otherwise messages with the same key will go to different partitions of the same topic, which may cause problems for downstream consumption. So I am not sure it adds any more constraint to assume consumers know the hash function of the producer. Could you explain more why a user would want to use a custom partition function? Maybe we can check whether this is something that can be supported in the default Kafka hash function. Also, can you explain more why it is difficult to implement the same hash function in different languages?

>>>>>>> Sorry, I meant two different producers as in producers to two different topics. This was in response to the suggestion that we already require coordination among producers to different topics in order to achieve co-partitioning. I was saying that we do not (and should not).

>>>>>> It is probably common for producers owned by different teams to produce messages to the same topic. In order to ensure that messages with the same key go to the same partition, we need producers of different teams to share the same partition algorithm, which by definition requires coordination among producers of different teams in an organization. Even for producers of different topics, it may be common to require producers to use the same partition algorithm in order to join two topics for stream processing. Does this make it reasonable to say we already require coordination across producers?
>>>>>>> By design, consumers are currently ignorant of the partitioning scheme. It suffices to trust that the producer has partitioned the topic by key, if they claim to have done so. If you don't trust that, or even if you just need some other partitioning scheme, then you must re-partition it yourself. Nothing we're discussing can or should change that. The value of backfill is that it preserves the ability for consumers to avoid re-partitioning before consuming, in the cases where they don't need to today.

>>>>>>> Regarding shared "hash functions", note that it's a bit inaccurate to talk about the "hash function" of the producer. Properly speaking, the producer has only a "partition function". We do not know that it is a hash. The producer can use any method at its disposal to assign a partition to a record. The partition function obviously may be written in any programming language, so in general it's not something that can be shared around without a formal spec or the ability to execute arbitrary executables in arbitrary runtime environments.

>>>>>> Yeah, it is probably better to say partition algorithm. I guess it should not be difficult to implement the same partition algorithm in different languages, right? Yes, we would need a formal specification of the default partition algorithm in the producer. I think that can be documented as part of the producer interface.

>>>>>>> Why would a producer want a custom partition function? I don't know... why did we design the interface so that our users can provide one? In general, such systems provide custom partitioners because some data sets may be unbalanced under the default, or because they can provide some interesting functionality built on top of the partitioning scheme, etc. Having provided this ability, I don't know why we would remove it.

>>>>>> Yeah, it is reasonable to assume that there was a reason to support a custom partition function in the producer. On the other hand, it may also be reasonable to revisit this interface and discuss whether we actually need to support custom partition functions. If we don't have a good reason, we could choose not to support custom partition functions in this KIP in a backward compatible manner, i.e. users can still use a custom partition function, but they would not get the benefit of in-order delivery when there is partition expansion. What do you think?
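To make the object of this back-and-forth concrete, the sketch below shows the kind of thing the producer's pluggable partitioner allows today, registered through the partitioner.class producer config. The tenant-routing logic is an invented example, not anything from the KIP.

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Invented example of a custom partition function: isolate one known-hot key on a
// dedicated partition and hash everything else across the remaining partitions.
public class TenantAwarePartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if ("hot-tenant".equals(key)) {          // hypothetical hot key
            return numPartitions - 1;
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

Nothing about such a partitioner is discoverable from the consumer side, which is why the thread keeps circling between "restrict producers to well-known partition functions" and "backfill so that consumers never need to know".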
>>>>>>>> - Besides the assumption that the consumer needs to share the hash function of the producer, is there any other organizational overhead in the proposal in the current KIP?

>>>>>>> It wasn't clear to me that KIP-253 currently required the producer and consumer to share the partition function, or in fact that it had a hard requirement to abandon the general partition function and use a linear hash function instead.

>>>>>>> In my reading, there is a requirement to track the metadata about which partitions split into which other partitions during an expansion operation. If the partition function is linear, this is easy. If not, you can always just record that all old partitions split into all new partitions. This has the effect of forcing all consumers to wait until the old epoch is completely consumed before starting on the new epoch. But this may be a reasonable tradeoff, and it doesn't otherwise alter your design.

>>>>>>> You only mention the consumer needing to know that the partition function is linear, not what the actual function is, so I don't think your design actually calls for sharing the function. Plus, really all the consumer needs is the metadata about which old-epoch partitions to wait for before consuming a new-epoch partition. This information is directly captured in metadata, so I don't think it actually even cares whether the partition function is linear or not.

>>>>>> You are right that the current KIP does not mention it. My comment about partition function coordination was about supporting the stream use-case, which we have been discussing so far.

>>>>>>> So, no, I really think KIP-253 is in good shape. I was really more talking about the part of this thread that's outside of KIP-253's scope, namely, creating the possibility of backfilling partitions after expansion.

>>>>>> Great! Can you also confirm that the main motivation for backfilling partitions after expansion is to support the stream use-case?

>>>>>>>> - Currently the producer can forget about a message once it has been acknowledged by the broker. Thus the producer probably does not know most of the existing messages in the topic, including those messages produced by other producers. We can have the owner of the producer do the split+backfill. In my opinion it will be a new program that wraps around the existing producer and consumer classes.

>>>>>>> This sounds fine by me!

>>>>>>> Really, I was just emphasizing that the part of the organization that produces a topic shouldn't have to export their partition function to the part(s) of the organization (or other organizations) that consume the topic. Whether the backfill operation goes into the Producer interface is secondary, I think.
>>>>>>>> - Regarding point 5. The argument is in favor of split+backfill, but for the changelog topic, and it intends to address the problem for the stream use-case in general. In this KIP we will provide an interface (i.e. PartitionKeyRebalanceListener in the KIP) to be used by the stream use-case, and the goal is that the user can flush/re-consume the state as part of the interface implementation, regardless of whether there is a change log topic. Maybe you are suggesting that the main reason to do split+backfill of the input topic is to support log compacted topics? You mentioned in Point 1 that log compacted topics are out of the scope of this KIP. Maybe I could understand your position better. Regarding Jan's proposal to split partitions with backfill, do you think it should replace the proposal in the existing KIP, or do you think it is something that we should do in addition to the existing KIP?

>>>>>>> I think that interface is a good/necessary component of KIP-253. I personally (FWIW) feel that KIP-253 is appropriately scoped, but I do think its utility will be limited unless there is a later KIP offering backfill. But, maybe unlike Jan, I think it makes sense to try and tackle the ordering problem independently of backfill, so I'm in support of the current KIP.
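The thread does not show PartitionKeyRebalanceListener's actual signature, so as a stand-in, here is the flush/re-consume pattern Dong describes expressed against the existing ConsumerRebalanceListener; the proposed KIP-253 callback would presumably look similar but fire when key ranges migrate during partition expansion rather than on an ordinary partition reassignment. LocalStateStore here is a hypothetical wrapper around e.g. RocksDB plus a changelog topic.

```java
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

public class FlushingStateListener implements ConsumerRebalanceListener {

    // Hypothetical local store interface; not part of Kafka or the KIP.
    interface LocalStateStore {
        void flushPartition(TopicPartition tp);
        void restorePartition(TopicPartition tp);
    }

    private final LocalStateStore store;

    public FlushingStateListener(LocalStateStore store) {
        this.store = store;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Flush state for keys we are about to stop owning before anyone else consumes them.
        partitions.forEach(store::flushPartition);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Reload or re-consume state for the keys we now own.
        partitions.forEach(store::restorePartition);
    }
}
```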
>>>>>>>> - Regarding point 6. I guess we can agree that it is better not to have the performance overhead of copying the input data. Before we discuss more whether the performance overhead is acceptable or not, I am trying to figure out what the benefit of introducing this overhead is. You mentioned that the benefit is the loose organizational coupling. By "organizational coupling", are you referring to the requirement that the consumer needs to know the hash function of the producer? If so, maybe we can discuss the use-cases for custom partition functions and see whether we can find a way to support such use-cases without having to copy the input data.

>>>>>>> I'm not too sure about what an "input" is in this sense, since we are just talking about topics. Actually the point I was making there is that AFAICT the performance overhead of a backfill is less than any other option, assuming you split partitions rarely.

>>>>>> By "input" I was referring to the source Kafka topic of a stream processing job.

>>>>>>> Separately, yes, "organizational coupling" increases if producers and consumers have to share code, such as the partition function. This would not be the case if producers could only pick from a menu of a few well-known partition functions, but I think this is a poor tradeoff.

>>>>>> Maybe we can revisit the custom partition function and see whether we actually need it? Otherwise, I am concerned that every user will pay the overhead of data movement to support something that was not really needed by most users.

>>>>>>> To me, this is two strong arguments in favor of backfill being less expensive than no backfill, but again, I think that particular debate comes after KIP-253, so I don't want to create the impression of opposition to your proposal.

>>>>>>> Finally, to respond to a new email I just noticed:

>>>>>>>> BTW, here is my understanding of the scope of this KIP. We want to allow consumers to always consume messages with the same key from the same producer in the order they are produced. And we need to provide a way for the stream use-case to be able to flush/load state when messages with the same key are migrated between consumers. In addition to ensuring that this goal is correctly supported, we should do our best to keep the performance and organizational overhead of this KIP as low as possible.

>>>>>>> I think we're on the same page there! In fact, I would generalize a little more and say that the mechanism you've designed provides *all consumers* the ability "to flush/load state when messages with the same key are migrated between consumers", not just Streams.

>>>>>> Thanks for all the comments!

>>>>>>> Thanks for the discussion,
>>>>>>> -John

>>>>>>> On Tue, Mar 27, 2018 at 3:14 PM, Dong Lin <lindon...@gmail.com> wrote:

>>>>>>>> Hey John,

>>>>>>>> Thanks much for the detailed comments. Here are my thoughts:

>>>>>>>> - The need to delete messages from log compacted topics is mainly for performance (e.g. storage space) optimization rather than for the correctness of this KIP. I agree that we probably don't need to focus on this in our discussion since it is mostly a performance optimization.

>>>>>>>> [...]

>>>>>>>> Thanks,
>>>>>>>> Dong
>>>>>>>> On Tue, Mar 27, 2018 at 11:34 AM, John Roesler <j...@confluent.io> wrote:

>>>>>>>>> Hey Dong and Jun,

>>>>>>>>> Thanks for the thoughtful responses. If you don't mind, I'll mix my replies together to try for a coherent response. I'm not too familiar with mailing-list etiquette, though.

>>>>>>>>> I'm going to keep numbering my points because it makes it easy for you all to respond.

>>>>>>>>> Point 1:
>>>>>>>>> As I read it, KIP-253 is *just* about properly fencing the producers and consumers so that you preserve the correct ordering of records during partition expansion. This is clearly necessary regardless of anything else we discuss. I think this whole discussion about backfill, consumers, streams, etc., is beyond the scope of KIP-253. But it would be cumbersome to start a new thread at this point.

>>>>>>>>> I had missed KIP-253's Proposed Change #9 among all the details... I think this is a nice addition to the proposal. One thought is that it's actually irrelevant whether the hash function is linear. This is simply an algorithm for moving a key from one partition to another, so the type of hash function need not be a precondition. In fact, it also doesn't matter whether the topic is compacted or not; the algorithm works regardless.
>>>>>>>>> I think this is a good algorithm to keep in mind, as it might solve a variety of problems, but it does have a downside: the producer won't know whether or not K1 was actually in P1, it just knows that K1 was in P1's keyspace before the new epoch. Therefore, it will have to pessimistically send (K1,null) to P1 just in case. But the next time K1 comes along, the producer *also* won't remember that it already retracted K1 from P1, so it will have to send (K1,null) *again*. By extension, every time the producer sends to P2, it will also have to send a tombstone to P1, which is a pretty big burden. To make the situation worse, if there is a second split, say P2 becomes P2 and P3, then any key Kx belonging to P3 will also have to be retracted from P2 *and* P1, since the producer can't know whether Kx was last written to P2 or P1. Over a long period of time, this clearly becomes an issue, as the producer must send an arbitrary number of retractions along with every update.

>>>>>>>>> In contrast, the proposed backfill operation has an end, and after it ends, everyone can afford to forget that there ever was a different partition layout.

>>>>>>>>> Really, though, figuring out how to split compacted topics is beyond the scope of KIP-253, so I'm not sure #9 really even needs to be in this KIP... We do need in-order delivery during partition expansion. It would be fine by me to say that you *cannot* expand partitions of a log-compacted topic and call it a day. I think it would be better to tackle that in another KIP.
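To see why these retractions pile up, consider what a producer following Proposed Change #9 would have to compute on every send if all it knows is the history of partition counts. Plain modulo hashing is assumed purely to keep the sketch short; the point is that the set can grow with every split and never shrinks, whereas a backfill is a one-time pass after which the history can be forgotten, which is the contrast John draws above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.kafka.common.utils.Utils;

public final class RetractionSketch {

    // Illustrative only: the partitions that must receive a (key, null) tombstone
    // alongside every write of `key`, given the topic's history of partition counts.
    public static Set<Integer> retractionTargets(byte[] keyBytes, List<Integer> partitionCountHistory) {
        int hash = Utils.toPositive(Utils.murmur2(keyBytes));
        int current = hash % partitionCountHistory.get(partitionCountHistory.size() - 1);

        Set<Integer> targets = new HashSet<>();
        for (int i = 0; i < partitionCountHistory.size() - 1; i++) {
            // The producer cannot know in which epoch the key was last written, so it
            // must pessimistically retract from the key's home under every earlier epoch.
            int old = hash % partitionCountHistory.get(i);
            if (old != current) {
                targets.add(old);
            }
        }
        return targets;
    }
}
```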
>>>>>>>>> Point 2:
>>>>>>>>> Regarding whether the consumer re-shuffles its inputs, this is always on the table; any consumer who wants to re-shuffle its input is free to do so. But this is currently not required. It's just that the current high-level story with Kafka encourages the use of partitions as a unit of concurrency. As long as consumers are single-threaded, they can happily consume a single partition without concurrency control of any kind. This is a key aspect of this system that lets folks design high-throughput systems on top of it surprisingly easily. If all consumers were instead encouraged/required to implement a repartition of their own, then the consumer becomes significantly more complex, requiring either the consumer to first produce to its own intermediate repartition topic or to ensure that consumer threads have a reliable, high-bandwidth channel of communication with every other consumer thread.

>>>>>>>>> Either of those tradeoffs may be reasonable for a particular user of Kafka, but I don't know if we're in a position to say that they are reasonable for *every* user of Kafka.

>>>>>>>>> Point 3:
>>>>>>>>> Regarding Jun's point about this use case, "(3) stateful and maintaining the states in a local store", I agree that they may use a framework *like* Kafka Streams, but that is not the same as using Kafka Streams. This is why I think it's better to solve it in Core: because it is then solved for KStreams and also for everything else that facilitates local state maintenance. To me, Streams is a member of the category of "stream processing frameworks", which is itself a subcategory of "things requiring local state maintenance". I'm not sure if it makes sense to