> it seems like we do want to allow >> people to optionally specify a partition count as part of this >> operation, but we don't want that option to _force_ repartitioning
Correct, ie, that is my suggestions. > "Use P partitions if repartitioning is necessary" I disagree here, because my reasoning is that: - if a user cares about the number of partition, the user wants to enforce a repartitioning - if a user does not case about the number of partitions, we don't need to provide them a way to pass in a "hint" Hence, it should be sufficient to support: // user does not care `stream.groupByKey(Grouped)` `stream.grouBy(..., Grouped)` // user does care `stream.repartition(Repartitioned).groupByKey()` `streams.groupBy(..., Repartitioned)` -Matthias On 11/9/19 8:10 PM, John Roesler wrote: > Thanks for those thoughts, Matthias, > > I find your reasoning about the optimization behavior compelling. The > `through` operation is very simple and clear to reason about. It just > passes the data exactly at the specified point in the topology exactly > through the specified topic. Likewise, if a user invokes a > `repartition` operator, the simplest behavior is if we just do what > they asked for. > > Stepping back to think about when optimizations are surprising and > when they aren't, it occurs to me that we should be free to move > around repartitions when users have asked to perform some operation > that implies a repartition, like "change keys, then filter, then > aggregate". This program requires a repartition, but it could be > anywhere between the key change and the aggregation. On the other > hand, if they say, "change keys, then filter, then repartition, then > aggregate", it seems like they were pretty clear about their desire, > and we should just take it at face value. > > So, I'm sold on just literally doing a repartition every time they > invoke the `repartition` operator. > > > The "partition count" modifier for `groupBy`/`groupByKey` is more nuanced. > > What you said about `groupByKey` makes sense. If they key hasn't > actually changed, then we don't need to repartition before > aggregating. On the other hand, `groupBy` is specifically changing the > key as part of the grouping operation, so (as you said) we definitely > have to do a repartition. > > If I'm reading the discussion right, it seems like we do want to allow > people to optionally specify a partition count as part of this > operation, but we don't want that option to _force_ repartitioning if > it's not needed. That last clause is the key. "Use P partitions if > repartitioning is necessary" is a directive that applies cleanly and > correctly to both `groupBy` and `groupByKey`. What if we call the > option `numberOfPartitionsHint`, which along with the "if necessary" > javadoc, should make it clear that the option won't force a > repartition, and also gives us enough latitude to still employ the > optimizer on those repartition topics? > > If we like the idea of expressing it as a "hint" for grouping and a > "command" for `repartition`, then it seems like it still makes sense > to keep Grouped and Repartitioned separate, as they would actually > offer different methods with distinct semantics. > > WDYT? > > Thanks, > -John > > On Sat, Nov 9, 2019 at 8:28 PM Matthias J. Sax <matth...@confluent.io> wrote: >> >> Sorry for late reply. >> >> I guess, the question boils down to the intended semantics of >> `repartition()`. My understanding is as follows: >> >> - KS does auto-repartitioning for correctness reasons (using the >> upstream topic to determine the number of partitions) >> - KS does auto-repartitioning only for downstream DSL operators like >> `count()` (eg, a `transform()` does never trigger an auto-repartitioning >> even if the stream is marked as `repartitioningRequired`). >> - KS offers `through()` to enforce a repartitioning -- however, the user >> needs to create the topic manually (with the desired number of partitions). >> >> I see two main applications for `repartitioning()`: >> >> 1) repartition data before a `transform()` but user does not want to >> manage the topic >> 2) scale out a downstream subtopology >> >> Hence, I see `repartition()` similar to `through()`: if a users calls >> it, a repartitining is enforced, with the difference that KS manages the >> topic and the user does not need to create it. >> >> This behavior makes (1) and (2) possible. >> >>> I think many users would prefer to just say "if there *is* a repartition >>> required at this point in the topology, it should >>> have N partitions" >> >> Because of (2), I disagree. Either a user does not care about scaling >> out, for which case she would not specify the number of partitions. Or a >> user does care, and hence wants to enforce the scale out. I don't think >> that any user would say, "maybe scale out". >> >> Therefore, the optimizer should never ignore the repartition operation. >> As a "consequence" (because repartitioning is expensive) a user should >> make an explicit call to `repartition()` IMHO -- piggybacking an >> enforced repartitioning into `groupByKey()` seems to be "dangerous" >> because it might be too subtle and an "optional scaling out" as laid out >> above does not make sense IMHO. >> >> I am also not worried about "over repartitioning" because the result >> stream would never trigger auto-repartitioning. Only if multiple >> consecutive calls to `repartition()` are made it could be bad -- but >> that's the same with `through()`. In the end, there is always some >> responsibility on the user. >> >> Btw, for `.groupBy()` we know that repartitioning will be required, >> however, for `groupByKey()` it depends if the KStream is marked as >> `repartitioningRequired`. >> >> Hence, for `groupByKey()` it should not be possible for a user to set >> number of partitions IMHO. For `groupBy()` it's a different story, >> because calling >> >> `repartition().groupBy()` >> >> does not achieve what we want. Hence, allowing users to pass in the >> number of users partitions into `groupBy()` does actually makes sense, >> because repartitioning will happen anyway and thus we can piggyback a >> scaling decision. >> >> I think that John has a fair concern about the overloads, however, I am >> not convinced that using `Grouped` to specify the number of partitions >> is intuitive. I double checked `Grouped` and `Repartitioned` and both >> allow to specify a `name` and `keySerde/valueSerde`. Thus, I am >> wondering if we could bridge the gap between both, if we would make >> `Repartitioned extends Grouped`? For this case, we only need >> `groupBy(Grouped)` and a user can pass in both types what seems to make >> the API quite smooth: >> >> `stream.groupBy(..., Grouped...)` >> >> `stream.groupBy(..., Repartitioned...)` >> >> >> Thoughts? >> >> >> -Matthias >> >> >> >> On 11/7/19 10:59 AM, Levani Kokhreidze wrote: >>> Hi Sophie, >>> >>> Thank you for your reply, very insightful. Looking forward hearing others >>> opinion as well on this. >>> >>> Kind regards, >>> Levani >>> >>> >>>> On Nov 6, 2019, at 1:30 AM, Sophie Blee-Goldman <sop...@confluent.io> >>>> wrote: >>>> >>>>> Personally, I think Matthias’s concern is valid, but on the other hand >>>> Kafka Streams has already >>>>> optimizer in place which alters topology independently from user >>>> >>>> I agree (with you) and think this is a good way to put it -- we currently >>>> auto-repartition for the user so >>>> that they don't have to walk through their entire topology and reason about >>>> when and where to place a >>>> `.through` (or the new `.repartition`), so why suddenly force this onto the >>>> user? How certain are we that >>>> users will always get this right? It's easy to imagine that during >>>> development, you write your new app with >>>> correctly placed repartitions in order to use this new feature. During the >>>> course of development you end up >>>> tweaking the topology, but don't remember to review or move the >>>> repartitioning since you're used to Streams >>>> doing this for you. If you use only single-partition topics for testing, >>>> you might not even notice your app is >>>> spitting out incorrect results! >>>> >>>> Anyways, I feel pretty strongly that it would be weird to introduce a new >>>> feature and say that to use it, you can't take >>>> advantage of this other feature anymore. Also, is it possible our >>>> optimization framework could ever include an >>>> optimized repartitioning strategy that is better than what a user could >>>> achieve by manually inserting repartitions? >>>> Do we expect users to have a deep understanding of the best way to >>>> repartition their particular topology, or is it >>>> likely they will end up over-repartitioning either due to missed >>>> optimizations or unnecessary extra repartitions? >>>> I think many users would prefer to just say "if there *is* a repartition >>>> required at this point in the topology, it should >>>> have N partitions" >>>> >>>> As to the idea of adding `numberOfPartitions` to Grouped rather than >>>> adding a new parameter to groupBy, that does seem more in line with the >>>> current syntax so +1 from me >>>> >>>> On Tue, Nov 5, 2019 at 2:07 PM Levani Kokhreidze <levani.co...@gmail.com> >>>> wrote: >>>> >>>>> Hello all, >>>>> >>>>> While https://github.com/apache/kafka/pull/7170 < >>>>> https://github.com/apache/kafka/pull/7170> is under review and it’s >>>>> almost done, I want to resurrect discussion about this KIP to address >>>>> couple of concerns raised by Matthias and John. >>>>> >>>>> As a reminder, idea of the KIP-221 was to allow DSL users control over >>>>> repartitioning and parallelism of sub-topologies by: >>>>> 1) Introducing new KStream#repartition operation which is done in >>>>> https://github.com/apache/kafka/pull/7170 < >>>>> https://github.com/apache/kafka/pull/7170> >>>>> 2) Add new KStream#groupBy(Repartitioned) operation, which is planned to >>>>> be separate PR. >>>>> >>>>> While all agree about general implementation and idea behind >>>>> https://github.com/apache/kafka/pull/7170 < >>>>> https://github.com/apache/kafka/pull/7170> PR, introducing new >>>>> KStream#groupBy(Repartitioned) method overload raised some questions >>>>> during >>>>> the review. >>>>> Matthias raised concern that there can be cases when user uses >>>>> `KStream#groupBy(Repartitioned)` operation, but actual repartitioning may >>>>> not required, thus configuration passed via `Repartitioned` would never be >>>>> applied (Matthias, please correct me if I misinterpreted your comment). >>>>> So instead, if user wants to control parallelism of sub-topologies, he or >>>>> she should always use `KStream#repartition` operation before groupBy. Full >>>>> comment can be seen here: >>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125 < >>>>> https://github.com/apache/kafka/pull/7170#issuecomment-519303125> >>>>> >>>>> On the same topic, John pointed out that, from API design perspective, we >>>>> shouldn’t intertwine configuration classes of different operators between >>>>> one another. So instead of introducing new >>>>> `KStream#groupBy(Repartitioned)` >>>>> for specifying number of partitions for internal topic, we should update >>>>> existing `Grouped` class with `numberOfPartitions` field. >>>>> >>>>> Personally, I think Matthias’s concern is valid, but on the other hand >>>>> Kafka Streams has already optimizer in place which alters topology >>>>> independently from user. So maybe it makes sense if Kafka Streams, >>>>> internally would optimize topology in the best way possible, even if in >>>>> some cases this means ignoring some operator configurations passed by the >>>>> user. Also, I agree with John about API design semantics. If we go through >>>>> with the changes for `KStream#groupBy` operation, it makes more sense to >>>>> add `numberOfPartitions` field to `Grouped` class instead of introducing >>>>> new `KStream#groupBy(Repartitioned)` method overload. >>>>> >>>>> I would really appreciate communities feedback on this. >>>>> >>>>> Kind regards, >>>>> Levani >>>>> >>>>> >>>>> >>>>>> On Oct 17, 2019, at 12:57 AM, Sophie Blee-Goldman <sop...@confluent.io> >>>>> wrote: >>>>>> >>>>>> Hey Levani, >>>>>> >>>>>> I think people are busy with the upcoming 2.4 release, and don't have >>>>> much >>>>>> spare time at the >>>>>> moment. It's kind of a difficult time to get attention on things, but >>>>> feel >>>>>> free to pick up something else >>>>>> to work on in the meantime until things have calmed down a bit! >>>>>> >>>>>> Cheers, >>>>>> Sophie >>>>>> >>>>>> >>>>>> On Wed, Oct 16, 2019 at 11:26 AM Levani Kokhreidze < >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>> >>>>>> wrote: >>>>>> >>>>>>> Hello all, >>>>>>> >>>>>>> Sorry for bringing this thread again, but I would like to get some >>>>>>> attention on this PR: https://github.com/apache/kafka/pull/7170 < >>>>> https://github.com/apache/kafka/pull/7170> < >>>>>>> https://github.com/apache/kafka/pull/7170 < >>>>> https://github.com/apache/kafka/pull/7170>> >>>>>>> It's been a while now and I would love to move on to other KIPs as well. >>>>>>> Please let me know if you have any concerns. >>>>>>> >>>>>>> Regards, >>>>>>> Levani >>>>>>> >>>>>>> >>>>>>>> On Jul 26, 2019, at 11:25 AM, Levani Kokhreidze < >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>> >>>>>>> wrote: >>>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> Here’s voting thread for this KIP: >>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html < >>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html> < >>>>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html < >>>>> https://www.mail-archive.com/dev@kafka.apache.org/msg99680.html>> >>>>>>>> >>>>>>>> Regards, >>>>>>>> Levani >>>>>>>> >>>>>>>>> On Jul 24, 2019, at 11:15 PM, Levani Kokhreidze < >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com> >>>>>>> <mailto:levani.co...@gmail.com <mailto:levani.co...@gmail.com>>> wrote: >>>>>>>>> >>>>>>>>> Hi Matthias, >>>>>>>>> >>>>>>>>> Thanks for the suggestion. I Don’t have strong opinion on that one. >>>>>>>>> Agree that avoiding unnecessary method overloads is a good idea. >>>>>>>>> >>>>>>>>> Updated KIP >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Levani >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <matth...@confluent.io >>>>> <mailto:matth...@confluent.io> >>>>>>> <mailto:matth...@confluent.io <mailto:matth...@confluent.io>>> wrote: >>>>>>>>>> >>>>>>>>>> One question: >>>>>>>>>> >>>>>>>>>> Why do we add >>>>>>>>>> >>>>>>>>>>> Repartitioned#with(final String name, final int numberOfPartitions) >>>>>>>>>> >>>>>>>>>> It seems that `#with(String name)`, `#numberOfPartitions(int)` in >>>>>>>>>> combination with `withName()` and `withNumberOfPartitions()` should >>>>> be >>>>>>>>>> sufficient. Users can chain the method calls. >>>>>>>>>> >>>>>>>>>> (I think it's valuable to keep the number of overload small if >>>>>>> possible.) >>>>>>>>>> >>>>>>>>>> Otherwise LGTM. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -Matthias >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 7/23/19 2:18 PM, Levani Kokhreidze wrote: >>>>>>>>>>> Hello, >>>>>>>>>>> >>>>>>>>>>> Thanks all for your feedback. >>>>>>>>>>> I started voting procedure for this KIP. If there’re any other >>>>>>> concerns about this KIP, please let me know. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Levani >>>>>>>>>>> >>>>>>>>>>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze < >>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com> <mailto: >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Matthias, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for the suggestion, makes sense. >>>>>>>>>>>> I’ve updated KIP ( >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> >>>>>>> < >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>> >>>>>>> < >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> >>>>>>> < >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> >>>>>>>>> ). >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Levani >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax < >>>>> matth...@confluent.io <mailto:matth...@confluent.io> >>>>>>> <mailto:matth...@confluent.io <mailto:matth...@confluent.io>> <mailto: >>>>> matth...@confluent.io <mailto:matth...@confluent.io> <mailto: >>>>>>> matth...@confluent.io <mailto:matth...@confluent.io>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for driving the KIP. >>>>>>>>>>>>> >>>>>>>>>>>>> I agree that users need to be able to specify a partitioning >>>>>>> strategy. >>>>>>>>>>>>> >>>>>>>>>>>>> Sophie raises a fair point about topic configs and producer >>>>>>> configs. My >>>>>>>>>>>>> take is, that consider `Repartitioned` as an "extension" to >>>>>>> `Produced`, >>>>>>>>>>>>> that adds topic configuration, is a good way to think about it and >>>>>>> helps >>>>>>>>>>>>> to keep the API "clean". >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> With regard to method names. I would prefer to avoid >>>>> abbreviations. >>>>>>> Can >>>>>>>>>>>>> we rename: >>>>>>>>>>>>> >>>>>>>>>>>>> `withNumOfPartitions` -> `withNumberOfPartitions` >>>>>>>>>>>>> >>>>>>>>>>>>> Furthermore, it might be good to add some more `static` methods: >>>>>>>>>>>>> >>>>>>>>>>>>> - Repartitioned.with(Serde<K>, Serde<V>) >>>>>>>>>>>>> - Repartitioned.withNumberOfPartitions(int) >>>>>>>>>>>>> - Repartitioned.streamPartitioner(StreamPartitioner) >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -Matthias >>>>>>>>>>>>> >>>>>>>>>>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote: >>>>>>>>>>>>>> Totally agree. I think in KStream interface it makes sense to >>>>> have >>>>>>> some duplicate configurations between operators in order to keep API >>>>> simple >>>>>>> and usable. >>>>>>>>>>>>>> Also, as more surface API has, harder it is to have proper >>>>>>> backward compatibility. >>>>>>>>>>>>>> While initial idea of keeping topic level configs separate was >>>>>>> exciting, having Repartitioned class encapsulate some producer level >>>>>>> configs makes API more readable. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>> Levani >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman < >>>>>>> sop...@confluent.io <mailto:sop...@confluent.io> <mailto: >>>>> sop...@confluent.io <mailto:sop...@confluent.io>> <mailto: >>>>>>> sop...@confluent.io <mailto:sop...@confluent.io> <mailto: >>>>> sop...@confluent.io <mailto:sop...@confluent.io>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think that is a good point about trying to keep producer level >>>>>>>>>>>>>>> configurations and (repartition) topic level considerations >>>>>>> separate. >>>>>>>>>>>>>>> Number of partitions is definitely purely a topic level >>>>>>> configuration. But >>>>>>>>>>>>>>> on some level, serdes and partitioners are just as much a topic >>>>>>>>>>>>>>> configuration as a producer one. You could have two producers >>>>>>> configured >>>>>>>>>>>>>>> with different serdes and/or partitioners, but if they are >>>>>>> writing to the >>>>>>>>>>>>>>> same topic the result would be very difficult to part. So in a >>>>>>> sense, these >>>>>>>>>>>>>>> are configurations of topics in Streams, not just producers. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Another way to think of it: while the Streams API is not always >>>>>>> true to >>>>>>>>>>>>>>> this, ideally all the relevant configs for an operator are >>>>>>> wrapped into a >>>>>>>>>>>>>>> single object (in this case, Repartitioned). We could instead >>>>>>> split out the >>>>>>>>>>>>>>> fields in common with Produced into a separate parameter to keep >>>>>>> topic and >>>>>>>>>>>>>>> producer level configurations separate, but this increases the >>>>>>> API surface >>>>>>>>>>>>>>> area by a lot. It's much more straightforward to just say "this >>>>> is >>>>>>>>>>>>>>> everything that this particular operator needs" without worrying >>>>>>> about what >>>>>>>>>>>>>>> exactly you're specifying. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I suppose you could alternatively make Produced a field of >>>>>>> Repartitioned, >>>>>>>>>>>>>>> but I don't think we do this kind of composition elsewhere in >>>>>>> Streams at >>>>>>>>>>>>>>> the moment >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze < >>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com> <mailto: >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>> <mailto: >>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com> <mailto: >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Bill, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks a lot for the feedback. >>>>>>>>>>>>>>>> Yes, that makes sense. I’ve updated KIP with >>>>>>> `Repartitioned#partitioner` >>>>>>>>>>>>>>>> configuration. >>>>>>>>>>>>>>>> In the beginning, I wanted to introduce a class for topic level >>>>>>>>>>>>>>>> configuration and keep topic level and producer level >>>>>>> configurations (such >>>>>>>>>>>>>>>> as Produced) separately (see my second email in this thread). >>>>>>>>>>>>>>>> But while looking at the semantics of KStream interface, I >>>>>>> couldn’t really >>>>>>>>>>>>>>>> figure out good operation name for Topic level configuration >>>>>>> class and just >>>>>>>>>>>>>>>> introducing `Topic` config class was kinda breaking the >>>>>>> semantics. >>>>>>>>>>>>>>>> So I think having Repartitioned class which encapsulates topic >>>>>>> and >>>>>>>>>>>>>>>> producer level configurations for internal topics is viable >>>>>>> thing to do. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <bbej...@gmail.com >>>>> <mailto:bbej...@gmail.com> >>>>>>> <mailto:bbej...@gmail.com <mailto:bbej...@gmail.com>> <mailto: >>>>> bbej...@gmail.com <mailto:bbej...@gmail.com> <mailto: >>>>>>> bbej...@gmail.com <mailto:bbej...@gmail.com>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Lavani, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks for resurrecting this KIP. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm also a +1 for adding a partition option. In addition to >>>>>>> the reason >>>>>>>>>>>>>>>>> provided by John, my reasoning is: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. Users may want to use something other than hash-based >>>>>>> partitioning >>>>>>>>>>>>>>>>> 2. Users may wish to partition on something different than the >>>>>>> key >>>>>>>>>>>>>>>>> without having to change the key. For example: >>>>>>>>>>>>>>>>> 1. A combination of fields in the value in conjunction with >>>>>>> the key >>>>>>>>>>>>>>>>> 2. Something other than the key >>>>>>>>>>>>>>>>> 3. We allow users to specify a partitioner on Produced hence >>>>> in >>>>>>>>>>>>>>>>> KStream.to and KStream.through, so it makes sense for API >>>>>>> consistency. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Just my 2 cents. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> Bill >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze < >>>>>>>>>>>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com> >>>>> <mailto:levani.co...@gmail.com <mailto:levani.co...@gmail.com>> <mailto: >>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com> <mailto: >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>>> >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi John, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> In my mind it makes sense. >>>>>>>>>>>>>>>>>> If we add partitioner configuration to Repartitioned class, >>>>>>> with the >>>>>>>>>>>>>>>>>> combination of specifying number of partitions for internal >>>>>>> topics, user >>>>>>>>>>>>>>>>>> will have opportunity to ensure co-partitioning before join >>>>>>> operation. >>>>>>>>>>>>>>>>>> I think this can be quite powerful feature. >>>>>>>>>>>>>>>>>> Wondering what others think about this? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler < >>>>> j...@confluent.io <mailto:j...@confluent.io> >>>>>>> <mailto:j...@confluent.io <mailto:j...@confluent.io>> <mailto: >>>>> j...@confluent.io <mailto:j...@confluent.io> <mailto: >>>>>>> j...@confluent.io <mailto:j...@confluent.io>>>> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally >>>>>>> sure it >>>>>>>>>>>>>>>>>>> makes sense, but I believe something similar is the >>>>> rationale >>>>>>> for >>>>>>>>>>>>>>>>>>> having the partitioner option in Produced. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> -John >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze >>>>>>>>>>>>>>>>>>> <levani.co...@gmail.com <mailto:levani.co...@gmail.com> >>>>> <mailto:levani.co...@gmail.com <mailto:levani.co...@gmail.com>> >>>>>>> <mailto:levani.co...@gmail.com <mailto:levani.co...@gmail.com> <mailto: >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>>> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hey John, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Oh that’s interesting use-case. >>>>>>>>>>>>>>>>>>>> Do I understand this correctly, in your example I would >>>>>>> first issue >>>>>>>>>>>>>>>>>> repartition(Repartitioned) with proper partitioner that >>>>>>> essentially >>>>>>>>>>>>>>>> would >>>>>>>>>>>>>>>>>> be the same as the topic I want to join with and then do the >>>>>>>>>>>>>>>> KStream#join >>>>>>>>>>>>>>>>>> with DSL? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler < >>>>>>> j...@confluent.io <mailto:j...@confluent.io> <mailto:j...@confluent.io >>>>> <mailto:j...@confluent.io>> <mailto:j...@confluent.io <mailto: >>>>> j...@confluent.io> >>>>>>> <mailto:j...@confluent.io <mailto:j...@confluent.io>>>> >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hey, all, just to chime in, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I think it might be useful to have an option to specify >>>>> the >>>>>>>>>>>>>>>>>>>>> partitioner. The case I have in mind is that some data may >>>>>>> get >>>>>>>>>>>>>>>>>>>>> repartitioned and then joined with an input topic. If the >>>>>>> right-side >>>>>>>>>>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the >>>>>>>>>>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the >>>>>>> same >>>>>>>>>>>>>>>>>>>>> strategy. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Does that make sense, or did I maybe miss something >>>>>>> important? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>> -John >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze >>>>>>>>>>>>>>>>>>>>> <levani.co...@gmail.com <mailto:levani.co...@gmail.com> >>>>> <mailto:levani.co...@gmail.com <mailto:levani.co...@gmail.com>> >>>>>>> <mailto:levani.co...@gmail.com <mailto:levani.co...@gmail.com> <mailto: >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>>> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m >>>>> not >>>>>>> sure >>>>>>>>>>>>>>>> about >>>>>>>>>>>>>>>>>> it yet. >>>>>>>>>>>>>>>>>>>>>> As Kafka Streams DSL user, I don’t really think I would >>>>>>> need control >>>>>>>>>>>>>>>>>> over partitioner for internal topics. >>>>>>>>>>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best >>>>>>> how to >>>>>>>>>>>>>>>>>> partition data for internal topics. >>>>>>>>>>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for >>>>>>> topics >>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>>>> are created by user In advance. >>>>>>>>>>>>>>>>>>>>>> In those cases maybe it make sense to have possibility to >>>>>>> specify >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> partitioner. >>>>>>>>>>>>>>>>>>>>>> I don’t have clear answer on that yet, but I guess >>>>>>> specifying the >>>>>>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on >>>>> this. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman < >>>>>>>>>>>>>>>>>> sop...@confluent.io <mailto:sop...@confluent.io> <mailto: >>>>> sop...@confluent.io <mailto:sop...@confluent.io>> <mailto: >>>>>>> sop...@confluent.io <mailto:sop...@confluent.io>>> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned >>>>>>> would be a >>>>>>>>>>>>>>>>>> useful >>>>>>>>>>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have >>>>>>>>>>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to >>>>>>> Produced? I'm not >>>>>>>>>>>>>>>>>> sure how >>>>>>>>>>>>>>>>>>>>>>> widely this feature is really used, but seems it should >>>>> be >>>>>>>>>>>>>>>> available >>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>> repartition topics. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze < >>>>>>>>>>>>>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com> >>>>>>> <mailto:levani.co...@gmail.com <mailto:levani.co...@gmail.com> <mailto: >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>>> >>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hey Sophie, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> In both cases KStream#repartition and >>>>>>>>>>>>>>>>>> KStream#repartition(Repartitioned) >>>>>>>>>>>>>>>>>>>>>>>> topic will be created and managed by Kafka Streams. >>>>>>>>>>>>>>>>>>>>>>>> Idea of Repartitioned is to give user more control over >>>>>>> the topic >>>>>>>>>>>>>>>>>> such as >>>>>>>>>>>>>>>>>>>>>>>> num of partitions. >>>>>>>>>>>>>>>>>>>>>>>> I feel like Repartitioned parameter is something that >>>>> is >>>>>>> missing >>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>> current DSL design. >>>>>>>>>>>>>>>>>>>>>>>> Essentially giving user control over parallelism by >>>>>>> configuring >>>>>>>>>>>>>>>> num >>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>>>>>> partitions for internal topics. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hope this answers your question. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman < >>>>>>>>>>>>>>>>>> sop...@confluent.io <mailto:sop...@confluent.io> <mailto: >>>>> sop...@confluent.io <mailto:sop...@confluent.io>> <mailto: >>>>>>> sop...@confluent.io <mailto:sop...@confluent.io>>> >>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Hey Levani, >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me >>>>> -- >>>>>>> for the >>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned, >>>>>>> will the >>>>>>>>>>>>>>>>>> topic be >>>>>>>>>>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case >>>>> for >>>>>>> the >>>>>>>>>>>>>>>>>> signature >>>>>>>>>>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be >>>>>>> pre-created? The >>>>>>>>>>>>>>>>>> wording >>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method >>>>>>> will >>>>>>>>>>>>>>>>>> auto-create >>>>>>>>>>>>>>>>>>>>>>>>> topics while the other will not. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>>>>>>>>>> Sophie >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze < >>>>>>>>>>>>>>>>>>>>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com> >>>>>>> <mailto:levani.co...@gmail.com <mailto:levani.co...@gmail.com> <mailto: >>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>>>> >>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> One more bump about KIP-221 ( >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> >>>>>>> < >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>> >>>>>>> < >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> >>>>>>> < >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> < >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> < >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> >>>>>>> < >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> ) >>>>>>>>>>>>>>>>>>>>>>>>>> so it doesn’t get lost in mailing list :) >>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear communities opinions/concerns >>>>> about >>>>>>> this KIP. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze < >>>>>>>>>>>>>>>>>> levani.co...@gmail.com >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Kind reminder about this KIP: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>>>>>>>>>>>>>>>>>>>>> < >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze < >>>>>>>>>>>>>>>>>>>>>>>> levani.co...@gmail.com >>>>>>>>>>>>>>>>>>>>>>>>>> <mailto:levani.co...@gmail.com>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated >>>>>>> confluence >>>>>>>>>>>>>>>> page >>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>>>>>>>> the new proposal >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>>>>>>>>>>>>>>>>>>>>> < >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>> >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> I’ve also filled “Rejected Alternatives” section. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to discuss this KIP :) >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> King regards, >>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze < >>>>>>>>>>>>>>>>>>>>>>>> levani.co...@gmail.com >>>>>>>>>>>>>>>>>>>>>>>>>> <mailto:levani.co...@gmail.com>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Matthias, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I like the idea of introducing dedicated `Topic` >>>>>>> class for >>>>>>>>>>>>>>>>>> topic >>>>>>>>>>>>>>>>>>>>>>>>>> configuration for internal operators like >>>>> `groupedBy`. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would be great to hear others opinion about this >>>>> as >>>>>>> well. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax < >>>>>>>>>>>>>>>>>> matth...@confluent.io >>>>>>>>>>>>>>>>>>>>>>>>>> <mailto:matth...@confluent.io>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for >>>>>>> summarizing >>>>>>>>>>>>>>>>>>>>>>>> everything. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed >>>>>>> already (can't >>>>>>>>>>>>>>>>>> really >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to >>>>>>> refresh the >>>>>>>>>>>>>>>>>>>>>>>>>> discussion. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard >>>>> to >>>>>>> the >>>>>>>>>>>>>>>>>> distinction >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and >>>>> operators >>>>>>> that use >>>>>>>>>>>>>>>>>>>>>>>>>> user-created >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might >>>>> indicate >>>>>>> that >>>>>>>>>>>>>>>>>> leaving >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses >>>>>>> use-defined >>>>>>>>>>>>>>>>>> topics) and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an >>>>>>> operator that >>>>>>>>>>>>>>>>>> manages >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is >>>>>>> one >>>>>>>>>>>>>>>> operator >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but >>>>>>> sometimes >>>>>>>>>>>>>>>> not; a >>>>>>>>>>>>>>>>>>>>>>>>>> different >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> name, ie, new operator would make the distinction >>>>>>> clearer. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am >>>>>>> wondering >>>>>>>>>>>>>>>>>> if the >>>>>>>>>>>>>>>>>>>>>>>>>> same >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` does apply and adding >>>>>>> it is >>>>>>>>>>>>>>>>>> semantically >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of >>>>>>> others on >>>>>>>>>>>>>>>>>> this, too. >>>>>>>>>>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>>>>>>>>>> am >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow >>>>> to >>>>>>>>>>>>>>>> configure >>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>>>>>>>>>> certain >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, store. If we >>>>>>> assume that >>>>>>>>>>>>>>>>>> a topic >>>>>>>>>>>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wonder if we should have a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` class and method >>>>>>> instead of >>>>>>>>>>>>>>>>>> a plain >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other >>>>> configs, >>>>>>> like >>>>>>>>>>>>>>>>>> replication >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> factor, retention-time etc, easily, without the >>>>>>> need to >>>>>>>>>>>>>>>>>> change the >>>>>>>>>>>>>>>>>>>>>>>>>> "main >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> API". >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like >>>>>>> them >>>>>>>>>>>>>>>> myself. >>>>>>>>>>>>>>>>>> :) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more though - maybe >>>>> enhancing >>>>>>> Produced >>>>>>>>>>>>>>>>>> with num >>>>>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach. >>>>>>> Let me >>>>>>>>>>>>>>>>>> explain >>>>>>>>>>>>>>>>>>>>>>>> why: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance Produced class with this >>>>>>> configuration, >>>>>>>>>>>>>>>>>> this will >>>>>>>>>>>>>>>>>>>>>>>>>> also affect KStream#to operation. Since KStream#to is >>>>>>> the final >>>>>>>>>>>>>>>>>> sink of >>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>>> topology, for me, it seems to be reasonable >>>>> assumption >>>>>>> that user >>>>>>>>>>>>>>>>>> needs >>>>>>>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>>>>>> manually create sink topic in advance. And in that >>>>>>> case, having >>>>>>>>>>>>>>>>>> num of >>>>>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Looking at Produced class, based on API >>>>>>> contract, seems >>>>>>>>>>>>>>>>>> like >>>>>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is >>>>>>> explicitly for >>>>>>>>>>>>>>>>>> producer >>>>>>>>>>>>>>>>>>>>>>>> (key >>>>>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, partitioner those all >>>>>>> are producer >>>>>>>>>>>>>>>>>>>>>>>> specific >>>>>>>>>>>>>>>>>>>>>>>>>> configurations) and num of partitions is topic level >>>>>>>>>>>>>>>>>> configuration. And >>>>>>>>>>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level >>>>>>> configurations >>>>>>>>>>>>>>>>>> together in >>>>>>>>>>>>>>>>>>>>>>>> one >>>>>>>>>>>>>>>>>>>>>>>>>> class is the good approach. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Looking at KStream interface, seems like >>>>>>> Produced >>>>>>>>>>>>>>>>>> parameter is >>>>>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal (e.g >>>>> topics >>>>>>> created >>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>>>>>> managed >>>>>>>>>>>>>>>>>>>>>>>>>> internally by Kafka Streams) topics and I think we >>>>>>> should leave >>>>>>>>>>>>>>>>>> it as >>>>>>>>>>>>>>>>>>>>>>>> it is >>>>>>>>>>>>>>>>>>>>>>>>>> in that case. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Taking all this things into account, I think we >>>>>>> should >>>>>>>>>>>>>>>>>> distinguish >>>>>>>>>>>>>>>>>>>>>>>>>> between DSL operations, where Kafka Streams should >>>>>>> create and >>>>>>>>>>>>>>>>>> manage >>>>>>>>>>>>>>>>>>>>>>>>>> internal topics (KStream#groupBy) vs topics that >>>>>>> should be >>>>>>>>>>>>>>>>>> created in >>>>>>>>>>>>>>>>>>>>>>>>>> advance (e.g KStream#to). >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding numPartitions >>>>>>> configuration in >>>>>>>>>>>>>>>>>>>>>>>> Produced >>>>>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level >>>>>>> configuration in >>>>>>>>>>>>>>>>>> one >>>>>>>>>>>>>>>>>>>>>>>> class >>>>>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding making topic name optional in >>>>>>> KStream#through - I >>>>>>>>>>>>>>>>>> think >>>>>>>>>>>>>>>>>>>>>>>>>> underline idea is very useful and giving users >>>>>>> possibility to >>>>>>>>>>>>>>>>>> specify >>>>>>>>>>>>>>>>>>>>>>>> num >>>>>>>>>>>>>>>>>>>>>>>>>> of partitions there is even more useful :) >>>>> Considering >>>>>>> arguments >>>>>>>>>>>>>>>>>> against >>>>>>>>>>>>>>>>>>>>>>>>>> adding num of partitions in Produced class, I see two >>>>>>> options >>>>>>>>>>>>>>>>>> here: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add following method overloads >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and >>>>>>> num of >>>>>>>>>>>>>>>>>> partitions >>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic >>>>> will >>>>>>> be auto >>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final >>>>>>> Produced<K, V> >>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with >>>>>>> specified num of >>>>>>>>>>>>>>>>>>>>>>>> partitions >>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce >>>>>>> new method >>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this >>>>>>> in one of >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>> threads) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Considering all mentioned above I propose the >>>>>>> following >>>>>>>>>>>>>>>> plan: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option A: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to >>>>> Grouped >>>>>>> class (as >>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add following method overloads to >>>>>>> KStream#through >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and >>>>>>> num of >>>>>>>>>>>>>>>>>> partitions >>>>>>>>>>>>>>>>>>>>>>>>>> will be taken from source topic >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic >>>>> will >>>>>>> be auto >>>>>>>>>>>>>>>>>>>>>>>>>> generated with specified num of partitions >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final >>>>>>> Produced<K, V> >>>>>>>>>>>>>>>>>>>>>>>>>> produced) - topic will be with generated with >>>>>>> specified num of >>>>>>>>>>>>>>>>>>>>>>>> partitions >>>>>>>>>>>>>>>>>>>>>>>>>> and configuration taken from produced parameter. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option B: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to >>>>> Grouped >>>>>>> class (as >>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Add new operator KStream#repartition for >>>>>>> creating and >>>>>>>>>>>>>>>>>> managing >>>>>>>>>>>>>>>>>>>>>>>>>> internal repartition topics >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already >>>>>>> discussed in the >>>>>>>>>>>>>>>>>> mailing >>>>>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got with all the threads that were >>>>>>> about this >>>>>>>>>>>>>>>>>> KIP :( >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze < >>>>>>>>>>>>>>>>>>>>>>>>>> levani.co...@gmail.com <mailto: >>>>> levani.co...@gmail.com>> >>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect discussion around >>>>>>> KIP-221. Going >>>>>>>>>>>>>>>>>> through >>>>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there’s seems to agreement >>>>>>> around >>>>>>>>>>>>>>>>>> usefulness of >>>>>>>>>>>>>>>>>>>>>>>> this >>>>>>>>>>>>>>>>>>>>>>>>>> feature. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I >>>>>>> understood, the >>>>>>>>>>>>>>>>>> most >>>>>>>>>>>>>>>>>>>>>>>>>> optimal solution for me seems the following: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to KStream#through >>>>>>> method >>>>>>>>>>>>>>>>>> (essentially >>>>>>>>>>>>>>>>>>>>>>>>>> making topic name optional) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions >>>>>>>>>>>>>>>> configuration >>>>>>>>>>>>>>>>>>>>>>>> field. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to >>>>> control >>>>>>>>>>>>>>>>>> parallelism and >>>>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful >>>>> operations. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will update KIP with interface changes around >>>>>>>>>>>>>>>>>> KStream#through if >>>>>>>>>>>>>>>>>>>>>>>>>> this changes sound sensible. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kind regards, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Levani >>>>> >>>>> >>> >>> >>
signature.asc
Description: OpenPGP digital signature