Hey John, Oh that’s interesting use-case. Do I understand this correctly, in your example I would first issue repartition(Repartitioned) with proper partitioner that essentially would be the same as the topic I want to join with and then do the KStream#join with DSL?
Regards, Levani > On Jul 17, 2019, at 11:11 PM, John Roesler <j...@confluent.io> wrote: > > Hey, all, just to chime in, > > I think it might be useful to have an option to specify the > partitioner. The case I have in mind is that some data may get > repartitioned and then joined with an input topic. If the right-side > input topic uses a custom partitioning strategy, then the > repartitioned stream also needs to be partitioned with the same > strategy. > > Does that make sense, or did I maybe miss something important? > > Thanks, > -John > > On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze > <levani.co...@gmail.com> wrote: >> >> Yes, I was thinking about it as well. To be honest I’m not sure about it yet. >> As Kafka Streams DSL user, I don’t really think I would need control over >> partitioner for internal topics. >> As a user, I would assume that Kafka Streams knows best how to partition >> data for internal topics. >> In this KIP I wrote that Produced should be used only for topics that are >> created by user In advance. >> In those cases maybe it make sense to have possibility to specify the >> partitioner. >> I don’t have clear answer on that yet, but I guess specifying the >> partitioner can be added as well if there’s agreement on this. >> >> Regards, >> Levani >> >>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <sop...@confluent.io> >>> wrote: >>> >>> Thanks for clearing that up. I agree that Repartitioned would be a useful >>> addition. I'm wondering if it might also need to have >>> a withStreamPartitioner method/field, similar to Produced? I'm not sure how >>> widely this feature is really used, but seems it should be available for >>> repartition topics. >>> >>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <levani.co...@gmail.com> >>> wrote: >>> >>>> Hey Sophie, >>>> >>>> In both cases KStream#repartition and KStream#repartition(Repartitioned) >>>> topic will be created and managed by Kafka Streams. >>>> Idea of Repartitioned is to give user more control over the topic such as >>>> num of partitions. >>>> I feel like Repartitioned parameter is something that is missing in >>>> current DSL design. >>>> Essentially giving user control over parallelism by configuring num of >>>> partitions for internal topics. >>>> >>>> Hope this answers your question. >>>> >>>> Regards, >>>> Levani >>>> >>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <sop...@confluent.io> >>>> wrote: >>>>> >>>>> Hey Levani, >>>>> >>>>> Thanks for the KIP! Can you clarify one thing for me -- for the >>>>> KStream#repartition signature taking a Repartitioned, will the topic be >>>>> auto-created by Streams (which seems to be the case for the signature >>>>> without a Repartitioned) or does it have to be pre-created? The wording >>>> in >>>>> the KIP makes it seem like one version of the method will auto-create >>>>> topics while the other will not. >>>>> >>>>> Cheers, >>>>> Sophie >>>>> >>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze < >>>> levani.co...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hello, >>>>>> >>>>>> One more bump about KIP-221 ( >>>>>> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> < >>>>>> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>> ) >>>>>> so it doesn’t get lost in mailing list :) >>>>>> Would love to hear communities opinions/concerns about this KIP. >>>>>> >>>>>> Regards, >>>>>> Levani >>>>>> >>>>>> >>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <levani.co...@gmail.com >>>>> >>>>>> wrote: >>>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> Kind reminder about this KIP: >>>>>> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> < >>>>>> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> Levani >>>>>>> >>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze < >>>> levani.co...@gmail.com >>>>>> <mailto:levani.co...@gmail.com>> wrote: >>>>>>>> >>>>>>>> Hello, >>>>>>>> >>>>>>>> In order to move this KIP forward, I’ve updated confluence page with >>>>>> the new proposal >>>>>> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>> < >>>>>> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint >>>>>>> >>>>>>>> I’ve also filled “Rejected Alternatives” section. >>>>>>>> >>>>>>>> Looking forward to discuss this KIP :) >>>>>>>> >>>>>>>> King regards, >>>>>>>> Levani >>>>>>>> >>>>>>>> >>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze < >>>> levani.co...@gmail.com >>>>>> <mailto:levani.co...@gmail.com>> wrote: >>>>>>>>> >>>>>>>>> Hello Matthias, >>>>>>>>> >>>>>>>>> Thanks for the feedback and ideas. >>>>>>>>> I like the idea of introducing dedicated `Topic` class for topic >>>>>> configuration for internal operators like `groupedBy`. >>>>>>>>> Would be great to hear others opinion about this as well. >>>>>>>>> >>>>>>>>> Kind regards, >>>>>>>>> Levani >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <matth...@confluent.io >>>>>> <mailto:matth...@confluent.io>> wrote: >>>>>>>>>> >>>>>>>>>> Levani, >>>>>>>>>> >>>>>>>>>> Thanks for picking up this KIP! And thanks for summarizing >>>> everything. >>>>>>>>>> Even if some points may have been discussed already (can't really >>>>>>>>>> remember), it's helpful to get a good summary to refresh the >>>>>> discussion. >>>>>>>>>> >>>>>>>>>> I think your reasoning makes sense. With regard to the distinction >>>>>>>>>> between operators that manage topics and operators that use >>>>>> user-created >>>>>>>>>> topics: Following this argument, it might indicate that leaving >>>>>>>>>> `through()` as-is (as an operator that uses use-defined topics) and >>>>>>>>>> introducing a new `repartition()` operator (an operator that manages >>>>>>>>>> topics itself) might be good. Otherwise, there is one operator >>>>>>>>>> `through()` that sometimes manages topics but sometimes not; a >>>>>> different >>>>>>>>>> name, ie, new operator would make the distinction clearer. >>>>>>>>>> >>>>>>>>>> About adding `numOfPartitions` to `Grouped`. I am wondering if the >>>>>> same >>>>>>>>>> argument as for `Produced` does apply and adding it is semantically >>>>>>>>>> questionable? Might be good to get opinions of others on this, too. >>>> I >>>>>> am >>>>>>>>>> not sure myself what solution I prefer atm. >>>>>>>>>> >>>>>>>>>> So far, KS uses configuration objects that allow to configure a >>>>>> certain >>>>>>>>>> "entity" like a consumer, producer, store. If we assume that a topic >>>>>> is >>>>>>>>>> a similar entity, I am wonder if we should have a >>>>>>>>>> `Topic#withNumberOfPartitions()` class and method instead of a plain >>>>>>>>>> integer? This would allow us to add other configs, like replication >>>>>>>>>> factor, retention-time etc, easily, without the need to change the >>>>>> "main >>>>>>>>>> API". >>>>>>>>>> >>>>>>>>>> Just want to give some ideas. Not sure if I like them myself. :) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -Matthias >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote: >>>>>>>>>>> Actually, giving it more though - maybe enhancing Produced with num >>>>>> of partitions configuration is not the best approach. Let me explain >>>> why: >>>>>>>>>>> >>>>>>>>>>> 1) If we enhance Produced class with this configuration, this will >>>>>> also affect KStream#to operation. Since KStream#to is the final sink of >>>> the >>>>>> topology, for me, it seems to be reasonable assumption that user needs >>>> to >>>>>> manually create sink topic in advance. And in that case, having num of >>>>>> partitions configuration doesn’t make much sense. >>>>>>>>>>> >>>>>>>>>>> 2) Looking at Produced class, based on API contract, seems like >>>>>> Produced is designed to be something that is explicitly for producer >>>> (key >>>>>> serializer, value serializer, partitioner those all are producer >>>> specific >>>>>> configurations) and num of partitions is topic level configuration. And >>>> I >>>>>> don’t think mixing topic and producer level configurations together in >>>> one >>>>>> class is the good approach. >>>>>>>>>>> >>>>>>>>>>> 3) Looking at KStream interface, seems like Produced parameter is >>>>>> for operations that work with non-internal (e.g topics created and >>>> managed >>>>>> internally by Kafka Streams) topics and I think we should leave it as >>>> it is >>>>>> in that case. >>>>>>>>>>> >>>>>>>>>>> Taking all this things into account, I think we should distinguish >>>>>> between DSL operations, where Kafka Streams should create and manage >>>>>> internal topics (KStream#groupBy) vs topics that should be created in >>>>>> advance (e.g KStream#to). >>>>>>>>>>> >>>>>>>>>>> To sum it up, I think adding numPartitions configuration in >>>> Produced >>>>>> will result in mixing topic and producer level configuration in one >>>> class >>>>>> and it’s gonna break existing API semantics. >>>>>>>>>>> >>>>>>>>>>> Regarding making topic name optional in KStream#through - I think >>>>>> underline idea is very useful and giving users possibility to specify >>>> num >>>>>> of partitions there is even more useful :) Considering arguments against >>>>>> adding num of partitions in Produced class, I see two options here: >>>>>>>>>>> 1) Add following method overloads >>>>>>>>>>> * through() - topic will be auto-generated and num of partitions >>>>>> will be taken from source topic >>>>>>>>>>> * through(final int numOfPartitions) - topic will be auto >>>>>> generated with specified num of partitions >>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V> >>>>>> produced) - topic will be with generated with specified num of >>>> partitions >>>>>> and configuration taken from produced parameter. >>>>>>>>>>> 2) Leave KStream#through as it is and introduce new method - >>>>>> KStream#repartition (I think Matthias suggested this in one of the >>>> threads) >>>>>>>>>>> >>>>>>>>>>> Considering all mentioned above I propose the following plan: >>>>>>>>>>> >>>>>>>>>>> Option A: >>>>>>>>>>> 1) Leave Produced as it is >>>>>>>>>>> 2) Add num of partitions configuration to Grouped class (as >>>>>> mentioned in the KIP) >>>>>>>>>>> 3) Add following method overloads to KStream#through >>>>>>>>>>> * through() - topic will be auto-generated and num of partitions >>>>>> will be taken from source topic >>>>>>>>>>> * through(final int numOfPartitions) - topic will be auto >>>>>> generated with specified num of partitions >>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V> >>>>>> produced) - topic will be with generated with specified num of >>>> partitions >>>>>> and configuration taken from produced parameter. >>>>>>>>>>> >>>>>>>>>>> Option B: >>>>>>>>>>> 1) Leave Produced as it is >>>>>>>>>>> 2) Add num of partitions configuration to Grouped class (as >>>>>> mentioned in the KIP) >>>>>>>>>>> 3) Add new operator KStream#repartition for creating and managing >>>>>> internal repartition topics >>>>>>>>>>> >>>>>>>>>>> P.S. I’m sorry if all of this was already discussed in the mailing >>>>>> list, but I kinda got with all the threads that were about this KIP :( >>>>>>>>>>> >>>>>>>>>>> Kind regards, >>>>>>>>>>> Levani >>>>>>>>>>> >>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze < >>>>>> levani.co...@gmail.com <mailto:levani.co...@gmail.com>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hello, >>>>>>>>>>>> >>>>>>>>>>>> I would like to resurrect discussion around KIP-221. Going through >>>>>> the discussion thread, there’s seems to agreement around usefulness of >>>> this >>>>>> feature. >>>>>>>>>>>> Regarding the implementation, as far as I understood, the most >>>>>> optimal solution for me seems the following: >>>>>>>>>>>> >>>>>>>>>>>> 1) Add two method overloads to KStream#through method (essentially >>>>>> making topic name optional) >>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions configuration >>>> field. >>>>>>>>>>>> >>>>>>>>>>>> Those two changes will allow DSL users to control parallelism and >>>>>> trigger re-partition without doing stateful operations. >>>>>>>>>>>> >>>>>>>>>>>> I will update KIP with interface changes around KStream#through if >>>>>> this changes sound sensible. >>>>>>>>>>>> >>>>>>>>>>>> Kind regards, >>>>>>>>>>>> Levani >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>> >>>> >>