Hello Matthias, Thanks for the feedback and ideas. I like the idea of introducing dedicated `Topic` class for topic configuration for internal operators like `groupedBy`. Would be great to hear others opinion about this as well.
Kind regards, Levani > On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <matth...@confluent.io> wrote: > > Levani, > > Thanks for picking up this KIP! And thanks for summarizing everything. > Even if some points may have been discussed already (can't really > remember), it's helpful to get a good summary to refresh the discussion. > > I think your reasoning makes sense. With regard to the distinction > between operators that manage topics and operators that use user-created > topics: Following this argument, it might indicate that leaving > `through()` as-is (as an operator that uses use-defined topics) and > introducing a new `repartition()` operator (an operator that manages > topics itself) might be good. Otherwise, there is one operator > `through()` that sometimes manages topics but sometimes not; a different > name, ie, new operator would make the distinction clearer. > > About adding `numOfPartitions` to `Grouped`. I am wondering if the same > argument as for `Produced` does apply and adding it is semantically > questionable? Might be good to get opinions of others on this, too. I am > not sure myself what solution I prefer atm. > > So far, KS uses configuration objects that allow to configure a certain > "entity" like a consumer, producer, store. If we assume that a topic is > a similar entity, I am wonder if we should have a > `Topic#withNumberOfPartitions()` class and method instead of a plain > integer? This would allow us to add other configs, like replication > factor, retention-time etc, easily, without the need to change the "main > API". > > Just want to give some ideas. Not sure if I like them myself. :) > > > -Matthias > > > > On 7/1/19 1:04 AM, Levani Kokhreidze wrote: >> Actually, giving it more though - maybe enhancing Produced with num of >> partitions configuration is not the best approach. Let me explain why: >> >> 1) If we enhance Produced class with this configuration, this will also >> affect KStream#to operation. Since KStream#to is the final sink of the >> topology, for me, it seems to be reasonable assumption that user needs to >> manually create sink topic in advance. And in that case, having num of >> partitions configuration doesn’t make much sense. >> >> 2) Looking at Produced class, based on API contract, seems like Produced is >> designed to be something that is explicitly for producer (key serializer, >> value serializer, partitioner those all are producer specific >> configurations) and num of partitions is topic level configuration. And I >> don’t think mixing topic and producer level configurations together in one >> class is the good approach. >> >> 3) Looking at KStream interface, seems like Produced parameter is for >> operations that work with non-internal (e.g topics created and managed >> internally by Kafka Streams) topics and I think we should leave it as it is >> in that case. >> >> Taking all this things into account, I think we should distinguish between >> DSL operations, where Kafka Streams should create and manage internal topics >> (KStream#groupBy) vs topics that should be created in advance (e.g >> KStream#to). >> >> To sum it up, I think adding numPartitions configuration in Produced will >> result in mixing topic and producer level configuration in one class and >> it’s gonna break existing API semantics. >> >> Regarding making topic name optional in KStream#through - I think underline >> idea is very useful and giving users possibility to specify num of >> partitions there is even more useful :) Considering arguments against adding >> num of partitions in Produced class, I see two options here: >> 1) Add following method overloads >> * through() - topic will be auto-generated and num of partitions will >> be taken from source topic >> * through(final int numOfPartitions) - topic will be auto generated >> with specified num of partitions >> * through(final int numOfPartitions, final Produced<K, V> produced) - >> topic will be with generated with specified num of partitions and >> configuration taken from produced parameter. >> 2) Leave KStream#through as it is and introduce new method - >> KStream#repartition (I think Matthias suggested this in one of the threads) >> >> Considering all mentioned above I propose the following plan: >> >> Option A: >> 1) Leave Produced as it is >> 2) Add num of partitions configuration to Grouped class (as mentioned in the >> KIP) >> 3) Add following method overloads to KStream#through >> * through() - topic will be auto-generated and num of partitions will >> be taken from source topic >> * through(final int numOfPartitions) - topic will be auto generated >> with specified num of partitions >> * through(final int numOfPartitions, final Produced<K, V> produced) - >> topic will be with generated with specified num of partitions and >> configuration taken from produced parameter. >> >> Option B: >> 1) Leave Produced as it is >> 2) Add num of partitions configuration to Grouped class (as mentioned in the >> KIP) >> 3) Add new operator KStream#repartition for creating and managing internal >> repartition topics >> >> P.S. I’m sorry if all of this was already discussed in the mailing list, but >> I kinda got with all the threads that were about this KIP :( >> >> Kind regards, >> Levani >> >>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <levani.co...@gmail.com> >>> wrote: >>> >>> Hello, >>> >>> I would like to resurrect discussion around KIP-221. Going through the >>> discussion thread, there’s seems to agreement around usefulness of this >>> feature. >>> Regarding the implementation, as far as I understood, the most optimal >>> solution for me seems the following: >>> >>> 1) Add two method overloads to KStream#through method (essentially making >>> topic name optional) >>> 2) Enhance Produced class with numOfPartitions configuration field. >>> >>> Those two changes will allow DSL users to control parallelism and trigger >>> re-partition without doing stateful operations. >>> >>> I will update KIP with interface changes around KStream#through if this >>> changes sound sensible. >>> >>> Kind regards, >>> Levani >> >> >