Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Matthias J. Sax Sun, 10 Jun 2018 17:26:41 -0700

What is the status of this KIP?

-Matthias



On 2/13/18 1:43 PM, Matthias J. Sax wrote:
> Is there any update for this KIP?
> 
> 
> -Matthias
> 
> On 12/4/17 2:08 PM, Matthias J. Sax wrote:
>> Jeyhun,
>>
>> thanks for updating the KIP.
>>
>> I am wondering if you intend to add a new class `Produced`? There is
>> already `org.apache.kafka.streams.kstream.Produced`. So if we want to
>> add a new class, it must have a different name -- or we might be able to
>> merge both into one?
>>
>> Also, for the KStream overloads of `through()` and `to()`, can you describe
>> the different behavior of the different overloads? It's not clear from
>> the KIP what the semantics are.
>>
>>
>> -Matthias
>>
>> On 11/17/17 3:27 PM, Jeyhun Karimov wrote:
>>> Hi,
>>>
>>> Thanks for your comments. I agree with Matthias partially.
>>> I think we should relax some requirements related to the to() and through()
>>> methods.
>>> IMHO, the Produced class can cover the topic information (for an existing or
>>> to-be-created topic), which will ease our effort:
>>>
>>> KStream.to(Produced topicInfo)
>>> KStream.through(Produced topicInfo)
>>>
>>> This will decrease the number of overloads but we will need to deprecate
>>> the existing to() and through() methods, perhaps.
>>> I updated the KIP accordingly.
>>>
>>>
>>> Cheers,
>>> Jeyhun
>>>
>>> On Thu, Nov 16, 2017 at 10:21 PM Matthias J. Sax <matth...@confluent.io>
>>> wrote:
>>>
>>>> @Jan:
>>>>
>>>> The `Produced` class was introduced in 1.0 to specify key and value
>>>> Serdes (and partitioner) if data is written into a topic.
>>>>
>>>> Old API:
>>>>
>>>> KStream#to("topic", keySerde, valueSerde);
>>>>
>>>> New API:
>>>>
>>>> KStream#to("topic", Produced.with(keySerde, valueSerde));
>>>>
>>>>
>>>> This allows us to reduce the number of overloads for `to()` (and
>>>> `through()`, which follows the same pattern) -- the second parameter is
>>>> used to cover all different variations of optional parameters users can
>>>> specify, while we only have 2 overloads for `to()` itself.
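
The overload-reduction pattern described above can be sketched without Kafka on the classpath. This is a minimal illustration only: `SinkOptions` and `SketchStream` are hypothetical stand-ins (they are not Kafka Streams classes), and serdes and partitioners are represented by plain strings.

```java
import java.util.Optional;

// Hypothetical stand-in for a Produced-style parameter object; NOT the Kafka Streams class.
final class SinkOptions {
    final String keySerde;
    final String valueSerde;
    final Optional<String> partitioner;

    private SinkOptions(String keySerde, String valueSerde, Optional<String> partitioner) {
        this.keySerde = keySerde;
        this.valueSerde = valueSerde;
        this.partitioner = partitioner;
    }

    // Factory for the common case: only the two serdes are given.
    static SinkOptions with(String keySerde, String valueSerde) {
        return new SinkOptions(keySerde, valueSerde, Optional.empty());
    }

    // Each further optional setting becomes a method here, not a new overload on the stream.
    SinkOptions withPartitioner(String partitioner) {
        return new SinkOptions(keySerde, valueSerde, Optional.of(partitioner));
    }
}

// Hypothetical stand-in for a stream with exactly two to() overloads.
final class SketchStream {
    private SketchStream() {}

    // Overload 1: topic only, defaults for everything else.
    static String to(String topic) {
        return to(topic, SinkOptions.with("default", "default"));
    }

    // Overload 2: topic plus the parameter object carrying every optional setting.
    static String to(String topic, SinkOptions options) {
        return topic + "[key=" + options.keySerde
            + ", value=" + options.valueSerde
            + ", partitioner=" + options.partitioner.orElse("default") + "]";
    }
}
```

With this shape, adding another optional setting means adding a method to the options class rather than another `to()` overload.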
>>>>
>>>> What is still unclear to me is what you mean by this topic prefix
>>>> thing. Either a user cares about the topic name and thus must create
>>>> and manage it manually, or the user does not care and Streams creates
>>>> it. How would this prefix idea fit in here?
>>>>
>>>>
>>>>
>>>> @Guozhang:
>>>>
>>>> My idea was to extend `Produced` with the hint we want to give for
>>>> creating the internal topic and pass an optional `Produced` parameter. There
>>>> are multiple things we can do here:
>>>>
>>>> 1) stream.through(null, Produced...).groupBy().aggregate()
>>>> -> just allow for `null` topic name indicating that Streams should
>>>> create an internal topic
>>>>
>>>> 2) stream.through(Produced...).groupBy().aggregate()
>>>> -> add one overload taking a mandatory `Produced`
>>>>
>>>> We use `Serialized` to piggyback the information
>>>>
>>>> 3) stream.groupBy(Serialized...).aggregate()
>>>> and stream.groupByKey(Serialized...).aggregate()
>>>> -> we don't need new top level overloads
>>>>
>>>>
>>>> There are different trade-offs for those alternatives and maybe there
>>>> are other ways to change the API. It's just to push the discussion further.
>>>>
>>>>
>>>> -Matthias
>>>>
>>>> On 11/12/17 1:22 PM, Jan Filipiak wrote:
>>>>> Hi Guozhang,
>>>>>
>>>>> it felt like these questions were meant to be answered by me.
>>>>> I do not understand the first one. I don't understand why the user
>>>>> shouldn't be able to specify a suffix for the topic name.
>>>>>
>>>>> For the third question, I am not 100% sure whether the Produced class
>>>>> came into existence at all. I remember proposing it somewhere in our redo
>>>>> DSL discussion that I dropped out of later. Finally, any call that does:
>>>>>
>>>>> 1. create the internal topic
>>>>> 2. register sink
>>>>> 3. register source
>>>>>
>>>>> will always get the work done. If we have a Produced-like class, putting
>>>>> all the parameters in there makes sense (Partitioner, serde,
>>>>> PartitionHint, internal, name ... ).
>>>>>
>>>>> Hope this helps?
>>>>>
>>>>>
>>>>> On 10.11.2017 07:54, Guozhang Wang wrote:
>>>>>> A few clarification questions on the proposal details.
>>>>>>
>>>>>> 1. API: although the repartition only happens at the final stateful
>>>>>> operations like agg / join, the repartition flag info was actually passed
>>>>>> from an earlier operator like map / groupBy. So what should the new API
>>>>>> look like? For example, if we do
>>>>>>
>>>>>> stream.groupBy().through("topic-name", Produced..).aggregate
>>>>>>
>>>>>> This would add a bunch of APIs to GroupedKStream/KTable.
>>>>>>
>>>>>> 2. Semantics: as Matthias mentioned, today any topic defined in a
>>>>>> "through()" call is considered a user topic, and hence users are
>>>>>> responsible for managing it, including the topic name. For this KIP's
>>>>>> purpose, though, users would not care about the topic name. I.e. as a user
>>>>>> I still want it to be an internal topic so that I do not need to worry
>>>>>> about it at all, but only specify num.partitions.
>>>>>>
>>>>>> 3. Details: in Produced we do not have specs for specifying the
>>>>>> num.partitions or whether we should repartition or not. So it is still not
>>>>>> clear to me how we would make use of that to achieve what's in the old
>>>>>> proposal's RepartitionHint class.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 6, 2017 at 1:21 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> bq. enlarge the score of through()
>>>>>>>
>>>>>>> I guess you meant scope.
>>>>>>>
>>>>>>> On Mon, Nov 6, 2017 at 1:15 PM, Jeyhun Karimov <je.kari...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Sorry for the late reply. I am convinced that we should enlarge the score
>>>>>>>> of through() (add more overloads) instead of introducing a separate set of
>>>>>>>> overloads to other methods.
>>>>>>>> I will update the KIP soon based on the discussion and inform.
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jeyhun
>>>>>>>>
>>>>>>>> On Mon, Nov 6, 2017 at 9:18 PM Jan Filipiak <jan.filip...@trivago.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Sorry for not being 100% up to date.
>>>>>>>>> Back then we had the discussion that when an operation puts a >Sink<
>>>>>>>>> into the topology, a >Produced<
>>>>>>>>> parameter is added. This Produced parameter could be internal or
>>>>>>>>> external. If internal, I think the name would still make
>>>>>>>>> a great suffix for the topic name.
>>>>>>>>>
>>>>>>>>> Is this plan still around? Otherwise, having the name as a suffix is
>>>>>>>>> probably always good: it can help users more quickly identify hot topics
>>>>>>>>> that need more partitions if they have many of these internal
>>>>>>>>> repartition topics.
>>>>>>>>>
>>>>>>>>> Best Jan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 06.11.2017 20:13, Matthias J. Sax wrote:
>>>>>>>>>> I absolutely agree with what you say. It's not a requirement to specify a
>>>>>>>>>> topic name -- and this was the idea -- if a user does specify a name, we
>>>>>>>>>> treat it as is -- if a user does not specify a name, Streams creates an
>>>>>>>>>> internal topic.
>>>>>>>>>>
>>>>>>>>>> The goal of the Jira is to allow a simplified way to control
>>>>>>>>>> repartitioning (atm, users need to manually create a topic and use it via
>>>>>>>>>> through()).
>>>>>>>>>>
>>>>>>>>>> Thus, the idea is to make the topic name parameter of through optional.
>>>>>>>>>> It's of course just an idea. Happy to have another API design. The goal
>>>>>>>>>> was to avoid too many new overloads.
>>>>>>>>>>
>>>>>>>>>>>> Could you clarify exactly what you mean by keeping the current
>>>>>>>>>>>> distinction?
>>>>>>>>>> The current distinction is: user topics are created manually and the user
>>>>>>>>>> specifies the name -- internal topics are created by Kafka Streams and
>>>>>>>>>> a name is generated automatically.
>>>>>>>>>>
>>>>>>>>>> -> through("user-topic")
>>>>>>>>>> -> through(TopicConfig.withNumberOfPartitions(5)) // Streams creates an internal topic
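
As a rough sketch of the distinction above, assuming a hypothetical helper (nothing here is a Kafka Streams API): a user-supplied name is used as is, otherwise a name is generated with the `<application.id>-` prefix and `-repartition` suffix convention mentioned elsewhere in this thread. The `operatorName` parameter and the class name are assumptions made purely for illustration.

```java
import java.util.Optional;

// Hypothetical helper; it only models the user-topic vs. internal-topic naming rule.
final class RepartitionTopicSketch {
    private RepartitionTopicSketch() {}

    static String resolveTopicName(String applicationId,
                                   Optional<String> userTopic,
                                   String operatorName) {
        // User topic: the given name is used unchanged, and the user creates and manages the topic.
        // Internal topic: no name was given, so one is generated from the application id.
        return userTopic.orElseGet(() -> applicationId + "-" + operatorName + "-repartition");
    }

    public static void main(String[] args) {
        // "my-app" and "agg-0001" are made-up values.
        System.out.println(resolveTopicName("my-app", Optional.of("user-topic"), "agg-0001"));
        System.out.println(resolveTopicName("my-app", Optional.empty(), "agg-0001"));
    }
}
```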
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11/6/17 6:56 PM, Thomas Becker wrote:
>>>>>>>>>>> Could you clarify exactly what you mean by keeping the current
>>>>>>>>>>> distinction?
>>>>>>>>>>> Actually, re-reading the KIP and JIRA, it's not clear that being able
>>>>>>>>>>> to specify a custom name is actually a requirement. If the goal is to
>>>>>>>>>>> control repartitioning and tune parallelism, maybe we can just sidestep
>>>>>>>>>>> this issue altogether by removing the ability to set a different name.
>>>>>>>>>>> On Mon, 2017-11-06 at 16:51 +0100, Matthias J. Sax wrote:
>>>>>>>>>>>
>>>>>>>>>>> That's a good point. In the current design, we strictly distinguish both.
>>>>>>>>>>> For example, the reset tool deletes internal topics (starting with
>>>>>>>>>>> prefix `<application.id>-` and ending with either `-repartition` or
>>>>>>>>>>> `-changelog`).
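
The naming convention above can be expressed as a simple check. This is a minimal sketch under the assumption that prefix and suffix matching is all that is needed; it is not the reset tool's actual implementation, and `InternalTopicNames` and the `my-app` application id are made up for illustration.

```java
// Hypothetical helper illustrating the internal-topic naming convention.
final class InternalTopicNames {
    private InternalTopicNames() {}

    // True if the name starts with "<application.id>-" and ends with
    // "-repartition" or "-changelog".
    static boolean isInternalTopic(String applicationId, String topicName) {
        String prefix = applicationId + "-";
        return topicName.startsWith(prefix)
            && (topicName.endsWith("-repartition") || topicName.endsWith("-changelog"));
    }

    public static void main(String[] args) {
        String appId = "my-app"; // made-up application.id
        String[] topics = {"my-app-counts-repartition", "my-app-store-changelog", "user-topic"};
        for (String topic : topics) {
            System.out.println(topic + " internal? " + isInternalTopic(appId, topic));
        }
    }
}
```

A user-created topic such as `user-topic` does not match, which is why keeping the two kinds of topic distinct matters for tooling that deletes by name.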
>>>>>>>>>>>
>>>>>>>>>>> Thus, from my point of view, it would make sense to keep the current
>>>>>>>>>>> distinction.
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 11/6/17 4:45 PM, Thomas Becker wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think this sounds good as well. It's worth clarifying whether topics
>>>>>>>>>>> that are named by the user but created by streams are considered "internal"
>>>>>>>>>>> topics also.
>>>>>>>>>>> On Sun, 2017-11-05 at 23:02 +0100, Matthias J. Sax wrote:
>>>>>>>>>>>
>>>>>>>>>>> My idea was to relax the requirement for through() that a topic must be
>>>>>>>>>>> created manually before startup.
>>>>>>>>>>>
>>>>>>>>>>> Thus, if no through() call is made, an (internal) topic is created the
>>>>>>>>>>> same way we do it currently.
>>>>>>>>>>>
>>>>>>>>>>> If one uses `through(String topicName)` we keep the current behavior and
>>>>>>>>>>> require users to create the topic manually.
>>>>>>>>>>>
>>>>>>>>>>> The reasoning is as follows: if a user creates a topic manually, the user
>>>>>>>>>>> can just use it for repartitioning. As the topic is already there, there
>>>>>>>>>>> is no need to specify any topic configs.
>>>>>>>>>>>
>>>>>>>>>>> We add a new `through()` overload (details TBD) that allows users to specify
>>>>>>>>>>> topic configs, and Streams creates the topic with those configs.
>>>>>>>>>>>
>>>>>>>>>>> Reasoning: users don't want to manage the topic manually, thus it's still an
>>>>>>>>>>> internal topic and Streams creates the topic name automatically as for
>>>>>>>>>>> all other internal topics. However, users get some more control over
>>>>>>>>>>> topic parameters like number of partitions (we should discuss what other
>>>>>>>>>>> configs would be useful).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Does this make sense?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 11/5/17 1:21 AM, Jan Filipiak wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm not 100% up to date on what the version 1.0 DSL looks like ATM.
>>>>>>>>>>> I would just argue that repartitioning should be its own API call, like
>>>>>>>>>>> through or something.
>>>>>>>>>>> One can already use through or to to get this. I would argue one should
>>>>>>>>>>> look there instead of at overloads.
>>>>>>>>>>>
>>>>>>>>>>> Best Jan
>>>>>>>>>>>
>>>>>>>>>>> On 04.11.2017 16:01, Jeyhun Karimov wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Dear community,
>>>>>>>>>>>
>>>>>>>>>>> I would like to initiate discussion on KIP-221 [1] based on issue [2].
>>>>>>>>>>> Please feel free to comment.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Repartition+Topic+Hints+in+Streams
>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/KAFKA-6037
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Jeyhun
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ________________________________
>>>>>>>>>>>
>>>>>>>>>>> This email and any attachments may contain confidential and privileged
>>>>>>>>>>> material for the sole use of the intended recipient. Any review, copying,
>>>>>>>>>>> or distribution of this email (or any attachments) by others is prohibited.
>>>>>>>>>>> If you are not the intended recipient, please contact the sender
>>>>>>>>>>> immediately and permanently delete this email and any attachments. No
>>>>>>>>>>> employee or agent of TiVo Inc. is authorized to conclude any binding
>>>>>>>>>>> agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo
>>>>>>>>>>> Inc. may only be made by a signed written agreement.
>>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
