Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Matthias J. Sax Mon, 04 Dec 2017 14:09:06 -0800

Jeyhun,

thanks for updating the KIP.


I am wondering if you intend to add a new class `Produced`? There is
already `org.apache.kafka.streams.kstream.Produced`. So if we want to
add a new class, it must have a different name -- or we might be able to
merge both into one?

Also, for the KStream overlaods of `through()` and `to()`, can you add
the different behavior using different overloads? It's not clear from
the KIP what the semantics are.


-Matthias

On 11/17/17 3:27 PM, Jeyhun Karimov wrote:
> Hi,
> 
> Thanks for your comments. I agree with Matthias partially.
> I think we should relax some requirements related with to() and through()
> methods.
> IMHO, Produced class can cover (existing/to be created) topic information,
> and which will ease our effort:
> 
> KStream.to(Produced topicInfo)
> KStream.through(Produced topicInfo)
> 
> This will decrease the number of overloads but we will need to deprecate
> the existing to() and through() methods, perhaps.
> I updated the KIP accordingly.
> 
> 
> Cheers,
> Jeyhun
> 
> On Thu, Nov 16, 2017 at 10:21 PM Matthias J. Sax <[email protected]>
> wrote:
> 
>> @Jan:
>>
>> The `Produced` class was introduced in 1.0 to specify key and valud
>> Serdes (and partitioner) if data is written into a topic.
>>
>> Old API:
>>
>> KStream#to("topic", keySerde, valueSerde);
>>
>> New API:
>>
>> KStream#to("topic", Produced.with(keySerde, valueSerde));
>>
>>
>> This allows to reduce the number of overloads for `to()` (and
>> `through()` that follows the same pattern) -- the second parameter is
>> used to cover all different variations of option parameters users can
>> specify, while we only have 2 overload for `to()` itself.
>>
>> What is still unclear to me it, what you mean by this topic prefix
>> thing? Either a user cares about the topic name and thus, must create
>> and manage it manually. Or the user does not care, and Streams create
>> it. How would this prefix idea fit in here?
>>
>>
>>
>> @Guozhang:
>>
>> My idea was to extend `Produced` with the hint we want to give for
>> creating internal topic and pass a optional `Produced` parameter. There
>> are multiple things we can do here:
>>
>> 1) stream.through(null, Produced...).groupBy().aggregate()
>> -> just allow for `null` topic name indicating that Streams should
>> create an internal topic
>>
>> 2) stream.through(Produced...).groupBy().aggregate()
>> -> add one overload taking an mandatory `Produced`
>>
>> We use `Serialized` to picky back the information
>>
>> 3) stream.groupBy(Serialized...).aggregate()
>> and stream.groupByKey(Serialized...).aggregate()
>> -> we don't need new top level overloads
>>
>>
>> There are different trade-offs for those alternatives and maybe there
>> are other ways to change the API. It's just to push the discussion further.
>>
>>
>> -Matthias
>>
>> On 11/12/17 1:22 PM, Jan Filipiak wrote:
>>> Hi Gouzhang,
>>>
>>> this felt like these questions are supposed to be answered by me.
>>> I do not understand the first one. I don't understand why the user
>>> shouldn't be able to specify a suffix for the topic name.
>>>
>>>  For the third question I am not 100% familiar if the Produced class
>>> came to existence
>>> at all. I remember proposing it somewhere in our redo DSL discussion that
>>> I dropped out of later. Finally any call that does:
>>>
>>> 1. create the internal topic
>>> 2. register sink
>>> 3. register source
>>>
>>> will always get the work done. If we have a Produced like class. putting
>>> all the parameters
>>> in there make sense. (Partitioner, serde, PartitionHint, internal, name
>>> ... )
>>>
>>> Hope this helps?
>>>
>>>
>>> On 10.11.2017 07:54, Guozhang Wang wrote:
>>>> A few clarification questions on the proposal details.
>>>>
>>>> 1. API: although the repartition only happens at the final stateful
>>>> operations like agg / join, the repartition flag info was actually
>> passed
>>>> from an earlier operator like map / groupBy. So what should be the new
>>>> API
>>>> look like? For example, if we do
>>>>
>>>> stream.groupBy().through("topic-name", Produced..).aggregate
>>>>
>>>> This would be add a bunch of APIs to GroupedKStream/KTable
>>>>
>>>> 2. Semantics: as Matthias mentioned, today any topics defined in
>>>> "through()" call is considered a user topic, and hence users are
>>>> responsible for managing them, including the topic name. For this KIP's
>>>> purpose, though, users would not care about the topic name. I.e. as a
>>>> user
>>>> I still want to make it be an internal topic so that I do not need to
>>>> worry
>>>> about it at all, but only specify num.partitions.
>>>>
>>>> 3. Details: in Produced we do not have specs for specifying the
>>>> num.partitions or should we repartition or not. So it is still not
>>>> clear to
>>>> me how we would make use of that to achieve what's in the old
>>>> proposal's RepartitionHint class.
>>>>
>>>>
>>>>
>>>> Guozhang
>>>>
>>>>
>>>> On Mon, Nov 6, 2017 at 1:21 PM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> bq. enlarge the score of through()
>>>>>
>>>>> I guess you meant scope.
>>>>>
>>>>> On Mon, Nov 6, 2017 at 1:15 PM, Jeyhun Karimov <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry for the late reply. I am convinced that we should enlarge the
>>>>>> score
>>>>>> of through() (add more overloads) instead of introducing a separate
>> set
>>>>> of
>>>>>> overloads to other methods.
>>>>>> I will update the KIP soon based on the discussion and inform.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Jeyhun
>>>>>>
>>>>>> On Mon, Nov 6, 2017 at 9:18 PM Jan Filipiak <[email protected]
>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Sorry for not beeing 100% up to date.
>>>>>>> Back then we had the discussion that when an operation puts a >Sink<
>>>>>>> into the topology, a >Produced<
>>>>>>> parameter is added. This produced parameter could have internal or
>>>>>>> external. If internal I think the name would still make
>>>>>>> a great suffix for the topic name
>>>>>>>
>>>>>>> Is this plan still around? Otherwise having the name as suffix is
>>>>>>> probably always good it can help the user quicker to identify hot
>>>>> topics
>>>>>>> that need more
>>>>>>> partitions if he has many of these internal repartitions
>>>>>>>
>>>>>>> Best Jan
>>>>>>>
>>>>>>>
>>>>>>> On 06.11.2017 20:13, Matthias J. Sax wrote:
>>>>>>>> I absolute agree with what you say. It's not a requirement to
>>>>> specify a
>>>>>>>> topic name -- and this was the idea -- if user does specify a name,
>>>>> we
>>>>>>>> treat as is -- if users does not specify a name, Streams create an
>>>>>>>> internal topic.
>>>>>>>>
>>>>>>>> The goal of the Jira is to allow a simplified way to control
>>>>>>>> repartitioning (atm, user needs to manually create a topic and use
>>>>> via
>>>>>>>> through()).
>>>>>>>>
>>>>>>>> Thus, the idea is to make the topic name parameter of through
>>>>> optional.
>>>>>>>> It's of course just an idea. Happy do have a other API design. The
>>>>> goal
>>>>>>>> was, to avoid to many new overloads.
>>>>>>>>
>>>>>>>>>> Could you clarify exactly what you mean by keeping the current
>>>>>>> distinction?
>>>>>>>> Current distinction is: user topics are created manually and user
>>>>>>>> specifies the name -- internal topics are created by Kafka Streams
>>>>> and
>>>>>>>> an name is generated automatically.
>>>>>>>>
>>>>>>>> -> through("user-topic")
>>>>>>>> -> through(TopicConfig.withNumberOfPartitions(5)) // Streams creates
>>>>>> an
>>>>>>>> internal topic
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/6/17 6:56 PM, Thomas Becker wrote:
>>>>>>>>> Could you clarify exactly what you mean by keeping the current
>>>>>>> distinction?
>>>>>>>>> Actually, re-reading the KIP and JIRA, it's not clear that being
>>>>> able
>>>>>>> to specify a custom name is actually a requirement. If the goal is to
>>>>>>> control repartitioning and tune parallelism, maybe we can just
>>>>>>> sidestep
>>>>>>> this issue altogether by removing the ability to set a different
>> name.
>>>>>>>>> On Mon, 2017-11-06 at 16:51 +0100, Matthias J. Sax wrote:
>>>>>>>>>
>>>>>>>>> That's a good point. In current design, we strictly distinguish
>>>>> both.
>>>>>>>>> For example, the reset tools deletes internal topics (starting with
>>>>>>>>> prefix `<application.id>-` and ending with either `-repartition`
>> or
>>>>>>>>> `-changelog`.
>>>>>>>>>
>>>>>>>>> Thus, from my point of view, it would make sense to keep the
>> current
>>>>>>>>> distinction.
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>> On 11/6/17 4:45 PM, Thomas Becker wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think this sounds good as well. It's worth clarifying whether
>>>>> topics
>>>>>>> that are named by the user but created by streams are considered
>>>>>> "internal"
>>>>>>> topics also.
>>>>>>>>> On Sun, 2017-11-05 at 23:02 +0100, Matthias J. Sax wrote:
>>>>>>>>>
>>>>>>>>> My idea was, to relax the requirement for through() that a topic
>>>>> must
>>>>>> be
>>>>>>>>> created manually before startup.
>>>>>>>>>
>>>>>>>>> Thus, if no through() call is made, a (internal) topic is created
>>>>> the
>>>>>>>>> same way we do it currently.
>>>>>>>>>
>>>>>>>>> If one uses `through(String topicName)` we keep the current
>> behavior
>>>>>> and
>>>>>>>>> require users to create the topic manually.
>>>>>>>>>
>>>>>>>>> The reasoning is as follows: if a user creates a topic manually, a
>>>>>> user
>>>>>>>>> can just use it for repartitioning. As the topic is already there,
>>>>>> there
>>>>>>>>> is no need to specify any topic configs.
>>>>>>>>>
>>>>>>>>> We add a new `through()` overload (details TBD) that allows to
>>>>> specify
>>>>>>>>> topic configs and Streams create the topic with those configs.
>>>>>>>>>
>>>>>>>>> Reasoning: user don't want to manage topic manually, thus, it's
>>>>> still
>>>>>> an
>>>>>>>>> internal topic and Streams create the topic name automatically as
>>>>> for
>>>>>>>>> all other internal topics. However, users gets some more control
>>>>> about
>>>>>>>>> topic parameters like number of partitions (we should discuss what
>>>>>> other
>>>>>>>>> configs would be useful).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does this make sense?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/5/17 1:21 AM, Jan Filipiak wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Im not 100 % up to date what version 1.0 DSL looks like ATM.
>>>>>>>>> I just would argue that repartitioning should be an own API call
>>>>> like
>>>>>>>>> through or something.
>>>>>>>>> One can use through or to already to get this. I would argue one
>>>>>> should
>>>>>>>>> look there instead of overloads
>>>>>>>>>
>>>>>>>>> Best Jan
>>>>>>>>>
>>>>>>>>> On 04.11.2017 16:01, Jeyhun Karimov wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear community,
>>>>>>>>>
>>>>>>>>> I would like to initiate discussion on KIP-221 [1] based on issue
>>>>> [2].
>>>>>>>>> Please feel free to comment.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>> 221%3A+Repartition+Topic+Hints+in+Streams
>>>>>>>>> [2] https://issues.apache.org/jira/browse/KAFKA-6037
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Jeyhun
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>>
>>>>>>>>> This email and any attachments may contain confidential and
>>>>> privileged
>>>>>>> material for the sole use of the intended recipient. Any review,
>>>>> copying,
>>>>>>> or distribution of this email (or any attachments) by others is
>>>>>> prohibited.
>>>>>>> If you are not the intended recipient, please contact the sender
>>>>>>> immediately and permanently delete this email and any attachments. No
>>>>>>> employee or agent of TiVo Inc. is authorized to conclude any binding
>>>>>>> agreement on behalf of TiVo Inc. by email. Binding agreements with
>>>>>>> TiVo
>>>>>>> Inc. may only be made by a signed written agreement.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>>
>>>>>>>>> This email and any attachments may contain confidential and
>>>>> privileged
>>>>>>> material for the sole use of the intended recipient. Any review,
>>>>> copying,
>>>>>>> or distribution of this email (or any attachments) by others is
>>>>>> prohibited.
>>>>>>> If you are not the intended recipient, please contact the sender
>>>>>>> immediately and permanently delete this email and any attachments. No
>>>>>>> employee or agent of TiVo Inc. is authorized to conclude any binding
>>>>>>> agreement on behalf of TiVo Inc. by email. Binding agreements with
>>>>>>> TiVo
>>>>>>> Inc. may only be made by a signed written agreement.
>>>>>>>
>>>>
>>>>
>>>
>>
>>
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to