Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Jeyhun Karimov Thu, 04 Oct 2018 14:24:46 -0700

Hi Lei,

Please feel free to take over the KIP.


Cheers,
Jeyhun

On Fri, Sep 21, 2018, 22:27 Lei Chen <ley...@gmail.com> wrote:

> Hi,
>
> Just want to know is anyone actively working on this and also KAFKA-4835
> <https://issues.apache.org/jira/browse/KAFKA-4835>? Seems like the JIRA
> has been inactive for couple months. We want this feature and would like to
> move it forward if no one else is working on it.
>
>
> Lei
>
> On Wed, Jun 20, 2018 at 7:27 PM Matthias J. Sax <matth...@confluent.io>
> wrote:
>
>> No worries. It's just good to know. It seems that some other people are
>> interested to drive this further. So we will just "reassign" it to them.
>>
>> Thanks for letting us know.
>>
>>
>> -Matthias
>>
>> On 6/20/18 2:51 PM, Jeyhun Karimov wrote:
>> > Hi Matthias, all,
>> >
>> > Currently, I am not able to complete this KIP. Please accept my
>> > apologies for that.
>> >
>> >
>> > Cheers,
>> > Jeyhun
>> >
>> > On Mon, Jun 11, 2018 at 2:25 AM Matthias J. Sax <matth...@confluent.io
>> > <mailto:matth...@confluent.io>> wrote:
>> >
>> >     What is the status of this KIP?
>> >
>> >     -Matthias
>> >
>> >
>> >     On 2/13/18 1:43 PM, Matthias J. Sax wrote:
>> >     > Is there any update for this KIP?
>> >     >
>> >     >
>> >     > -Matthias
>> >     >
>> >     > On 12/4/17 2:08 PM, Matthias J. Sax wrote:
>> >     >> Jeyhun,
>> >     >>
>> >     >> thanks for updating the KIP.
>> >     >>
>> >     >> I am wondering if you intend to add a new class `Produced`?
>> There is
>> >     >> already `org.apache.kafka.streams.kstream.Produced`. So if we
>> want to
>> >     >> add a new class, it must have a different name -- or we might be
>> >     able to
>> >     >> merge both into one?
>> >     >>
>> >     >> Also, for the KStream overlaods of `through()` and `to()`, can
>> >     you add
>> >     >> the different behavior using different overloads? It's not clear
>> from
>> >     >> the KIP what the semantics are.
>> >     >>
>> >     >>
>> >     >> -Matthias
>> >     >>
>> >     >> On 11/17/17 3:27 PM, Jeyhun Karimov wrote:
>> >     >>> Hi,
>> >     >>>
>> >     >>> Thanks for your comments. I agree with Matthias partially.
>> >     >>> I think we should relax some requirements related with to() and
>> >     through()
>> >     >>> methods.
>> >     >>> IMHO, Produced class can cover (existing/to be created) topic
>> >     information,
>> >     >>> and which will ease our effort:
>> >     >>>
>> >     >>> KStream.to(Produced topicInfo)
>> >     >>> KStream.through(Produced topicInfo)
>> >     >>>
>> >     >>> This will decrease the number of overloads but we will need to
>> >     deprecate
>> >     >>> the existing to() and through() methods, perhaps.
>> >     >>> I updated the KIP accordingly.
>> >     >>>
>> >     >>>
>> >     >>> Cheers,
>> >     >>> Jeyhun
>> >     >>>
>> >     >>> On Thu, Nov 16, 2017 at 10:21 PM Matthias J. Sax
>> >     <matth...@confluent.io <mailto:matth...@confluent.io>>
>> >     >>> wrote:
>> >     >>>
>> >     >>>> @Jan:
>> >     >>>>
>> >     >>>> The `Produced` class was introduced in 1.0 to specify key and
>> valud
>> >     >>>> Serdes (and partitioner) if data is written into a topic.
>> >     >>>>
>> >     >>>> Old API:
>> >     >>>>
>> >     >>>> KStream#to("topic", keySerde, valueSerde);
>> >     >>>>
>> >     >>>> New API:
>> >     >>>>
>> >     >>>> KStream#to("topic", Produced.with(keySerde, valueSerde));
>> >     >>>>
>> >     >>>>
>> >     >>>> This allows to reduce the number of overloads for `to()` (and
>> >     >>>> `through()` that follows the same pattern) -- the second
>> >     parameter is
>> >     >>>> used to cover all different variations of option parameters
>> >     users can
>> >     >>>> specify, while we only have 2 overload for `to()` itself.
>> >     >>>>
>> >     >>>> What is still unclear to me it, what you mean by this topic
>> prefix
>> >     >>>> thing? Either a user cares about the topic name and thus, must
>> >     create
>> >     >>>> and manage it manually. Or the user does not care, and Streams
>> >     create
>> >     >>>> it. How would this prefix idea fit in here?
>> >     >>>>
>> >     >>>>
>> >     >>>>
>> >     >>>> @Guozhang:
>> >     >>>>
>> >     >>>> My idea was to extend `Produced` with the hint we want to give
>> for
>> >     >>>> creating internal topic and pass a optional `Produced`
>> >     parameter. There
>> >     >>>> are multiple things we can do here:
>> >     >>>>
>> >     >>>> 1) stream.through(null, Produced...).groupBy().aggregate()
>> >     >>>> -> just allow for `null` topic name indicating that Streams
>> should
>> >     >>>> create an internal topic
>> >     >>>>
>> >     >>>> 2) stream.through(Produced...).groupBy().aggregate()
>> >     >>>> -> add one overload taking an mandatory `Produced`
>> >     >>>>
>> >     >>>> We use `Serialized` to picky back the information
>> >     >>>>
>> >     >>>> 3) stream.groupBy(Serialized...).aggregate()
>> >     >>>> and stream.groupByKey(Serialized...).aggregate()
>> >     >>>> -> we don't need new top level overloads
>> >     >>>>
>> >     >>>>
>> >     >>>> There are different trade-offs for those alternatives and maybe
>> >     there
>> >     >>>> are other ways to change the API. It's just to push the
>> >     discussion further.
>> >     >>>>
>> >     >>>>
>> >     >>>> -Matthias
>> >     >>>>
>> >     >>>> On 11/12/17 1:22 PM, Jan Filipiak wrote:
>> >     >>>>> Hi Gouzhang,
>> >     >>>>>
>> >     >>>>> this felt like these questions are supposed to be answered by
>> me.
>> >     >>>>> I do not understand the first one. I don't understand why the
>> user
>> >     >>>>> shouldn't be able to specify a suffix for the topic name.
>> >     >>>>>
>> >     >>>>>  For the third question I am not 100% familiar if the Produced
>> >     class
>> >     >>>>> came to existence
>> >     >>>>> at all. I remember proposing it somewhere in our redo DSL
>> >     discussion that
>> >     >>>>> I dropped out of later. Finally any call that does:
>> >     >>>>>
>> >     >>>>> 1. create the internal topic
>> >     >>>>> 2. register sink
>> >     >>>>> 3. register source
>> >     >>>>>
>> >     >>>>> will always get the work done. If we have a Produced like
>> >     class. putting
>> >     >>>>> all the parameters
>> >     >>>>> in there make sense. (Partitioner, serde, PartitionHint,
>> >     internal, name
>> >     >>>>> ... )
>> >     >>>>>
>> >     >>>>> Hope this helps?
>> >     >>>>>
>> >     >>>>>
>> >     >>>>> On 10.11.2017 07:54, Guozhang Wang wrote:
>> >     >>>>>> A few clarification questions on the proposal details.
>> >     >>>>>>
>> >     >>>>>> 1. API: although the repartition only happens at the final
>> >     stateful
>> >     >>>>>> operations like agg / join, the repartition flag info was
>> >     actually
>> >     >>>> passed
>> >     >>>>>> from an earlier operator like map / groupBy. So what should
>> >     be the new
>> >     >>>>>> API
>> >     >>>>>> look like? For example, if we do
>> >     >>>>>>
>> >     >>>>>> stream.groupBy().through("topic-name", Produced..).aggregate
>> >     >>>>>>
>> >     >>>>>> This would be add a bunch of APIs to GroupedKStream/KTable
>> >     >>>>>>
>> >     >>>>>> 2. Semantics: as Matthias mentioned, today any topics
>> defined in
>> >     >>>>>> "through()" call is considered a user topic, and hence users
>> are
>> >     >>>>>> responsible for managing them, including the topic name. For
>> >     this KIP's
>> >     >>>>>> purpose, though, users would not care about the topic name.
>> >     I.e. as a
>> >     >>>>>> user
>> >     >>>>>> I still want to make it be an internal topic so that I do not
>> >     need to
>> >     >>>>>> worry
>> >     >>>>>> about it at all, but only specify num.partitions.
>> >     >>>>>>
>> >     >>>>>> 3. Details: in Produced we do not have specs for specifying
>> the
>> >     >>>>>> num.partitions or should we repartition or not. So it is
>> >     still not
>> >     >>>>>> clear to
>> >     >>>>>> me how we would make use of that to achieve what's in the old
>> >     >>>>>> proposal's RepartitionHint class.
>> >     >>>>>>
>> >     >>>>>>
>> >     >>>>>>
>> >     >>>>>> Guozhang
>> >     >>>>>>
>> >     >>>>>>
>> >     >>>>>> On Mon, Nov 6, 2017 at 1:21 PM, Ted Yu <yuzhih...@gmail.com
>> >     <mailto:yuzhih...@gmail.com>> wrote:
>> >     >>>>>>
>> >     >>>>>>> bq. enlarge the score of through()
>> >     >>>>>>>
>> >     >>>>>>> I guess you meant scope.
>> >     >>>>>>>
>> >     >>>>>>> On Mon, Nov 6, 2017 at 1:15 PM, Jeyhun Karimov
>> >     <je.kari...@gmail.com <mailto:je.kari...@gmail.com>>
>> >     >>>>>>> wrote:
>> >     >>>>>>>
>> >     >>>>>>>> Hi,
>> >     >>>>>>>>
>> >     >>>>>>>> Sorry for the late reply. I am convinced that we should
>> >     enlarge the
>> >     >>>>>>>> score
>> >     >>>>>>>> of through() (add more overloads) instead of introducing a
>> >     separate
>> >     >>>> set
>> >     >>>>>>> of
>> >     >>>>>>>> overloads to other methods.
>> >     >>>>>>>> I will update the KIP soon based on the discussion and
>> inform.
>> >     >>>>>>>>
>> >     >>>>>>>>
>> >     >>>>>>>> Cheers,
>> >     >>>>>>>> Jeyhun
>> >     >>>>>>>>
>> >     >>>>>>>> On Mon, Nov 6, 2017 at 9:18 PM Jan Filipiak
>> >     <jan.filip...@trivago.com <mailto:jan.filip...@trivago.com>
>> >     >>>>>
>> >     >>>>>>>> wrote:
>> >     >>>>>>>>
>> >     >>>>>>>>> Sorry for not beeing 100% up to date.
>> >     >>>>>>>>> Back then we had the discussion that when an operation
>> >     puts a >Sink<
>> >     >>>>>>>>> into the topology, a >Produced<
>> >     >>>>>>>>> parameter is added. This produced parameter could have
>> >     internal or
>> >     >>>>>>>>> external. If internal I think the name would still make
>> >     >>>>>>>>> a great suffix for the topic name
>> >     >>>>>>>>>
>> >     >>>>>>>>> Is this plan still around? Otherwise having the name as
>> >     suffix is
>> >     >>>>>>>>> probably always good it can help the user quicker to
>> >     identify hot
>> >     >>>>>>> topics
>> >     >>>>>>>>> that need more
>> >     >>>>>>>>> partitions if he has many of these internal repartitions
>> >     >>>>>>>>>
>> >     >>>>>>>>> Best Jan
>> >     >>>>>>>>>
>> >     >>>>>>>>>
>> >     >>>>>>>>> On 06.11.2017 20:13, Matthias J. Sax wrote:
>> >     >>>>>>>>>> I absolute agree with what you say. It's not a
>> requirement to
>> >     >>>>>>> specify a
>> >     >>>>>>>>>> topic name -- and this was the idea -- if user does
>> >     specify a name,
>> >     >>>>>>> we
>> >     >>>>>>>>>> treat as is -- if users does not specify a name, Streams
>> >     create an
>> >     >>>>>>>>>> internal topic.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> The goal of the Jira is to allow a simplified way to
>> control
>> >     >>>>>>>>>> repartitioning (atm, user needs to manually create a
>> >     topic and use
>> >     >>>>>>> via
>> >     >>>>>>>>>> through()).
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Thus, the idea is to make the topic name parameter of
>> through
>> >     >>>>>>> optional.
>> >     >>>>>>>>>> It's of course just an idea. Happy do have a other API
>> >     design. The
>> >     >>>>>>> goal
>> >     >>>>>>>>>> was, to avoid to many new overloads.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>>> Could you clarify exactly what you mean by keeping the
>> >     current
>> >     >>>>>>>>> distinction?
>> >     >>>>>>>>>> Current distinction is: user topics are created manually
>> >     and user
>> >     >>>>>>>>>> specifies the name -- internal topics are created by
>> >     Kafka Streams
>> >     >>>>>>> and
>> >     >>>>>>>>>> an name is generated automatically.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> -> through("user-topic")
>> >     >>>>>>>>>> -> through(TopicConfig.withNumberOfPartitions(5)) //
>> >     Streams creates
>> >     >>>>>>>> an
>> >     >>>>>>>>>> internal topic
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> -Matthias
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> On 11/6/17 6:56 PM, Thomas Becker wrote:
>> >     >>>>>>>>>>> Could you clarify exactly what you mean by keeping the
>> >     current
>> >     >>>>>>>>> distinction?
>> >     >>>>>>>>>>> Actually, re-reading the KIP and JIRA, it's not clear
>> >     that being
>> >     >>>>>>> able
>> >     >>>>>>>>> to specify a custom name is actually a requirement. If the
>> >     goal is to
>> >     >>>>>>>>> control repartitioning and tune parallelism, maybe we can
>> just
>> >     >>>>>>>>> sidestep
>> >     >>>>>>>>> this issue altogether by removing the ability to set a
>> >     different
>> >     >>>> name.
>> >     >>>>>>>>>>> On Mon, 2017-11-06 at 16:51 +0100, Matthias J. Sax
>> wrote:
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> That's a good point. In current design, we strictly
>> >     distinguish
>> >     >>>>>>> both.
>> >     >>>>>>>>>>> For example, the reset tools deletes internal topics
>> >     (starting with
>> >     >>>>>>>>>>> prefix `<application.id <http://application.id>>-` and
>> >     ending with either `-repartition`
>> >     >>>> or
>> >     >>>>>>>>>>> `-changelog`.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Thus, from my point of view, it would make sense to
>> keep the
>> >     >>>> current
>> >     >>>>>>>>>>> distinction.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> -Matthias
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> On 11/6/17 4:45 PM, Thomas Becker wrote:
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> I think this sounds good as well. It's worth clarifying
>> >     whether
>> >     >>>>>>> topics
>> >     >>>>>>>>> that are named by the user but created by streams are
>> >     considered
>> >     >>>>>>>> "internal"
>> >     >>>>>>>>> topics also.
>> >     >>>>>>>>>>> On Sun, 2017-11-05 at 23:02 +0100, Matthias J. Sax
>> wrote:
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> My idea was, to relax the requirement for through() that
>> >     a topic
>> >     >>>>>>> must
>> >     >>>>>>>> be
>> >     >>>>>>>>>>> created manually before startup.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Thus, if no through() call is made, a (internal) topic
>> >     is created
>> >     >>>>>>> the
>> >     >>>>>>>>>>> same way we do it currently.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> If one uses `through(String topicName)` we keep the
>> current
>> >     >>>> behavior
>> >     >>>>>>>> and
>> >     >>>>>>>>>>> require users to create the topic manually.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> The reasoning is as follows: if a user creates a topic
>> >     manually, a
>> >     >>>>>>>> user
>> >     >>>>>>>>>>> can just use it for repartitioning. As the topic is
>> >     already there,
>> >     >>>>>>>> there
>> >     >>>>>>>>>>> is no need to specify any topic configs.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> We add a new `through()` overload (details TBD) that
>> >     allows to
>> >     >>>>>>> specify
>> >     >>>>>>>>>>> topic configs and Streams create the topic with those
>> >     configs.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Reasoning: user don't want to manage topic manually,
>> >     thus, it's
>> >     >>>>>>> still
>> >     >>>>>>>> an
>> >     >>>>>>>>>>> internal topic and Streams create the topic name
>> >     automatically as
>> >     >>>>>>> for
>> >     >>>>>>>>>>> all other internal topics. However, users gets some more
>> >     control
>> >     >>>>>>> about
>> >     >>>>>>>>>>> topic parameters like number of partitions (we should
>> >     discuss what
>> >     >>>>>>>> other
>> >     >>>>>>>>>>> configs would be useful).
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Does this make sense?
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> -Matthias
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> On 11/5/17 1:21 AM, Jan Filipiak wrote:
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Hi.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Im not 100 % up to date what version 1.0 DSL looks like
>> ATM.
>> >     >>>>>>>>>>> I just would argue that repartitioning should be an own
>> >     API call
>> >     >>>>>>> like
>> >     >>>>>>>>>>> through or something.
>> >     >>>>>>>>>>> One can use through or to already to get this. I would
>> >     argue one
>> >     >>>>>>>> should
>> >     >>>>>>>>>>> look there instead of overloads
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Best Jan
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> On 04.11.2017 16:01, Jeyhun Karimov wrote:
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Dear community,
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> I would like to initiate discussion on KIP-221 [1] based
>> >     on issue
>> >     >>>>>>> [2].
>> >     >>>>>>>>>>> Please feel free to comment.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> [1]
>> >     >>>>>>>>>>>
>> >     >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> >     >>>>>>>> 221%3A+Repartition+Topic+Hints+in+Streams
>> >     >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/KAFKA-6037
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Cheers,
>> >     >>>>>>>>>>> Jeyhun
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> ________________________________
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> This email and any attachments may contain confidential
>> and
>> >     >>>>>>> privileged
>> >     >>>>>>>>> material for the sole use of the intended recipient. Any
>> >     review,
>> >     >>>>>>> copying,
>> >     >>>>>>>>> or distribution of this email (or any attachments) by
>> >     others is
>> >     >>>>>>>> prohibited.
>> >     >>>>>>>>> If you are not the intended recipient, please contact the
>> >     sender
>> >     >>>>>>>>> immediately and permanently delete this email and any
>> >     attachments. No
>> >     >>>>>>>>> employee or agent of TiVo Inc. is authorized to conclude
>> >     any binding
>> >     >>>>>>>>> agreement on behalf of TiVo Inc. by email. Binding
>> >     agreements with
>> >     >>>>>>>>> TiVo
>> >     >>>>>>>>> Inc. may only be made by a signed written agreement.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> ________________________________
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> This email and any attachments may contain confidential
>> and
>> >     >>>>>>> privileged
>> >     >>>>>>>>> material for the sole use of the intended recipient. Any
>> >     review,
>> >     >>>>>>> copying,
>> >     >>>>>>>>> or distribution of this email (or any attachments) by
>> >     others is
>> >     >>>>>>>> prohibited.
>> >     >>>>>>>>> If you are not the intended recipient, please contact the
>> >     sender
>> >     >>>>>>>>> immediately and permanently delete this email and any
>> >     attachments. No
>> >     >>>>>>>>> employee or agent of TiVo Inc. is authorized to conclude
>> >     any binding
>> >     >>>>>>>>> agreement on behalf of TiVo Inc. by email. Binding
>> >     agreements with
>> >     >>>>>>>>> TiVo
>> >     >>>>>>>>> Inc. may only be made by a signed written agreement.
>> >     >>>>>>>>>
>> >     >>>>>>
>> >     >>>>>>
>> >     >>>>>
>> >     >>>>
>> >     >>>>
>> >     >>>
>> >     >>
>> >     >
>> >
>>
>>

Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Reply via email to