Re: [VOTE] KIP-159: Introducing Rich functions to Streams

Jeyhun Karimov Sat, 18 Nov 2017 01:59:02 -0800

Hi,

I did not expected that Context will be this much an issue. Instead of
applying different semantics for different operators, I think we should
remove this feature completely.



Cheers,
Jeyhun
On Sat 18. Nov 2017 at 07:49, Jan Filipiak <[email protected]> wrote:

> Yes, the mail said only join so I wanted to clarify.
>
>
>
> On 17.11.2017 19:05, Matthias J. Sax wrote:
> > Yes. But I think an aggregation is an many-to-one operation, too.
> >
> > For the stripping off part: internally, we can just keep some record
> > context, but just do not allow users to access it (because the context
> > context does not make sense for them) by hiding the corresponding APIs.
> >
> >
> > -Matthias
> >
> > On 11/16/17 10:05 PM, Guozhang Wang wrote:
> >> Matthias,
> >>
> >> For this idea, are your proposing that for any many-to-one mapping
> >> operations (for now only Join operators), we will strip off the record
> >> context in the resulted records and claim "we cannot infer its traced
> >> context anymore"?
> >>
> >>
> >> Guozhang
> >>
> >>
> >> On Thu, Nov 16, 2017 at 1:03 PM, Matthias J. Sax <[email protected]
> >
> >> wrote:
> >>
> >>> Any thoughts about my latest proposal?
> >>>
> >>> -Matthias
> >>>
> >>> On 11/10/17 10:02 PM, Jan Filipiak wrote:
> >>>> Hi,
> >>>>
> >>>> i think this is the better way. Naming is always tricky Source is
> kinda
> >>>> taken
> >>>> I had TopicBackedK[Source|Table] in mind
> >>>> but for the user its way better already IMHO
> >>>>
> >>>> Thank you for reconsideration
> >>>>
> >>>> Best Jan
> >>>>
> >>>>
> >>>> On 10.11.2017 22:48, Matthias J. Sax wrote:
> >>>>> I was thinking about the source stream/table idea once more and it
> seems
> >>>>> it would not be too hard to implement:
> >>>>>
> >>>>> We add two new classes
> >>>>>
> >>>>>     SourceKStream extends KStream
> >>>>>
> >>>>> and
> >>>>>
> >>>>>     SourceKTable extend KTable
> >>>>>
> >>>>> and return both from StreamsBuilder#stream and StreamsBuilder#table
> >>>>>
> >>>>> As both are sub-classes, this change is backward compatible. We
> change
> >>>>> the return type for any single-record transform to this new types,
> too,
> >>>>> and use KStream/KTable as return type for any multi-record operation.
> >>>>>
> >>>>> The new RecordContext API is added to both new classes. For old
> classes,
> >>>>> we only implement KIP-149 to get access to the key.
> >>>>>
> >>>>>
> >>>>> WDYT?
> >>>>>
> >>>>>
> >>>>> -Matthias
> >>>>>
> >>>>> On 11/9/17 9:13 PM, Jan Filipiak wrote:
> >>>>>> Okay,
> >>>>>>
> >>>>>> looks like it would _at least work_ for Cached KTableSources .
> >>>>>> But we make it harder to the user to make mistakes by putting
> >>>>>> features into places where they don't make sense and don't
> >>>>>> help anyone.
> >>>>>>
> >>>>>> I once again think that my suggestion is easier to implement and
> >>>>>> more correct. I will use this email to express my disagreement with
> the
> >>>>>> proposed KIP (-1 non binding of course) state that I am open for any
> >>>>>> questions
> >>>>>> regarding this. I will also do the usual thing and point out that
> the
> >>>>>> friends
> >>>>>> over at Hive got it correct aswell.
> >>>>>> One can not user their
> >>>>>> https://cwiki.apache.org/confluence/display/Hive/
> >>> LanguageManual+VirtualColumns
> >>>>>>
> >>>>>> in any place where its not read from the Sources.
> >>>>>>
> >>>>>> With KSQl in mind it makes me sad how this is evolving here.
> >>>>>>
> >>>>>> Best Jan
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 10.11.2017 01:06, Guozhang Wang wrote:
> >>>>>>> Hello Jan,
> >>>>>>>
> >>>>>>> Regarding your question about caching: today we keep the record
> >>> context
> >>>>>>> with the cached entry already so when we flush the cache which may
> >>>>>>> generate
> >>>>>>> new records forwarding we will set the record context
> appropriately;
> >>>>>>> and
> >>>>>>> then after the flush is completed we will reset the context to the
> >>>>>>> record
> >>>>>>> before the flush happens. But I think when Jeyhun did the PR it is
> a
> >>>>>>> good
> >>>>>>> time to double check on such stages to make sure we are not
> >>>>>>> introducing any
> >>>>>>> regressions.
> >>>>>>>
> >>>>>>>
> >>>>>>> Guozhang
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <
> >>> [email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I Aggree completely.
> >>>>>>>>
> >>>>>>>> Exposing this information in a place where it has no _natural_
> >>>>>>>> belonging
> >>>>>>>> might really be a bad blocker in the long run.
> >>>>>>>>
> >>>>>>>> Concerning your first point. I would argue its not to hard to
> have a
> >>>>>>>> user
> >>>>>>>> keep track of these. If we still don't want the user
> >>>>>>>> to keep track of these I would argue that all > projection only <
> >>>>>>>> transformations on a Source-backed KTable/KStream
> >>>>>>>> could also return a Ktable/KStream instance of the type we return
> >>>>>>>> from the
> >>>>>>>> topology builder.
> >>>>>>>> Only after any operation that exceeds projection or filter one
> would
> >>>>>>>> return a KTable not granting access to this any longer.
> >>>>>>>>
> >>>>>>>> Even then its difficult already: I never ran a topology with
> caching
> >>>>>>>> but I
> >>>>>>>> am not even 100% sure what the record Context means behind
> >>>>>>>> a materialized KTable with Caching? Topic and Partition are
> probably
> >>>>>>>> with
> >>>>>>>> some reasoning but offset is probably only the offset causing the
> >>>>>>>> flush?
> >>>>>>>> So one might aswell think to drop offsets from this RecordContext.
> >>>>>>>>
> >>>>>>>> Best Jan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 07.11.2017 03:18, Guozhang Wang wrote:
> >>>>>>>>
> >>>>>>>>> Regarding the API design (the proposed set of overloads v.s. one
> >>>>>>>>> overload
> >>>>>>>>> on #map to enrich the record), I think what we have represents a
> >>> good
> >>>>>>>>> trade-off between API succinctness and user convenience: on one
> >>>>>>>>> hand we
> >>>>>>>>> definitely want to keep as fewer overloaded functions as
> possible.
> >>>>>>>>> But on
> >>>>>>>>> the other hand if we only do that in, say, the #map() function
> then
> >>>>>>>>> this
> >>>>>>>>> enrichment could be an overkill: think of a topology that has 7
> >>>>>>>>> operators
> >>>>>>>>> in a chain, where users want to access the record context on
> >>>>>>>>> operator #2
> >>>>>>>>> and #6 only, with the "enrichment" manner they need to do the
> >>>>>>>>> enrichment
> >>>>>>>>> on
> >>>>>>>>> operator #2 and keep it that way until #6. In addition, the
> >>>>>>>>> RecordContext
> >>>>>>>>> fields (topic, offset, etc) are really orthogonal to the
> key-value
> >>>>>>>>> payloads
> >>>>>>>>> themselves, so I think separating them into this object is a
> cleaner
> >>>>>>>>> way.
> >>>>>>>>>
> >>>>>>>>> Regarding the RecordContext inheritance, this is actually a good
> >>>>>>>>> point
> >>>>>>>>> that
> >>>>>>>>> have not been discussed thoroughly before. Here are my my two
> cents:
> >>>>>>>>> one
> >>>>>>>>> natural way would be to inherit the record context from the
> >>>>>>>>> "triggering"
> >>>>>>>>> record, for example in a join operator, if the record from
> stream A
> >>>>>>>>> triggers the join then the record context is inherited from with
> >>> that
> >>>>>>>>> record. This is also aligned with the lower-level PAPI
> interface. A
> >>>>>>>>> counter
> >>>>>>>>> argument, though, would be that this is sort of leaking the
> internal
> >>>>>>>>> implementations of the DSL, so that moving forward if we did some
> >>>>>>>>> refactoring to our join implementations so that the triggering
> >>>>>>>>> record can
> >>>>>>>>> change, the RecordContext would also be different. I do not know
> how
> >>>>>>>>> much
> >>>>>>>>> it would really affect end users, but would like to hear your
> >>>>>>>>> opinions.
> >>>>>>>>>
> >>>>>>>> Agreed to 100% exposing this information
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Guozhang
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <
> >>> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Jan,
> >>>>>>>>>> Sorry for late reply.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The API Design doesn't look appealing
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> In terms of API design we tried to preserve the java functional
> >>>>>>>>>> interfaces.
> >>>>>>>>>> We applied the same set of rich methods for KTable to make it
> >>>>>>>>>> compatible
> >>>>>>>>>> with the rest of overloaded APIs.
> >>>>>>>>>>
> >>>>>>>>>> It should be 100% sufficient to offer a KTable + KStream that is
> >>>>>>>>>> directly
> >>>>>>>>>>
> >>>>>>>>>>> feed from a topic with 1 additional overload for the #map()
> >>>>>>>>>>> methods to
> >>>>>>>>>>> cover every usecase while keeping the API in a way better
> state.
> >>>>>>>>>>>
> >>>>>>>>>> - IMO this seems a workaround, rather than a direct solution.
> >>>>>>>>>>
> >>>>>>>>>> Perhaps we should continue this discussion in DISCUSS thread.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Jeyhun
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak
> >>>>>>>>>> <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi.
> >>>>>>>>>>> I do understand that it might come in Handy.
> >>>>>>>>>>>      From my POV in any relational algebra this is only a
> >>>>>>>>>>> projection.
> >>>>>>>>>>> Currently we hide these "fields" that come with the input
> record.
> >>>>>>>>>>> It should be 100% sufficient to offer a KTable + KStream that
> is
> >>>>>>>>>>> directly
> >>>>>>>>>>> feed from a topic with 1 additional overload for the #map()
> >>>>>>>>>>> methods to
> >>>>>>>>>>> cover every usecase while keeping the API in a way better
> state.
> >>>>>>>>>>>
> >>>>>>>>>>> best Jan
> >>>>>>>>>>>
> >>>>>>>>>>> On 06.11.2017 17:52, Matthias J. Sax wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Jan,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I understand what you are saying. However, having a
> >>>>>>>>>>>> RecordContext is
> >>>>>>>>>>>> super useful for operations that are applied to input topic.
> Many
> >>>>>>>>>>>> users
> >>>>>>>>>>>> requested this feature -- it's much more convenient that
> falling
> >>>>>>>>>>>> back
> >>>>>>>>>>>>
> >>>>>>>>>>> to
> >>>>>>>>>>> transform() to implement a a filter() for example that want to
> >>>>>>>>>>> access
> >>>>>>>>>>>> some meta data.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Because we cannot distinguish different "origins" of a
> >>>>>>>>>>>> KStream/KTable,
> >>>>>>>>>>>>
> >>>>>>>>>>> I
> >>>>>>>>>>> am not sure if there would be a better way to do this. The only
> >>>>>>>>>>>> "workaround" I see, is to have two KStream/KTable interfaces
> each
> >>>>>>>>>>>> and
> >>>>>>>>>>>>
> >>>>>>>>>>> we
> >>>>>>>>>>> would use the first one for KStream/KTable with "proper"
> >>>>>>>>>>> RecordContext.
> >>>>>>>>>>>> But this does not seem to be a good solution either.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note, a KTable can also be read directly from a topic, I agree
> >>>>>>>>>>>> that
> >>>>>>>>>>>> using RecordContext on a KTable that is the result of an
> >>>>>>>>>>>> aggregation is
> >>>>>>>>>>>> questionable. But I don't see a reason to down vote the KIP
> for
> >>>>>>>>>>>> this
> >>>>>>>>>>>>
> >>>>>>>>>>> reason.
> >>>>>>>>>>>
> >>>>>>>>>>>> WDYT about this?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/1/17 10:19 PM, Jan Filipiak wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> -1 non binding
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I don't get the motivation.
> >>>>>>>>>>>>> In 80% of my DSL processors there is no such thing as a
> >>>>>>>>>>>>> reasonable
> >>>>>>>>>>>>> RecordContext.
> >>>>>>>>>>>>> After a join  the record I am processing belongs to at least
> 2
> >>>>>>>>>>>>> topics.
> >>>>>>>>>>>>> After a Group by the record I am processing was created from
> >>>>>>>>>>>>> multiple
> >>>>>>>>>>>>> offsets.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The API Design doesn't look appealing
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best Jan
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 01.11.2017 22:02, Jeyhun Karimov wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Dear community,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It seems the discussion for KIP-159 [1] converged finally. I
> >>>>>>>>>>>>>> would
> >>>>>>>>>>>>>> like to
> >>>>>>>>>>>>>> initiate voting for the particular KIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> 159%3A+Introducing+Rich+functions+to+Streams
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Jeyhun
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>
> >>
>
>

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

Reply via email to