Re: [VOTE] KIP-159: Introducing Rich functions to Streams

Jeyhun Karimov Tue, 07 Nov 2017 04:51:36 -0800

Hi Jan,

Thanks for your comments. I agree that the implementation should not
introduce new "bugs" or "known issues" in future.
I think  we can either i) just drop RecordContext argument for join methods
or ii) introduce binary aggregation logic for RecordContexts for
two-input-stream-operators.
Any other comments/suggestions of course are welcome.



Cheers,
Jeyhun

On Tue, Nov 7, 2017 at 1:04 PM Jan Filipiak <[email protected]>
wrote:

>
> On 07.11.2017 12:59, Jan Filipiak wrote:
> >
> > On 07.11.2017 11:20, Matthias J. Sax wrote:
> >> About implementation if we do the KIP as proposed: I agree with Guozhang
> >> that we would need to use the currently processed record's metadata in
> >> the context. This does leak some implementation details, but I
> >> personally don't see a big issue here (at the same time, I am also fine
> >> to remove the RecordContext for joins if people think it's an issue).
> >>
> >> About the API: while I agree with Jan, that having two APIs for input
> >> streams/tables and "derived" streams/table (ie, result of
> >> KStream-KStream join or an aggregation) would be a way to avoid some
> >> semantic issue, I am not sure if it is worth the effort. IMHO, it would
> >> make the API more convoluted and if users access the RecordContext on a
> >> derived stream/table it's a "user error"
> > Why make it easy for the users to make mistakes in order to save some
> > effort
> > (That I dont quite think is that big actually)
> >> -- it's not really wrong as
> >> users still get the current records context, but of course, we would
> >> leak implementation details (as above, I don't see a bit issue here
> >> though).
> >>
> >> At the same time, I disagree with Jan that "its not to hard to have a
> >> user keeping track" -- if we apply this argument, we could even argue
> >> that it's not to hard to use a Transformer instead of a map/filter etc.
> >> We want to add "syntactic sugar" with this change and thus should really
> >> provide value and not introduce a half-baked solution for which users
> >> still need to do manual customizing.
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 11/7/17 5:54 AM, Jan Filipiak wrote:
> >>> I Aggree completely.
> >>>
> >>> Exposing this information in a place where it has no _natural_
> >>> belonging
> >>> might really be a bad blocker in the long run.
> >>>
> >>> Concerning your first point. I would argue its not to hard to have a
> >>> user keep track of these. If we still don't want the user
> >>> to keep track of these I would argue that all > projection only <
> >>> transformations on a Source-backed KTable/KStream
> >>> could also return a Ktable/KStream instance of the type we return from
> >>> the topology builder.
> >>> Only after any operation that exceeds projection or filter one would
> >>> return a KTable not granting access to this any longer.
> >>>
> >>> Even then its difficult already: I never ran a topology with caching
> >>> but
> >>> I am not even 100% sure what the record Context means behind
> >>> a materialized KTable with Caching? Topic and Partition are probably
> >>> with some reasoning but offset is probably only the offset causing the
> >>> flush?
> >>> So one might aswell think to drop offsets from this RecordContext.
> >>>
> >>> Best Jan
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 07.11.2017 03:18, Guozhang Wang wrote:
> >>>> Regarding the API design (the proposed set of overloads v.s. one
> >>>> overload
> >>>> on #map to enrich the record), I think what we have represents a good
> >>>> trade-off between API succinctness and user convenience: on one
> >>>> hand we
> >>>> definitely want to keep as fewer overloaded functions as possible.
> >>>> But on
> >>>> the other hand if we only do that in, say, the #map() function then
> >>>> this
> >>>> enrichment could be an overkill: think of a topology that has 7
> >>>> operators
> >>>> in a chain, where users want to access the record context on
> >>>> operator #2
> >>>> and #6 only, with the "enrichment" manner they need to do the
> >>>> enrichment on
> >>>> operator #2 and keep it that way until #6. In addition, the
> >>>> RecordContext
> >>>> fields (topic, offset, etc) are really orthogonal to the key-value
> >>>> payloads
> >>>> themselves, so I think separating them into this object is a
> >>>> cleaner way.
> >>>>
> >>>> Regarding the RecordContext inheritance, this is actually a good point
> >>>> that
> >>>> have not been discussed thoroughly before. Here are my my two
> >>>> cents: one
> >>>> natural way would be to inherit the record context from the
> >>>> "triggering"
> >>>> record, for example in a join operator, if the record from stream A
> >>>> triggers the join then the record context is inherited from with that
> >>>> record. This is also aligned with the lower-level PAPI interface. A
> >>>> counter
> >>>> argument, though, would be that this is sort of leaking the internal
> >>>> implementations of the DSL, so that moving forward if we did some
> >>>> refactoring to our join implementations so that the triggering
> >>>> record can
> >>>> change, the RecordContext would also be different. I do not know
> >>>> how much
> >>>> it would really affect end users, but would like to hear your
> >>>> opinions.
> >>> Agreed to 100% exposing this information
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>> On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi Jan,
> >>>>>
> >>>>> Sorry for late reply.
> >>>>>
> >>>>>
> >>>>> The API Design doesn't look appealing
> >>>>>
> >>>>>
> >>>>> In terms of API design we tried to preserve the java functional
> >>>>> interfaces.
> >>>>> We applied the same set of rich methods for KTable to make it
> >>>>> compatible
> >>>>> with the rest of overloaded APIs.
> >>>>>
> >>>>> It should be 100% sufficient to offer a KTable + KStream that is
> >>>>> directly
> >>>>>> feed from a topic with 1 additional overload for the #map()
> >>>>>> methods to
> >>>>>> cover every usecase while keeping the API in a way better state.
> >>>>> - IMO this seems a workaround, rather than a direct solution.
> >>>>>
> >>>>> Perhaps we should continue this discussion in DISCUSS thread.
> >>>>>
> >>>>>
> >>>>> Cheers,
> >>>>> Jeyhun
> >>>>>
> >>>>>
> >>>>> On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak
> >>>>> <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi.
> >>>>>>
> >>>>>> I do understand that it might come in Handy.
> >>>>>>    From my POV in any relational algebra this is only a projection.
> >>>>>> Currently we hide these "fields" that come with the input record.
> >>>>>> It should be 100% sufficient to offer a KTable + KStream that is
> >>>>>> directly
> >>>>>> feed from a topic with 1 additional overload for the #map()
> >>>>>> methods to
> >>>>>> cover every usecase while keeping the API in a way better state.
> >>>>>>
> >>>>>> best Jan
> >>>>>>
> >>>>>> On 06.11.2017 17:52, Matthias J. Sax wrote:
> >>>>>>> Jan,
> >>>>>>>
> >>>>>>> I understand what you are saying. However, having a
> >>>>>>> RecordContext is
> >>>>>>> super useful for operations that are applied to input topic. Many
> >>>>>>> users
> >>>>>>> requested this feature -- it's much more convenient that falling
> >>>>>>> back
> >>>>> to
> >>>>>>> transform() to implement a a filter() for example that want to
> >>>>>>> access
> >>>>>>> some meta data.
> >>>>>>>
> >>>>>>> Because we cannot distinguish different "origins" of a
> >>>>>>> KStream/KTable,
> >>>>> I
> >>>>>>> am not sure if there would be a better way to do this. The only
> >>>>>>> "workaround" I see, is to have two KStream/KTable interfaces
> >>>>>>> each and
> >>>>> we
> >>>>>>> would use the first one for KStream/KTable with "proper"
> >>>>>>> RecordContext.
> >>>>>>> But this does not seem to be a good solution either.
> >>>>>>>
> >>>>>>> Note, a KTable can also be read directly from a topic, I agree that
> >>>>>>> using RecordContext on a KTable that is the result of an
> >>>>>>> aggregation is
> >>>>>>> questionable. But I don't see a reason to down vote the KIP for
> >>>>>>> this
> >>>>>> reason.
> >>>>>>> WDYT about this?
> >>>>>>>
> >>>>>>>
> >>>>>>> -Matthias
> >>>>>>>
> >>>>>>> On 11/1/17 10:19 PM, Jan Filipiak wrote:
> >>>>>>>> -1 non binding
> >>>>>>>>
> >>>>>>>> I don't get the motivation.
> >>>>>>>> In 80% of my DSL processors there is no such thing as a reasonable
> >>>>>>>> RecordContext.
> >>>>>>>> After a join  the record I am processing belongs to at least 2
> >>>>>>>> topics.
> >>>>>>>> After a Group by the record I am processing was created from
> >>>>>>>> multiple
> >>>>>>>> offsets.
> >>>>>>>>
> >>>>>>>> The API Design doesn't look appealing
> >>>>>>>>
> >>>>>>>> Best Jan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 01.11.2017 22:02, Jeyhun Karimov wrote:
> >>>>>>>>> Dear community,
> >>>>>>>>>
> >>>>>>>>> It seems the discussion for KIP-159 [1] converged finally. I
> >>>>>>>>> would
> >>>>>>>>> like to
> >>>>>>>>> initiate voting for the particular KIP.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>> 159%3A+Introducing+Rich+functions+to+Streams
> >>>>>>>>> Cheers,
> >>>>>>>>> Jeyhun
> >>>>>>>>>
> >>>>
> >
>
>

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

Reply via email to