Re: [VOTE] KIP-159: Introducing Rich functions to Streams

Matthias J. Sax Fri, 17 Nov 2017 10:06:06 -0800

Yes. But I think an aggregation is an many-to-one operation, too.

For the stripping off part: internally, we can just keep some record
context, but just do not allow users to access it (because the context
context does not make sense for them) by hiding the corresponding APIs.



-Matthias

On 11/16/17 10:05 PM, Guozhang Wang wrote:
> Matthias,
> 
> For this idea, are your proposing that for any many-to-one mapping
> operations (for now only Join operators), we will strip off the record
> context in the resulted records and claim "we cannot infer its traced
> context anymore"?
> 
> 
> Guozhang
> 
> 
> On Thu, Nov 16, 2017 at 1:03 PM, Matthias J. Sax <[email protected]>
> wrote:
> 
>> Any thoughts about my latest proposal?
>>
>> -Matthias
>>
>> On 11/10/17 10:02 PM, Jan Filipiak wrote:
>>> Hi,
>>>
>>> i think this is the better way. Naming is always tricky Source is kinda
>>> taken
>>> I had TopicBackedK[Source|Table] in mind
>>> but for the user its way better already IMHO
>>>
>>> Thank you for reconsideration
>>>
>>> Best Jan
>>>
>>>
>>> On 10.11.2017 22:48, Matthias J. Sax wrote:
>>>> I was thinking about the source stream/table idea once more and it seems
>>>> it would not be too hard to implement:
>>>>
>>>> We add two new classes
>>>>
>>>>    SourceKStream extends KStream
>>>>
>>>> and
>>>>
>>>>    SourceKTable extend KTable
>>>>
>>>> and return both from StreamsBuilder#stream and StreamsBuilder#table
>>>>
>>>> As both are sub-classes, this change is backward compatible. We change
>>>> the return type for any single-record transform to this new types, too,
>>>> and use KStream/KTable as return type for any multi-record operation.
>>>>
>>>> The new RecordContext API is added to both new classes. For old classes,
>>>> we only implement KIP-149 to get access to the key.
>>>>
>>>>
>>>> WDYT?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>> On 11/9/17 9:13 PM, Jan Filipiak wrote:
>>>>> Okay,
>>>>>
>>>>> looks like it would _at least work_ for Cached KTableSources .
>>>>> But we make it harder to the user to make mistakes by putting
>>>>> features into places where they don't make sense and don't
>>>>> help anyone.
>>>>>
>>>>> I once again think that my suggestion is easier to implement and
>>>>> more correct. I will use this email to express my disagreement with the
>>>>> proposed KIP (-1 non binding of course) state that I am open for any
>>>>> questions
>>>>> regarding this. I will also do the usual thing and point out that the
>>>>> friends
>>>>> over at Hive got it correct aswell.
>>>>> One can not user their
>>>>> https://cwiki.apache.org/confluence/display/Hive/
>> LanguageManual+VirtualColumns
>>>>>
>>>>>
>>>>> in any place where its not read from the Sources.
>>>>>
>>>>> With KSQl in mind it makes me sad how this is evolving here.
>>>>>
>>>>> Best Jan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 10.11.2017 01:06, Guozhang Wang wrote:
>>>>>> Hello Jan,
>>>>>>
>>>>>> Regarding your question about caching: today we keep the record
>> context
>>>>>> with the cached entry already so when we flush the cache which may
>>>>>> generate
>>>>>> new records forwarding we will set the record context appropriately;
>>>>>> and
>>>>>> then after the flush is completed we will reset the context to the
>>>>>> record
>>>>>> before the flush happens. But I think when Jeyhun did the PR it is a
>>>>>> good
>>>>>> time to double check on such stages to make sure we are not
>>>>>> introducing any
>>>>>> regressions.
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <
>> [email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I Aggree completely.
>>>>>>>
>>>>>>> Exposing this information in a place where it has no _natural_
>>>>>>> belonging
>>>>>>> might really be a bad blocker in the long run.
>>>>>>>
>>>>>>> Concerning your first point. I would argue its not to hard to have a
>>>>>>> user
>>>>>>> keep track of these. If we still don't want the user
>>>>>>> to keep track of these I would argue that all > projection only <
>>>>>>> transformations on a Source-backed KTable/KStream
>>>>>>> could also return a Ktable/KStream instance of the type we return
>>>>>>> from the
>>>>>>> topology builder.
>>>>>>> Only after any operation that exceeds projection or filter one would
>>>>>>> return a KTable not granting access to this any longer.
>>>>>>>
>>>>>>> Even then its difficult already: I never ran a topology with caching
>>>>>>> but I
>>>>>>> am not even 100% sure what the record Context means behind
>>>>>>> a materialized KTable with Caching? Topic and Partition are probably
>>>>>>> with
>>>>>>> some reasoning but offset is probably only the offset causing the
>>>>>>> flush?
>>>>>>> So one might aswell think to drop offsets from this RecordContext.
>>>>>>>
>>>>>>> Best Jan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 07.11.2017 03:18, Guozhang Wang wrote:
>>>>>>>
>>>>>>>> Regarding the API design (the proposed set of overloads v.s. one
>>>>>>>> overload
>>>>>>>> on #map to enrich the record), I think what we have represents a
>> good
>>>>>>>> trade-off between API succinctness and user convenience: on one
>>>>>>>> hand we
>>>>>>>> definitely want to keep as fewer overloaded functions as possible.
>>>>>>>> But on
>>>>>>>> the other hand if we only do that in, say, the #map() function then
>>>>>>>> this
>>>>>>>> enrichment could be an overkill: think of a topology that has 7
>>>>>>>> operators
>>>>>>>> in a chain, where users want to access the record context on
>>>>>>>> operator #2
>>>>>>>> and #6 only, with the "enrichment" manner they need to do the
>>>>>>>> enrichment
>>>>>>>> on
>>>>>>>> operator #2 and keep it that way until #6. In addition, the
>>>>>>>> RecordContext
>>>>>>>> fields (topic, offset, etc) are really orthogonal to the key-value
>>>>>>>> payloads
>>>>>>>> themselves, so I think separating them into this object is a cleaner
>>>>>>>> way.
>>>>>>>>
>>>>>>>> Regarding the RecordContext inheritance, this is actually a good
>>>>>>>> point
>>>>>>>> that
>>>>>>>> have not been discussed thoroughly before. Here are my my two cents:
>>>>>>>> one
>>>>>>>> natural way would be to inherit the record context from the
>>>>>>>> "triggering"
>>>>>>>> record, for example in a join operator, if the record from stream A
>>>>>>>> triggers the join then the record context is inherited from with
>> that
>>>>>>>> record. This is also aligned with the lower-level PAPI interface. A
>>>>>>>> counter
>>>>>>>> argument, though, would be that this is sort of leaking the internal
>>>>>>>> implementations of the DSL, so that moving forward if we did some
>>>>>>>> refactoring to our join implementations so that the triggering
>>>>>>>> record can
>>>>>>>> change, the RecordContext would also be different. I do not know how
>>>>>>>> much
>>>>>>>> it would really affect end users, but would like to hear your
>>>>>>>> opinions.
>>>>>>>>
>>>>>>> Agreed to 100% exposing this information
>>>>>>>
>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <
>> [email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Jan,
>>>>>>>>> Sorry for late reply.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The API Design doesn't look appealing
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In terms of API design we tried to preserve the java functional
>>>>>>>>> interfaces.
>>>>>>>>> We applied the same set of rich methods for KTable to make it
>>>>>>>>> compatible
>>>>>>>>> with the rest of overloaded APIs.
>>>>>>>>>
>>>>>>>>> It should be 100% sufficient to offer a KTable + KStream that is
>>>>>>>>> directly
>>>>>>>>>
>>>>>>>>>> feed from a topic with 1 additional overload for the #map()
>>>>>>>>>> methods to
>>>>>>>>>> cover every usecase while keeping the API in a way better state.
>>>>>>>>>>
>>>>>>>>> - IMO this seems a workaround, rather than a direct solution.
>>>>>>>>>
>>>>>>>>> Perhaps we should continue this discussion in DISCUSS thread.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Jeyhun
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak
>>>>>>>>> <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi.
>>>>>>>>>> I do understand that it might come in Handy.
>>>>>>>>>>     From my POV in any relational algebra this is only a
>>>>>>>>>> projection.
>>>>>>>>>> Currently we hide these "fields" that come with the input record.
>>>>>>>>>> It should be 100% sufficient to offer a KTable + KStream that is
>>>>>>>>>> directly
>>>>>>>>>> feed from a topic with 1 additional overload for the #map()
>>>>>>>>>> methods to
>>>>>>>>>> cover every usecase while keeping the API in a way better state.
>>>>>>>>>>
>>>>>>>>>> best Jan
>>>>>>>>>>
>>>>>>>>>> On 06.11.2017 17:52, Matthias J. Sax wrote:
>>>>>>>>>>
>>>>>>>>>>> Jan,
>>>>>>>>>>>
>>>>>>>>>>> I understand what you are saying. However, having a
>>>>>>>>>>> RecordContext is
>>>>>>>>>>> super useful for operations that are applied to input topic. Many
>>>>>>>>>>> users
>>>>>>>>>>> requested this feature -- it's much more convenient that falling
>>>>>>>>>>> back
>>>>>>>>>>>
>>>>>>>>>> to
>>>>>>>>>> transform() to implement a a filter() for example that want to
>>>>>>>>>> access
>>>>>>>>>>> some meta data.
>>>>>>>>>>>
>>>>>>>>>>> Because we cannot distinguish different "origins" of a
>>>>>>>>>>> KStream/KTable,
>>>>>>>>>>>
>>>>>>>>>> I
>>>>>>>>>> am not sure if there would be a better way to do this. The only
>>>>>>>>>>> "workaround" I see, is to have two KStream/KTable interfaces each
>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>> we
>>>>>>>>>> would use the first one for KStream/KTable with "proper"
>>>>>>>>>> RecordContext.
>>>>>>>>>>> But this does not seem to be a good solution either.
>>>>>>>>>>>
>>>>>>>>>>> Note, a KTable can also be read directly from a topic, I agree
>>>>>>>>>>> that
>>>>>>>>>>> using RecordContext on a KTable that is the result of an
>>>>>>>>>>> aggregation is
>>>>>>>>>>> questionable. But I don't see a reason to down vote the KIP for
>>>>>>>>>>> this
>>>>>>>>>>>
>>>>>>>>>> reason.
>>>>>>>>>>
>>>>>>>>>>> WDYT about this?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 11/1/17 10:19 PM, Jan Filipiak wrote:
>>>>>>>>>>>
>>>>>>>>>>>> -1 non binding
>>>>>>>>>>>>
>>>>>>>>>>>> I don't get the motivation.
>>>>>>>>>>>> In 80% of my DSL processors there is no such thing as a
>>>>>>>>>>>> reasonable
>>>>>>>>>>>> RecordContext.
>>>>>>>>>>>> After a join  the record I am processing belongs to at least 2
>>>>>>>>>>>> topics.
>>>>>>>>>>>> After a Group by the record I am processing was created from
>>>>>>>>>>>> multiple
>>>>>>>>>>>> offsets.
>>>>>>>>>>>>
>>>>>>>>>>>> The API Design doesn't look appealing
>>>>>>>>>>>>
>>>>>>>>>>>> Best Jan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 01.11.2017 22:02, Jeyhun Karimov wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear community,
>>>>>>>>>>>>>
>>>>>>>>>>>>> It seems the discussion for KIP-159 [1] converged finally. I
>>>>>>>>>>>>> would
>>>>>>>>>>>>> like to
>>>>>>>>>>>>> initiate voting for the particular KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>
>>>>>>>>>>>> 159%3A+Introducing+Rich+functions+to+Streams
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Jeyhun
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>
>>
>>
> 
>

signature.asc
Description: OpenPGP digital signature

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

Reply via email to