Any thoughts about my latest proposal? -Matthias

On 11/10/17 10:02 PM, Jan Filipiak wrote:
> Hi,
>
> I think this is the better way. Naming is always tricky; "Source" is
> kinda taken. I had TopicBackedK[Stream|Table] in mind, but for the user
> it's way better already IMHO.
>
> Thank you for reconsidering.
>
> Best,
> Jan
>
>
> On 10.11.2017 22:48, Matthias J. Sax wrote:
>> I was thinking about the source stream/table idea once more, and it
>> seems it would not be too hard to implement:
>>
>> We add two new classes
>>
>>   SourceKStream extends KStream
>>
>> and
>>
>>   SourceKTable extends KTable
>>
>> and return both from StreamsBuilder#stream and StreamsBuilder#table.
>>
>> As both are sub-classes, this change is backward compatible. We change
>> the return type of any single-record transform to these new types,
>> too, and use KStream/KTable as the return type of any multi-record
>> operation.
>>
>> The new RecordContext API is added to both new classes. For the old
>> classes, we only implement KIP-149 to get access to the key.
>>
>> WDYT?
>>
>> -Matthias
>>
>> On 11/9/17 9:13 PM, Jan Filipiak wrote:
>>> Okay,
>>>
>>> looks like it would _at least work_ for cached KTable sources.
>>> But we make it easier for the user to make mistakes by putting
>>> features into places where they don't make sense and don't help
>>> anyone.
>>>
>>> I once again think that my suggestion is easier to implement and more
>>> correct. I will use this email to express my disagreement with the
>>> proposed KIP (-1, non-binding, of course) and state that I am open
>>> for any questions regarding this. I will also do the usual thing and
>>> point out that the friends over at Hive got it correct as well:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
>>> One cannot use their virtual columns in any place where the data is
>>> not read from the sources.
>>>
>>> With KSQL in mind, it makes me sad how this is evolving here.
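[Editor's note: a minimal, self-contained sketch of the SourceKStream/SourceKTable idea discussed above, using hypothetical stand-in interfaces rather than the real Kafka Streams API. The subtype relation keeps StreamsBuilder#stream backward compatible, single-record transforms return the narrow type, and only the source-backed type exposes a context-aware overload:]

```java
import java.util.function.Function;

// Hypothetical stand-in interfaces mirroring the proposal in the thread;
// this is NOT the real Kafka Streams API.
public class SourceStreamSketch {

    // Stand-in for the metadata carried with each input record.
    interface RecordContext {
        String topic();
        int partition();
        long offset();
    }

    // Plain KStream: no record-context access (mirrors the existing API shape).
    interface KStream<K, V> {
        <V1> KStream<K, V1> mapValues(Function<V, V1> mapper);
    }

    // Mapper variant that also receives the record context.
    interface RichMapper<V, V1> {
        V1 apply(V value, RecordContext context);
    }

    // Topic-backed stream: single-record transforms keep the narrow type,
    // and only this type exposes the context-aware overload.
    interface SourceKStream<K, V> extends KStream<K, V> {
        @Override
        <V1> SourceKStream<K, V1> mapValues(Function<V, V1> mapper);

        <V1> SourceKStream<K, V1> mapValues(RichMapper<V, V1> mapper);
    }

    public static void main(String[] args) {
        // The subtype relation is what makes the change backward compatible:
        // any SourceKStream can be used wherever a KStream is expected.
        System.out.println(KStream.class.isAssignableFrom(SourceKStream.class)); // true
    }
}
```

[Since SourceKStream is assignable to KStream, existing user code that types its variables as KStream keeps compiling unchanged.]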
>>>
>>> Best,
>>> Jan
>>>
>>>
>>> On 10.11.2017 01:06, Guozhang Wang wrote:
>>>> Hello Jan,
>>>>
>>>> Regarding your question about caching: today we already keep the
>>>> record context with the cached entry, so when we flush the cache
>>>> (which may forward newly generated records) we set the record context
>>>> appropriately; and after the flush is completed, we reset the context
>>>> to the record before the flush happened. But I think when Jeyhun does
>>>> the PR, it is a good time to double-check these stages to make sure
>>>> we are not introducing any regressions.
>>>>
>>>> Guozhang
>>>>
>>>>
>>>> On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak
>>>> <jan.filip...@trivago.com> wrote:
>>>>
>>>>> I agree completely.
>>>>>
>>>>> Exposing this information in a place where it has no _natural_
>>>>> belonging might really be a bad blocker in the long run.
>>>>>
>>>>> Concerning your first point: I would argue it's not too hard to have
>>>>> a user keep track of these. If we still don't want the user to keep
>>>>> track of them, I would argue that all projection-only
>>>>> transformations on a source-backed KTable/KStream could also return
>>>>> a KTable/KStream instance of the type we return from the topology
>>>>> builder. Only after an operation that exceeds projection or filter
>>>>> would one return a KTable that no longer grants access to this.
>>>>>
>>>>> Even then it's difficult already: I never ran a topology with
>>>>> caching, but I am not even 100% sure what the record context means
>>>>> behind a materialized KTable with caching. Topic and partition can
>>>>> probably be reasoned about, but the offset is probably only the
>>>>> offset causing the flush? So one might as well think about dropping
>>>>> offsets from this RecordContext.
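[Editor's note: the save/restore behavior Guozhang describes can be sketched with simplified stand-in types. This illustrates the idea only, not the actual Kafka Streams internals: each cached entry remembers the context of the record that produced it, and a flush temporarily swaps that context in before forwarding, then restores the pre-flush context:]

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified stand-in for the cache flush behavior described in the thread;
// all types here are hypothetical, not the real Kafka Streams internals.
public class CacheFlushSketch {

    record RecordContext(String topic, int partition, long offset) {}
    record CachedEntry(String key, String value, RecordContext context) {}

    static RecordContext currentContext;            // context of the record being processed
    static final Deque<CachedEntry> cache = new ArrayDeque<>();
    static final StringBuilder forwarded = new StringBuilder();

    static void put(String key, String value) {
        // The entry captures the context of the record that created it.
        cache.add(new CachedEntry(key, value, currentContext));
    }

    static void flush() {
        RecordContext saved = currentContext;       // remember the pre-flush context
        for (CachedEntry e : cache) {
            currentContext = e.context();           // forward with the entry's own context
            forwarded.append(e.key()).append('@')
                     .append(currentContext.offset()).append(';');
        }
        cache.clear();
        currentContext = saved;                     // restore after the flush completes
    }

    public static void main(String[] args) {
        currentContext = new RecordContext("input", 0, 1);
        put("a", "v1");
        currentContext = new RecordContext("input", 0, 2);
        put("b", "v2");
        currentContext = new RecordContext("input", 0, 3); // record triggering the flush
        flush();
        System.out.println(forwarded);               // a@1;b@2;
        System.out.println(currentContext.offset()); // 3 -- context was restored
    }
}
```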
>>>>>
>>>>> Best,
>>>>> Jan
>>>>>
>>>>>
>>>>> On 07.11.2017 03:18, Guozhang Wang wrote:
>>>>>
>>>>>> Regarding the API design (the proposed set of overloads vs. one
>>>>>> overload on #map to enrich the record), I think what we have
>>>>>> represents a good trade-off between API succinctness and user
>>>>>> convenience: on the one hand, we definitely want to keep as few
>>>>>> overloaded functions as possible. But on the other hand, if we only
>>>>>> do that in, say, the #map() function, then this enrichment could be
>>>>>> overkill: think of a topology that has 7 operators in a chain,
>>>>>> where users want to access the record context on operators #2 and
>>>>>> #6 only; with the "enrichment" approach, they need to do the
>>>>>> enrichment on operator #2 and keep it that way until #6. In
>>>>>> addition, the RecordContext fields (topic, offset, etc.) are really
>>>>>> orthogonal to the key-value payloads themselves, so I think
>>>>>> separating them into this object is a cleaner way.
>>>>>>
>>>>>> Regarding the RecordContext inheritance, this is actually a good
>>>>>> point that has not been discussed thoroughly before. Here are my
>>>>>> two cents: one natural way would be to inherit the record context
>>>>>> from the "triggering" record; for example, in a join operator, if
>>>>>> the record from stream A triggers the join, then the record context
>>>>>> is inherited from that record. This is also aligned with the
>>>>>> lower-level PAPI interface. A counter-argument, though, would be
>>>>>> that this sort of leaks the internal implementation of the DSL, so
>>>>>> that moving forward, if we did some refactoring of our join
>>>>>> implementations such that the triggering record can change, the
>>>>>> RecordContext would also be different.
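[Editor's note: the 7-operator example above can be illustrated with a self-contained sketch using hypothetical types. Under the "enrichment" approach, operator #2 wraps the payload together with its context, and operators #3 through #5 must thread that wrapper along even though only #6 uses it:]

```java
import java.util.function.Function;

// Hypothetical sketch of the "enrichment" burden described in the thread;
// the types and operator chain here are illustrative, not real Kafka Streams code.
public class EnrichmentSketch {

    record RecordContext(String topic, long offset) {}
    record Enriched<V>(V value, RecordContext context) {}

    // Operators #2..#6 of the hypothetical chain, with the wrapper threaded through.
    static String chain(String payload, RecordContext ctx) {
        // Operator #2: enrich once, because #6 will need the context.
        Enriched<String> e = new Enriched<>(payload.toUpperCase(), ctx);

        // Operators #3..#5 only touch the value, but must unwrap/rewrap anyway.
        Function<Enriched<String>, Enriched<String>> passThrough =
                x -> new Enriched<>(x.value().trim(), x.context());
        e = passThrough.apply(passThrough.apply(passThrough.apply(e)));

        // Operator #6 finally uses the context it has been dragging along.
        return e.value() + "@" + e.context().offset();
    }

    public static void main(String[] args) {
        System.out.println(chain("payload", new RecordContext("input", 42L))); // PAYLOAD@42
    }
}
```

[With per-operator RecordContext access, operators #3 through #5 would keep their plain value types and the wrapper type would disappear entirely.]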
>>>>>> I do not know how much it would really affect end users, but I
>>>>>> would like to hear your opinions.
>>>>>>
>>>>> Agreed to 100% exposing this information.
>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov
>>>>>> <je.kari...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> Sorry for the late reply.
>>>>>>>
>>>>>>>> The API design doesn't look appealing.
>>>>>>>
>>>>>>> In terms of API design, we tried to preserve the Java functional
>>>>>>> interfaces. We applied the same set of rich methods to KTable to
>>>>>>> make it compatible with the rest of the overloaded APIs.
>>>>>>>
>>>>>>>> It should be 100% sufficient to offer a KTable + KStream that is
>>>>>>>> directly fed from a topic, with 1 additional overload for the
>>>>>>>> #map() methods, to cover every use case while keeping the API in
>>>>>>>> a way better state.
>>>>>>>
>>>>>>> IMO this seems a workaround, rather than a direct solution.
>>>>>>>
>>>>>>> Perhaps we should continue this discussion in the DISCUSS thread.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jeyhun
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 6, 2017 at 9:14 PM, Jan Filipiak
>>>>>>> <jan.filip...@trivago.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I do understand that it might come in handy. From my POV, in any
>>>>>>>> relational algebra this is only a projection. Currently we hide
>>>>>>>> these "fields" that come with the input record. It should be 100%
>>>>>>>> sufficient to offer a KTable + KStream that is directly fed from
>>>>>>>> a topic, with 1 additional overload for the #map() methods, to
>>>>>>>> cover every use case while keeping the API in a way better state.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jan
>>>>>>>>
>>>>>>>> On 06.11.2017 17:52, Matthias J. Sax wrote:
>>>>>>>>
>>>>>>>>> Jan,
>>>>>>>>>
>>>>>>>>> I understand what you are saying.
>>>>>>>>> However, having a RecordContext is super useful for operations
>>>>>>>>> that are applied to an input topic. Many users requested this
>>>>>>>>> feature -- it's much more convenient than falling back to
>>>>>>>>> transform() to implement, for example, a filter() that wants to
>>>>>>>>> access some metadata.
>>>>>>>>>
>>>>>>>>> Because we cannot distinguish different "origins" of a
>>>>>>>>> KStream/KTable, I am not sure if there would be a better way to
>>>>>>>>> do this. The only "workaround" I see is to have two
>>>>>>>>> KStream/KTable interfaces each, and we would use the first one
>>>>>>>>> for a KStream/KTable with a "proper" RecordContext. But this
>>>>>>>>> does not seem to be a good solution either.
>>>>>>>>>
>>>>>>>>> Note, a KTable can also be read directly from a topic. I agree
>>>>>>>>> that using RecordContext on a KTable that is the result of an
>>>>>>>>> aggregation is questionable, but I don't see a reason to
>>>>>>>>> down-vote the KIP for this reason.
>>>>>>>>>
>>>>>>>>> WDYT about this?
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>> On 11/1/17 10:19 PM, Jan Filipiak wrote:
>>>>>>>>>
>>>>>>>>>> -1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> I don't get the motivation. In 80% of my DSL processors there
>>>>>>>>>> is no such thing as a reasonable RecordContext. After a join,
>>>>>>>>>> the record I am processing belongs to at least 2 topics. After
>>>>>>>>>> a group-by, the record I am processing was created from
>>>>>>>>>> multiple offsets.
>>>>>>>>>>
>>>>>>>>>> The API design doesn't look appealing.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 01.11.2017 22:02, Jeyhun Karimov wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear community,
>>>>>>>>>>>
>>>>>>>>>>> It seems the discussion for KIP-159 [1] finally converged.
>>>>>>>>>>> I would like to initiate voting for the particular KIP.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-159%3A+Introducing+Rich+functions+to+Streams
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Jeyhun
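[Editor's note: the transform() fallback Matthias mentions earlier in the thread can be illustrated with simplified stand-in types (not the real Kafka Streams interfaces). To write a filter that looks at record metadata today, users drop down to a transformer that reads the metadata from the context and returns null to drop a record:]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the transform()-as-filter workaround; the types
// here are illustrative only, not the real Kafka Streams Transformer API.
public class TransformFilterSketch {

    record RecordContext(String topic, long offset) {}

    // Stand-in for a Transformer: a null result means "filter the record out".
    interface Transformer<V> {
        V transform(V value, RecordContext context);
    }

    static <V> List<V> run(List<V> values, List<RecordContext> contexts, Transformer<V> t) {
        List<V> out = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            V v = t.transform(values.get(i), contexts.get(i));
            if (v != null) out.add(v);          // drop records the transformer nulled out
        }
        return out;
    }

    public static void main(String[] args) {
        // A "filter" that keeps only records read from topic "important":
        Transformer<String> metadataFilter =
                (v, ctx) -> ctx.topic().equals("important") ? v : null;

        List<String> kept = run(
                List.of("a", "b", "c"),
                List.of(new RecordContext("important", 1),
                        new RecordContext("other", 2),
                        new RecordContext("important", 3)),
                metadataFilter);

        System.out.println(kept); // [a, c]
    }
}
```

[A rich filter() overload with RecordContext access would express the same intent directly, without the null-return convention.]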