Yes. But I think an aggregation is an many-to-one operation, too. For the stripping off part: internally, we can just keep some record context, but just do not allow users to access it (because the context context does not make sense for them) by hiding the corresponding APIs.
-Matthias On 11/16/17 10:05 PM, Guozhang Wang wrote: > Matthias, > > For this idea, are your proposing that for any many-to-one mapping > operations (for now only Join operators), we will strip off the record > context in the resulted records and claim "we cannot infer its traced > context anymore"? > > > Guozhang > > > On Thu, Nov 16, 2017 at 1:03 PM, Matthias J. Sax <matth...@confluent.io> > wrote: > >> Any thoughts about my latest proposal? >> >> -Matthias >> >> On 11/10/17 10:02 PM, Jan Filipiak wrote: >>> Hi, >>> >>> i think this is the better way. Naming is always tricky Source is kinda >>> taken >>> I had TopicBackedK[Source|Table] in mind >>> but for the user its way better already IMHO >>> >>> Thank you for reconsideration >>> >>> Best Jan >>> >>> >>> On 10.11.2017 22:48, Matthias J. Sax wrote: >>>> I was thinking about the source stream/table idea once more and it seems >>>> it would not be too hard to implement: >>>> >>>> We add two new classes >>>> >>>> SourceKStream extends KStream >>>> >>>> and >>>> >>>> SourceKTable extend KTable >>>> >>>> and return both from StreamsBuilder#stream and StreamsBuilder#table >>>> >>>> As both are sub-classes, this change is backward compatible. We change >>>> the return type for any single-record transform to this new types, too, >>>> and use KStream/KTable as return type for any multi-record operation. >>>> >>>> The new RecordContext API is added to both new classes. For old classes, >>>> we only implement KIP-149 to get access to the key. >>>> >>>> >>>> WDYT? >>>> >>>> >>>> -Matthias >>>> >>>> On 11/9/17 9:13 PM, Jan Filipiak wrote: >>>>> Okay, >>>>> >>>>> looks like it would _at least work_ for Cached KTableSources . >>>>> But we make it harder to the user to make mistakes by putting >>>>> features into places where they don't make sense and don't >>>>> help anyone. >>>>> >>>>> I once again think that my suggestion is easier to implement and >>>>> more correct. I will use this email to express my disagreement with the >>>>> proposed KIP (-1 non binding of course) state that I am open for any >>>>> questions >>>>> regarding this. I will also do the usual thing and point out that the >>>>> friends >>>>> over at Hive got it correct aswell. >>>>> One can not user their >>>>> https://cwiki.apache.org/confluence/display/Hive/ >> LanguageManual+VirtualColumns >>>>> >>>>> >>>>> in any place where its not read from the Sources. >>>>> >>>>> With KSQl in mind it makes me sad how this is evolving here. >>>>> >>>>> Best Jan >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 10.11.2017 01:06, Guozhang Wang wrote: >>>>>> Hello Jan, >>>>>> >>>>>> Regarding your question about caching: today we keep the record >> context >>>>>> with the cached entry already so when we flush the cache which may >>>>>> generate >>>>>> new records forwarding we will set the record context appropriately; >>>>>> and >>>>>> then after the flush is completed we will reset the context to the >>>>>> record >>>>>> before the flush happens. But I think when Jeyhun did the PR it is a >>>>>> good >>>>>> time to double check on such stages to make sure we are not >>>>>> introducing any >>>>>> regressions. >>>>>> >>>>>> >>>>>> Guozhang >>>>>> >>>>>> >>>>>> On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak < >> jan.filip...@trivago.com> >>>>>> wrote: >>>>>> >>>>>>> I Aggree completely. >>>>>>> >>>>>>> Exposing this information in a place where it has no _natural_ >>>>>>> belonging >>>>>>> might really be a bad blocker in the long run. >>>>>>> >>>>>>> Concerning your first point. I would argue its not to hard to have a >>>>>>> user >>>>>>> keep track of these. If we still don't want the user >>>>>>> to keep track of these I would argue that all > projection only < >>>>>>> transformations on a Source-backed KTable/KStream >>>>>>> could also return a Ktable/KStream instance of the type we return >>>>>>> from the >>>>>>> topology builder. >>>>>>> Only after any operation that exceeds projection or filter one would >>>>>>> return a KTable not granting access to this any longer. >>>>>>> >>>>>>> Even then its difficult already: I never ran a topology with caching >>>>>>> but I >>>>>>> am not even 100% sure what the record Context means behind >>>>>>> a materialized KTable with Caching? Topic and Partition are probably >>>>>>> with >>>>>>> some reasoning but offset is probably only the offset causing the >>>>>>> flush? >>>>>>> So one might aswell think to drop offsets from this RecordContext. >>>>>>> >>>>>>> Best Jan >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 07.11.2017 03:18, Guozhang Wang wrote: >>>>>>> >>>>>>>> Regarding the API design (the proposed set of overloads v.s. one >>>>>>>> overload >>>>>>>> on #map to enrich the record), I think what we have represents a >> good >>>>>>>> trade-off between API succinctness and user convenience: on one >>>>>>>> hand we >>>>>>>> definitely want to keep as fewer overloaded functions as possible. >>>>>>>> But on >>>>>>>> the other hand if we only do that in, say, the #map() function then >>>>>>>> this >>>>>>>> enrichment could be an overkill: think of a topology that has 7 >>>>>>>> operators >>>>>>>> in a chain, where users want to access the record context on >>>>>>>> operator #2 >>>>>>>> and #6 only, with the "enrichment" manner they need to do the >>>>>>>> enrichment >>>>>>>> on >>>>>>>> operator #2 and keep it that way until #6. In addition, the >>>>>>>> RecordContext >>>>>>>> fields (topic, offset, etc) are really orthogonal to the key-value >>>>>>>> payloads >>>>>>>> themselves, so I think separating them into this object is a cleaner >>>>>>>> way. >>>>>>>> >>>>>>>> Regarding the RecordContext inheritance, this is actually a good >>>>>>>> point >>>>>>>> that >>>>>>>> have not been discussed thoroughly before. Here are my my two cents: >>>>>>>> one >>>>>>>> natural way would be to inherit the record context from the >>>>>>>> "triggering" >>>>>>>> record, for example in a join operator, if the record from stream A >>>>>>>> triggers the join then the record context is inherited from with >> that >>>>>>>> record. This is also aligned with the lower-level PAPI interface. A >>>>>>>> counter >>>>>>>> argument, though, would be that this is sort of leaking the internal >>>>>>>> implementations of the DSL, so that moving forward if we did some >>>>>>>> refactoring to our join implementations so that the triggering >>>>>>>> record can >>>>>>>> change, the RecordContext would also be different. I do not know how >>>>>>>> much >>>>>>>> it would really affect end users, but would like to hear your >>>>>>>> opinions. >>>>>>>> >>>>>>> Agreed to 100% exposing this information >>>>>>> >>>>>>> >>>>>>>> Guozhang >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov < >> je.kari...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hi Jan, >>>>>>>>> Sorry for late reply. >>>>>>>>> >>>>>>>>> >>>>>>>>> The API Design doesn't look appealing >>>>>>>>> >>>>>>>>> >>>>>>>>> In terms of API design we tried to preserve the java functional >>>>>>>>> interfaces. >>>>>>>>> We applied the same set of rich methods for KTable to make it >>>>>>>>> compatible >>>>>>>>> with the rest of overloaded APIs. >>>>>>>>> >>>>>>>>> It should be 100% sufficient to offer a KTable + KStream that is >>>>>>>>> directly >>>>>>>>> >>>>>>>>>> feed from a topic with 1 additional overload for the #map() >>>>>>>>>> methods to >>>>>>>>>> cover every usecase while keeping the API in a way better state. >>>>>>>>>> >>>>>>>>> - IMO this seems a workaround, rather than a direct solution. >>>>>>>>> >>>>>>>>> Perhaps we should continue this discussion in DISCUSS thread. >>>>>>>>> >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Jeyhun >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak >>>>>>>>> <jan.filip...@trivago.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi. >>>>>>>>>> I do understand that it might come in Handy. >>>>>>>>>> From my POV in any relational algebra this is only a >>>>>>>>>> projection. >>>>>>>>>> Currently we hide these "fields" that come with the input record. >>>>>>>>>> It should be 100% sufficient to offer a KTable + KStream that is >>>>>>>>>> directly >>>>>>>>>> feed from a topic with 1 additional overload for the #map() >>>>>>>>>> methods to >>>>>>>>>> cover every usecase while keeping the API in a way better state. >>>>>>>>>> >>>>>>>>>> best Jan >>>>>>>>>> >>>>>>>>>> On 06.11.2017 17:52, Matthias J. Sax wrote: >>>>>>>>>> >>>>>>>>>>> Jan, >>>>>>>>>>> >>>>>>>>>>> I understand what you are saying. However, having a >>>>>>>>>>> RecordContext is >>>>>>>>>>> super useful for operations that are applied to input topic. Many >>>>>>>>>>> users >>>>>>>>>>> requested this feature -- it's much more convenient that falling >>>>>>>>>>> back >>>>>>>>>>> >>>>>>>>>> to >>>>>>>>>> transform() to implement a a filter() for example that want to >>>>>>>>>> access >>>>>>>>>>> some meta data. >>>>>>>>>>> >>>>>>>>>>> Because we cannot distinguish different "origins" of a >>>>>>>>>>> KStream/KTable, >>>>>>>>>>> >>>>>>>>>> I >>>>>>>>>> am not sure if there would be a better way to do this. The only >>>>>>>>>>> "workaround" I see, is to have two KStream/KTable interfaces each >>>>>>>>>>> and >>>>>>>>>>> >>>>>>>>>> we >>>>>>>>>> would use the first one for KStream/KTable with "proper" >>>>>>>>>> RecordContext. >>>>>>>>>>> But this does not seem to be a good solution either. >>>>>>>>>>> >>>>>>>>>>> Note, a KTable can also be read directly from a topic, I agree >>>>>>>>>>> that >>>>>>>>>>> using RecordContext on a KTable that is the result of an >>>>>>>>>>> aggregation is >>>>>>>>>>> questionable. But I don't see a reason to down vote the KIP for >>>>>>>>>>> this >>>>>>>>>>> >>>>>>>>>> reason. >>>>>>>>>> >>>>>>>>>>> WDYT about this? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -Matthias >>>>>>>>>>> >>>>>>>>>>> On 11/1/17 10:19 PM, Jan Filipiak wrote: >>>>>>>>>>> >>>>>>>>>>>> -1 non binding >>>>>>>>>>>> >>>>>>>>>>>> I don't get the motivation. >>>>>>>>>>>> In 80% of my DSL processors there is no such thing as a >>>>>>>>>>>> reasonable >>>>>>>>>>>> RecordContext. >>>>>>>>>>>> After a join the record I am processing belongs to at least 2 >>>>>>>>>>>> topics. >>>>>>>>>>>> After a Group by the record I am processing was created from >>>>>>>>>>>> multiple >>>>>>>>>>>> offsets. >>>>>>>>>>>> >>>>>>>>>>>> The API Design doesn't look appealing >>>>>>>>>>>> >>>>>>>>>>>> Best Jan >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 01.11.2017 22:02, Jeyhun Karimov wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Dear community, >>>>>>>>>>>>> >>>>>>>>>>>>> It seems the discussion for KIP-159 [1] converged finally. I >>>>>>>>>>>>> would >>>>>>>>>>>>> like to >>>>>>>>>>>>> initiate voting for the particular KIP. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> [1] >>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP- >>>>>>>>>>>>> >>>>>>>>>>>> 159%3A+Introducing+Rich+functions+to+Streams >>>>>>>>>>>>> Cheers, >>>>>>>>>>>>> Jeyhun >>>>>>>>>>>>> >>>>>>>>>>>>> >>> >> >> > >
signature.asc
Description: OpenPGP digital signature