Just list what each thing is:

K0: key type of first/this table
K1: key type of second/other table
KO: key type of result table (concatenation of both input keys, <K1-K0>)
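Or even sketch just the generics (all names here are placeholders for
illustration, not the proposed API):

    import java.util.function.BiFunction;
    import java.util.function.Function;

    // K0/V0: key/value of the first/this table
    interface TableSketch<K0, V0> {
        // K1/V1: key/value of the second/other table
        // KO/VO: key/value of the result table; KO concatenates both
        // input keys, <K1-K0>
        <K1, V1, KO, VO> TableSketch<KO, VO> oneToManyJoin(
                TableSketch<K1, V1> other,
                Function<V1, K0> keyExtractor,   // other value -> this table's key
                BiFunction<V0, V1, VO> joiner);  // merges the two values
    }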
something like this (not sure if the example above is correct---it's just
for illustration)

-Matthias

On 11/18/17 2:30 PM, Jan Filipiak wrote:
> -> I think the relationships between the different used types, K0, K1, KO
> should be explained explicitly (all the information is there implicitly,
> but one needs to think hard to figure it out)
>
>
> I'm probably blind for this. Can you help me here? How would you
> formulate this?
>
> Thanks,
>
> Jan
>
>
> On 16.11.2017 23:18, Matthias J. Sax wrote:
>> Hi,
>>
>> I am just catching up on this discussion and did re-read the KIP and
>> discussion thread.
>>
>> In contrast to you, I prefer the second approach with CombinedKey as
>> the return type, for the following reasons:
>>
>> 1) the oneToManyJoin() method has fewer parameters
>> 2) those parameters are easy to understand
>> 3) we hide implementation details (joinPrefixFaker, leftKeyExtractor,
>> and the return type KO leak internal implementation details, from my
>> point of view)
>> 4) users can get their own KO type by extending the CombinedKey
>> interface (this would also address the nesting issue Trevor pointed out)
>>
>> What's unclear to me is why you care about JSON serdes. What is the
>> problem with regard to the prefix? It seems I am missing something here.
>>
>> I also don't understand the argument about "the user can stick with his
>> default serde or his standard way of serializing". If we have
>> CombinedKey as output, the user just provides the serdes for both input
>> key types individually, and we can reuse both internally to do the
>> rest. This seems to be a way simpler API. With the KO output type
>> approach, users need to write an entirely new serde for KO in contrast.
>>
>> Finally, @Jan, there are still some open comments you did not address,
>> and the KIP wiki page needs some updates. It would be great if you
>> could do this.
>>
>> Can you also explicitly describe the data layout of the store that is
>> used to do the range scans?
>>
>> Additionally:
>>
>> -> some arrows in the algorithm diagram are missing
>> -> what are those XXX in the diagram?
>> -> can you finish the "Step by Step" example?
>> -> I think the relationships between the different used types, K0, K1,
>> KO should be explained explicitly (all the information is there
>> implicitly, but one needs to think hard to figure it out)
>>
>>
>> Last but not least:
>>
>>> But no one is really interested.
>>
>> I don't understand this statement...
>>
>>
>>
>> -Matthias
>>
>>
>> On 11/16/17 9:05 AM, Jan Filipiak wrote:
>>> We are running this perfectly fine. For us the smaller table changes
>>> rather infrequently, say only a few times per day. The cost of the
>>> flush is way lower than the computing power you need to bring to the
>>> table to account for all the records being emitted after the one
>>> single update.
>>>
>>> On 16.11.2017 18:02, Trevor Huey wrote:
>>>> Ah, I think I see the problem now. Thanks for the explanation. That
>>>> is tricky. As you said, it seems the easiest solution would just be
>>>> to flush the cache. I wonder how big of a performance hit that'd be...
>>>>
>>>> On Thu, Nov 16, 2017 at 9:07 AM Jan Filipiak
>>>> <jan.filip...@trivago.com> wrote:
>>>>
>>>> Hi Trevor,
>>>>
>>>> I am leaning towards the less intrusive approach myself. In fact
>>>> that is how we implemented our internal API for this and how we
>>>> run it in production. Getting more voices towards this solution
>>>> makes me really happy.
>>>>
>>>> The reason it's a problem for prefix and not for range is the
>>>> following. Imagine the intrusive approach. The key of the RocksDB
>>>> store would be CombinedKey<A,B>; the prefix scan would take an A,
>>>> while the range scan would still take a CombinedKey<A,B>. As you
>>>> can see, with the intrusive approach the keys are actually
>>>> different types for different queries. With the less intrusive
>>>> approach we use the same type and rely on serde invariants. For us
>>>> this works nicely (protobuf); it might bite some JSON users.
>>>>
>>>> Hope this makes it clear.
>>>>
>>>> Best Jan
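To make the difference Jan describes concrete, the two shapes might look
roughly like this (a sketch assuming generic store interfaces; none of
these names exist in the PR):

    import java.util.Iterator;

    class CombinedKey<A, B> { A prefix; B suffix; }

    // intrusive: the key types differ per query
    interface IntrusiveStore<A, B, V> {
        Iterator<V> prefixScan(A prefix);              // takes a plain A
        Iterator<V> rangeScan(CombinedKey<A, B> from,
                              CombinedKey<A, B> to);   // takes a full CombinedKey
    }

    // less intrusive: one key type everywhere; correctness relies on the
    // serde invariant that serialize(key combining A and B) starts with
    // the bytes of serialize(A)
    interface LessIntrusiveStore<K, V> {
        Iterator<V> prefixScan(K prefixKey);
        Iterator<V> rangeScan(K from, K to);
    }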
>>>> On 16.11.2017 16:39, Trevor Huey wrote:
>>>>> 1. Going over KIP-213, I am leaning toward the "less intrusive"
>>>>> approach. In my use case, I am planning on performing a sequence
>>>>> of several oneToMany joins. From my understanding, the more
>>>>> intrusive approach would result in several nested levels of
>>>>> CombinedKeys. For example, consider tables A, B, C with
>>>>> corresponding keys KA, KB, KC. Joining A and B would produce
>>>>> CombinedKey<KA, KB>. Then joining that result on C would produce
>>>>> CombinedKey<KC, CombinedKey<KA, KB>>. My "keyOtherSerde" in this
>>>>> case would need to be capable of deserializing CombinedKey<KA,
>>>>> KB>. This would just get worse the more tables I join. I realize
>>>>> that it's easier to shoot yourself in the foot with the less
>>>>> intrusive approach, but as you said, "the user can stick with
>>>>> his default serde or his standard way of serializing". In the
>>>>> simplest case where the keys are just strings, they can do simple
>>>>> string concatenation and Serdes.String(). It also allows the user
>>>>> to create and use their own version of CombinedKey if they feel
>>>>> so inclined.
>>>>>
>>>>> 2. Why is there a problem for prefix, but not for range?
>>>>> https://github.com/apache/kafka/pull/3720/files#diff-8f863b74c3c5a0b989e89d00c149aef1L162
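Trevor's nesting concern from point 1, spelled out as types (a sketch;
these classes are stand-ins, not the PR's code):

    // stand-in definitions, just to show the nesting
    class CombinedKey<K1, K2> { K1 left; K2 right; }
    class KA {} class KB {} class KC {}

    class NestingIllustration {
        CombinedKey<KA, KB> afterJoiningAandB;
        // the "keyOtherSerde" for the second join must already understand
        // CombinedKey<KA, KB>, and every further join nests one level deeper:
        CombinedKey<KC, CombinedKey<KA, KB>> afterJoiningWithC;
    }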
>>>>> On Thu, Nov 16, 2017 at 2:57 AM Jan Filipiak
>>>>> <jan.filip...@trivago.com> wrote:
>>>>>
>>>>> Hi Trevor,
>>>>>
>>>>> thank you very much for your interest. To keep the discussion
>>>>> focused on the mailing list and not Jira or Confluence, I decided
>>>>> to reply here.
>>>>>
>>>>> 1. It's tricky. Activity is indeed very low. In KIP-213 there are
>>>>> 2 proposals about the return type of the join, and I would like to
>>>>> settle on one.
>>>>> Unfortunately it's controversial, and I don't want to have the
>>>>> discussion after I have settled on one way and implemented it. But
>>>>> no one is really interested.
>>>>> So discussing with YOU what your preferred return type would look
>>>>> like would be very helpful already.
>>>>>
>>>>> 2.
>>>>> The most difficult part is implementing
>>>>> this
>>>>> https://github.com/apache/kafka/pull/3720/files#diff-ac41b4dfb9fc6bb707d966477317783cR68
>>>>> here
>>>>> https://github.com/apache/kafka/pull/3720/files#diff-8f863b74c3c5a0b989e89d00c149aef1R244
>>>>> and here
>>>>> https://github.com/apache/kafka/pull/3720/files#diff-b1a1281dce5219fd0cb5afad380d9438R207
>>>>>
>>>>> One can get an easy shot by just flushing the underlying RocksDB
>>>>> and using RocksDB for the range scan.
>>>>> But as you can see, the implementation depends on the API.
>>>>> Depending on which way the API discussion goes, I would implement
>>>>> this differently.
>>>>>
>>>>> 3.
>>>>> I only have so much time to work on this. I filed the KIP because
>>>>> I want to pull it through, and I am pretty confident that I can
>>>>> do it.
>>>>> But I am still waiting for the full discussion to happen on this.
>>>>> To move the discussion forward, it seems that I need to fill out
>>>>> the table in the KIP entirely (the one describing the events,
>>>>> change modifications and output). Feel free to continue the
>>>>> discussion w/o the table. I want to finish the table during next
>>>>> week.
>>>>>
>>>>> Best Jan, thank you for your interest!
>>>>>
>>>>> _____ Jira Quote ______
>>>>>
>>>>> Jan Filipiak
>>>>> <https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jfilipiak>
>>>>>
>>>>> Please bear with me while I try to get caught up. I'm not yet
>>>>> familiar with the Kafka code base. I have a few questions to
>>>>> try to figure out how I can get involved:
>>>>> 1. It seems like we need to get buy-in on your KIP-213? It
>>>>> doesn't seem like there's been much activity on it besides
>>>>> yourself in a while. What's your current plan of attack for
>>>>> getting that approved?
>>>>> 2. I know you said that the most difficult part is yet to be
>>>>> done. Is there some code you can point me toward so I can
>>>>> start digging in and better understand why this is so
>>>>> difficult?
>>>>> 3. This issue has been open since May '16. How far out do you
>>>>> think we are from getting this implemented?
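One more illustration for the serde point above: with CombinedKey as the
return type, the combined-key serde can be composed from the two input key
serdes, and writing the first key's bytes first is exactly what enables the
prefix scan. A rough sketch (assumed layout and names, not the PR's
implementation):

    import java.nio.ByteBuffer;

    // assumed minimal serializer abstraction, just for this sketch
    interface Ser<T> { byte[] serialize(T t); }

    class CombinedKeyPair<A, B> {
        final A a; final B b;
        CombinedKeyPair(A a, B b) { this.a = a; this.b = b; }
    }

    class CombinedKeySer<A, B> implements Ser<CombinedKeyPair<A, B>> {
        private final Ser<A> aSer;
        private final Ser<B> bSer;

        CombinedKeySer(Ser<A> aSer, Ser<B> bSer) {
            this.aSer = aSer;
            this.bSer = bSer;
        }

        @Override
        public byte[] serialize(CombinedKeyPair<A, B> key) {
            byte[] a = aSer.serialize(key.a);
            byte[] b = bSer.serialize(key.b);
            // A's bytes come first, so every combined key with the same A
            // shares the byte prefix serialize(a); the prefix scan can seek
            // to that prefix directly. Splitting the key back apart needs a
            // self-delimiting or fixed-width A serde (protobuf-style), which
            // is the invariant that may bite free-form JSON keys.
            return ByteBuffer.allocate(a.length + b.length)
                    .put(a).put(b).array();
        }
    }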