Re: [DISCUSS] KIP-213 Support non-key joining in KTable

Guozhang Wang Wed, 01 Nov 2017 17:18:06 -0700

Thanks for the KIP writeup Jan. I made a first pass and here are some quick
comments:



1. Could we use K0 / V0 and K1 / V1 etc since K0 and KO are a bit harder to
differentiate when reading.

2. I think you missed the key type in the intrusive approach example code
snippet regarding "KTable <V0> oneToManyJoin"? Should that be

KTable<CombinedKey<K,KO>, V0> oneToManyJoin

3. Some of the arrows in your algorithm section's diagrams seems reversed.

4. In the first step of the algorithm, "Materialize B first", that happens
in the "Repartition by A's key" block right? If yes, could you clarify it
in the block?

5. "skip old if A's key didn't change": hmm, not sure if we can skip it.
What if other fields (neither A's key or B's key) changes? Suppose you have
an aggregation after the join, we still need to subtract the old value from
the aggregation right?

6. In the block of "Materialize B", I think from your description we are
actually materializing both A and B right? If yes could you update the
diagram?

7. This is a meta question: "in the sink, only use A's key to determine
partition" I think we had the discussion long time ago, that if we are
sending the old and new entries of the pair to different partitions, their
ordering may get reversed later when reading from the join operator (i.e.
the "Materialize B" block in your diagram). How did you address that with
this proposal?

8. "B records with a 'null' A-key value would be silently dropped" Where
are we dropping it, do we drop it at the first sub-topology (i.e the
"Repartition by A's key" block)?

Guozhang





On Wed, Nov 1, 2017 at 12:18 PM, Jan Filipiak <[email protected]>
wrote:

> Hi thanks for the feedback
>
> On 01.11.2017 12:58, Damian Guy wrote:
>
>> Hi Jan, Thanks for the KIP!
>>
>> In both alternatives the API will need to use the `Joined` class rather
>> than than passing in `Serde`s. Also, as with all other joins etc, there
>> probably should be an overload that doesn't require any `Serdes`.
>>
> Will check again how current API looks. I remember loosing the argument
> with this IQ overloads things.
> Didn't expect something to have happend already so I just copied from the
> PR. Will update.
> Will also add the overload.
>
>>
>> It isn't clear to me what `joinPrefixFaker` is doing? In the comment it
>> says "returning an outputKey that when serialized only produces a prefix
>> of
>> the output key which is the same serializing K" So why not just use "K" ?
>>
> The faker in fact returns K wich can be serialized by the Key Serde in the
> rocks. But it needs to only contain A's key and it needs to be a strict
> prefix
> byte[] of all K with this A's key. We gonna seek there with an
> RocksIterator and continue to read as long as the "faked key" serialized
> form is a prefix
> This is easy todo for Avro + Protobuf +  custom Serdes and Hadoop
> Writables. Its a nightmare for JSON serdes.
>
>
>
>> Thanks,
>> Damian
>>
>>
>> On Fri, 27 Oct 2017 at 10:27 Ted Yu <[email protected]> wrote:
>>
>> I think if you explain what A and B are in the beginning, it makes sense
>>> to
>>> use them since readers would know who they reference.
>>>
>>> Cheers
>>>
>>> On Thu, Oct 26, 2017 at 11:04 PM, Jan Filipiak <[email protected]
>>> >
>>> wrote:
>>>
>>>
>>>> Thanks for the remarks. hope I didn't miss any.
>>>> Not even sure if it makes sense to introduce A and B or just stick with
>>>> "this ktable", "other ktable"
>>>>
>>>> Thank you
>>>> Jan
>>>>
>>>>
>>>> On 27.10.2017 06:58, Ted Yu wrote:
>>>>
>>>> Do you mind addressing my previous comments ?
>>>>>
>>>>> http://search-hadoop.com/m/Kafka/uyzND1hzF8SRzUqb?subj=Re+
>>>>> DISCUSS+KIP+213+Support+non+key+joining+in+KTable
>>>>>
>>>>> On Thu, Oct 26, 2017 at 9:38 PM, Jan Filipiak <
>>>>> [email protected]
>>>>> wrote:
>>>>>
>>>>> Hello everyone,
>>>>>
>>>>>> this is the new discussion thread after the ID-clash.
>>>>>>
>>>>>> Best
>>>>>> Jan
>>>>>>
>>>>>> ______
>>>>>>
>>>>>>
>>>>>> Hello Kafka-users,
>>>>>>
>>>>>> I want to continue with the development of KAFKA-3705, which allows
>>>>>> the
>>>>>> Streams DSL to perform KTableKTable-Joins when the KTables have a
>>>>>> one-to-many relationship.
>>>>>> To make sure we cover the requirements of as many users as possible
>>>>>> and
>>>>>> have a good solution afterwards I invite everyone to read through the
>>>>>> KIP I
>>>>>> put together and discuss it here in this Thread.
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+
>>>>>> Support+non-key+joining+in+KTable
>>>>>> https://issues.apache.org/jira/browse/KAFKA-3705
>>>>>> https://github.com/apache/kafka/pull/3720
>>>>>>
>>>>>> I think a public discussion and vote on a solution is exactly what is
>>>>>> needed to bring this feauture into kafka-streams. I am looking forward
>>>>>>
>>>>> to
>>>
>>>> everyones opinion!
>>>>>>
>>>>>> Please keep the discussion on the mailing list rather than commenting
>>>>>>
>>>>> on
>>>
>>>> the wiki (wiki discussions get unwieldy fast).
>>>>>>
>>>>>> Best
>>>>>> Jan
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

Reply via email to