Re: [DISCUSS] KIP-213 Support non-key joining in KTable

Matthias J. Sax Mon, 06 Nov 2017 15:11:05 -0800

It very useful! Thanks a lot! I can follow now whats going on :)

One clarification question: Second input is for example


key: B0 : value [A2,B0 ...]

The second B0 in the value is actually irrelevant, right? The partition
key A2B0 is build from B0 from the key, right?

-Matthias

On 11/6/17 10:46 PM, Jan Filipiak wrote:
> Going to put all possible szenarios. Just wanted to layout the table and
> ask if its usefull and understandable first
> 
> On 06.11.2017 22:33, Ted Yu wrote:
>> bq. Update in A delete in A update in B delete in B
>>
>> Are you going to fill in the above scenario (currently blank) ?
>>
>> On Mon, Nov 6, 2017 at 12:31 PM, Jan Filipiak <jan.filip...@trivago.com>
>> wrote:
>>
>>> I created an example Table in the WIKI page
>>> Can you quickly check if that would be a good format?
>>> I tried todo it ~like the unit tests but with the information of what
>>> state is there _AFTER_
>>> processing happend.
>>> I make the first 2 columns exclusive even though the in fact run in
>>> parallel but the joining
>>> task serializes the effects.
>>>
>>> Best Jan
>>>
>>> On 06.11.2017 21:20, Jan Filipiak wrote:
>>>
>>>> Will do! Need to do it carefully. One mistake in this detailed approach
>>>> and confusion is perfect ;)
>>>> Hope I can deliver this week.
>>>>
>>>> Best Jan
>>>>
>>>>
>>>> On 06.11.2017 17:21, Matthias J. Sax wrote:
>>>>
>>>>> Jan,
>>>>>
>>>>> thanks a lot for this KIP. I did an initial pass over it, but feel a
>>>>> little lost. Maybe I need to read it more carefully, but atm it's not
>>>>> clear to me at all what algorithm you propose.
>>>>>
>>>>> I think it would be super helpful, to do an example with concrete data
>>>>> that show how records are stored, what the different value mappers
>>>>> extract, and what is written into repartitioning topics.
>>>>>
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>> On 11/5/17 2:09 AM, Jan Filipiak wrote:
>>>>>
>>>>>> Hi Gouzhang
>>>>>>
>>>>>> I hope the wikipage looks better now. made a little more effort
>>>>>> into the
>>>>>> diagram. Still not ideal but I think it serves its purpose.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02.11.2017 01:17, Guozhang Wang wrote:
>>>>>>
>>>>>>> Thanks for the KIP writeup Jan. I made a first pass and here are
>>>>>>> some
>>>>>>> quick
>>>>>>> comments:
>>>>>>>
>>>>>>>
>>>>>>> 1. Could we use K0 / V0 and K1 / V1 etc since K0 and KO are a bit
>>>>>>> harder to
>>>>>>> differentiate when reading.
>>>>>>>
>>>>>>> 2. I think you missed the key type in the intrusive approach example
>>>>>>> code
>>>>>>> snippet regarding "KTable <V0> oneToManyJoin"? Should that be
>>>>>>>
>>>>>>> KTable<CombinedKey<K,KO>, V0> oneToManyJoin
>>>>>>>
>>>>>>> 3. Some of the arrows in your algorithm section's diagrams seems
>>>>>>> reversed.
>>>>>>>
>>>>>>> 4. In the first step of the algorithm, "Materialize B first", that
>>>>>>> happens
>>>>>>> in the "Repartition by A's key" block right? If yes, could you
>>>>>>> clarify
>>>>>>> it
>>>>>>> in the block?
>>>>>>>
>>>>>>> 5. "skip old if A's key didn't change": hmm, not sure if we can skip
>>>>>>> it.
>>>>>>> What if other fields (neither A's key or B's key) changes?
>>>>>>> Suppose you
>>>>>>> have
>>>>>>> an aggregation after the join, we still need to subtract the old
>>>>>>> value
>>>>>>> from
>>>>>>> the aggregation right?
>>>>>>>
>>>>>>> 6. In the block of "Materialize B", I think from your description we
>>>>>>> are
>>>>>>> actually materializing both A and B right? If yes could you
>>>>>>> update the
>>>>>>> diagram?
>>>>>>>
>>>>>>> 7. This is a meta question: "in the sink, only use A's key to
>>>>>>> determine
>>>>>>> partition" I think we had the discussion long time ago, that if
>>>>>>> we are
>>>>>>> sending the old and new entries of the pair to different partitions,
>>>>>>> their
>>>>>>> ordering may get reversed later when reading from the join operator
>>>>>>> (i.e.
>>>>>>> the "Materialize B" block in your diagram). How did you address that
>>>>>>> with
>>>>>>> this proposal?
>>>>>>>
>>>>>>> 8. "B records with a 'null' A-key value would be silently dropped"
>>>>>>> Where
>>>>>>> are we dropping it, do we drop it at the first sub-topology (i.e the
>>>>>>> "Repartition by A's key" block)?
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 1, 2017 at 12:18 PM, Jan Filipiak <
>>>>>>> jan.filip...@trivago.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi thanks for the feedback
>>>>>>>> On 01.11.2017 12:58, Damian Guy wrote:
>>>>>>>>
>>>>>>>> Hi Jan, Thanks for the KIP!
>>>>>>>>> In both alternatives the API will need to use the `Joined` class
>>>>>>>>> rather
>>>>>>>>> than than passing in `Serde`s. Also, as with all other joins etc,
>>>>>>>>> there
>>>>>>>>> probably should be an overload that doesn't require any `Serdes`.
>>>>>>>>>
>>>>>>>>> Will check again how current API looks. I remember loosing the
>>>>>>>> argument
>>>>>>>> with this IQ overloads things.
>>>>>>>> Didn't expect something to have happend already so I just copied
>>>>>>>> from
>>>>>>>> the
>>>>>>>> PR. Will update.
>>>>>>>> Will also add the overload.
>>>>>>>>
>>>>>>>> It isn't clear to me what `joinPrefixFaker` is doing? In the
>>>>>>>> comment
>>>>>>>>> it
>>>>>>>>> says "returning an outputKey that when serialized only produces a
>>>>>>>>> prefix
>>>>>>>>> of
>>>>>>>>> the output key which is the same serializing K" So why not just
>>>>>>>>> use
>>>>>>>>> "K" ?
>>>>>>>>>
>>>>>>>>> The faker in fact returns K wich can be serialized by the Key
>>>>>>>>> Serde
>>>>>>>> in the
>>>>>>>> rocks. But it needs to only contain A's key and it needs to be a
>>>>>>>> strict
>>>>>>>> prefix
>>>>>>>> byte[] of all K with this A's key. We gonna seek there with an
>>>>>>>> RocksIterator and continue to read as long as the "faked key"
>>>>>>>> serialized
>>>>>>>> form is a prefix
>>>>>>>> This is easy todo for Avro + Protobuf +  custom Serdes and Hadoop
>>>>>>>> Writables. Its a nightmare for JSON serdes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>> Damian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 27 Oct 2017 at 10:27 Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I think if you explain what A and B are in the beginning, it makes
>>>>>>>>> sense
>>>>>>>>>
>>>>>>>>>> to
>>>>>>>>>> use them since readers would know who they reference.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 26, 2017 at 11:04 PM, Jan Filipiak
>>>>>>>>>> <jan.filip...@trivago.com
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for the remarks. hope I didn't miss any.
>>>>>>>>>>> Not even sure if it makes sense to introduce A and B or just
>>>>>>>>>>> stick
>>>>>>>>>>> with
>>>>>>>>>>> "this ktable", "other ktable"
>>>>>>>>>>>
>>>>>>>>>>> Thank you
>>>>>>>>>>> Jan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 27.10.2017 06:58, Ted Yu wrote:
>>>>>>>>>>>
>>>>>>>>>>> Do you mind addressing my previous comments ?
>>>>>>>>>>>
>>>>>>>>>>>> http://search-hadoop.com/m/Kafka/uyzND1hzF8SRzUqb?subj=Re+
>>>>>>>>>>>> DISCUSS+KIP+213+Support+non+key+joining+in+KTable
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Oct 26, 2017 at 9:38 PM, Jan Filipiak <
>>>>>>>>>>>> jan.filip...@trivago.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> this is the new discussion thread after the ID-clash.
>>>>>>>>>>>>> Best
>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>
>>>>>>>>>>>>> ______
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Kafka-users,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I want to continue with the development of KAFKA-3705, which
>>>>>>>>>>>>> allows
>>>>>>>>>>>>> the
>>>>>>>>>>>>> Streams DSL to perform KTableKTable-Joins when the KTables
>>>>>>>>>>>>> have a
>>>>>>>>>>>>> one-to-many relationship.
>>>>>>>>>>>>> To make sure we cover the requirements of as many users as
>>>>>>>>>>>>> possible
>>>>>>>>>>>>> and
>>>>>>>>>>>>> have a good solution afterwards I invite everyone to read
>>>>>>>>>>>>> through the
>>>>>>>>>>>>> KIP I
>>>>>>>>>>>>> put together and discuss it here in this Thread.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+
>>>>>>>>>>>>> Support+non-key+joining+in+KTable
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-3705
>>>>>>>>>>>>> https://github.com/apache/kafka/pull/3720
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think a public discussion and vote on a solution is exactly
>>>>>>>>>>>>> what is
>>>>>>>>>>>>> needed to bring this feauture into kafka-streams. I am looking
>>>>>>>>>>>>> forward
>>>>>>>>>>>>>
>>>>>>>>>>>>> to
>>>>>>>>>>> everyones opinion!
>>>>>>>>>>>
>>>>>>>>>>>> Please keep the discussion on the mailing list rather than
>>>>>>>>>>>>> commenting
>>>>>>>>>>>>>
>>>>>>>>>>>>> on
>>>>>>>>>>> the wiki (wiki discussions get unwieldy fast).
>>>>>>>>>>>
>>>>>>>>>>>> Best
>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-213 Support non-key joining in KTable

Reply via email to