Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

Guozhang Wang Wed, 29 Aug 2018 09:43:46 -0700

Hi Adam,

What do you mean by "additional comments on the workflow.", do you mean to
let other review your PR https://github.com/apache/kafka/pull/5527 ? Is is
ready for reviews?



Guozhang

On Tue, Aug 28, 2018 at 5:00 AM, Adam Bellemare <[email protected]>
wrote:

> Okay, I will implement John's suggestion of namespacing the external
> headers prior to processing, and then removing the namespacing prior to
> emitting. A potential future KIP could be to provide this namespacing
> automatically.
>
> I would also appreciate any other additional comments on the workflow. My
> goal is suss out agreement prior to moving to a vote.
>
> On Mon, Aug 27, 2018 at 3:19 PM, Guozhang Wang <[email protected]> wrote:
>
> > I like John's idea as well: for this KIP specifically as we do not expect
> > any other consumers to read the repartition topics externally, we can
> > slightly prefix the header to be safe, while keeping the additional cost
> > (note the header field is per-record, so any additional byte is
> per-record
> > as well) low.
> >
> >
> > Guozhang
> >
> > On Tue, Aug 21, 2018 at 11:58 AM, Adam Bellemare <
> [email protected]
> > >
> > wrote:
> >
> > > Hi John
> > >
> > > That is an excellent idea. The header usage I propose would be limited
> > > entirely to internal topics, and this could very well be the solution
> to
> > > potential conflicts. If we do not officially reserve a prefix "__"
> then I
> > > think this would be the safest idea, as it would entirely avoid any
> > > accidents (perhaps if a company is using its own "__" prefix for other
> > > reasons).
> > >
> > > Thanks
> > >
> > > Adam
> > >
> > >
> > > On Tue, Aug 21, 2018 at 2:24 PM, John Roesler <[email protected]>
> wrote:
> > >
> > > > Just a quick thought regarding headers:
> > > > > I think there is no absolute-safe ways to avoid conflicts, but we
> can
> > > > still
> > > > > consider using some name patterns to reduce the likelihood as much
> as
> > > > > possible.. e.g. consider sth. like the internal topics naming: e.g.
> > > > > "__internal_[name]"?
> > > >
> > > > I think there is a safe way to avoid conflicts, since these headers
> are
> > > > only needed in internal topics (I think):
> > > > For internal and changelog topics, we can namespace all headers:
> > > > * user-defined headers are namespaced as "external." + headerKey
> > > > * internal headers are namespaced as "internal." + headerKey
> > > >
> > > > This is a lot of characters, so we could use a sigil instead (e.g.,
> "_"
> > > for
> > > > internal, "~" for external)
> > > >
> > > > We simply apply the namespacing when we read user headers from
> external
> > > > topics into the topology and then de-namespace them before we emit
> them
> > > to
> > > > an external topic (via "to" or "through").
> > > > Now, it is not possible to collide with user-defined headers.
> > > >
> > > > That said, I'd also be fine with just reserving "__" as a header
> prefix
> > > and
> > > > not worrying about collisions.
> > > >
> > > > Thanks for the KIP,
> > > > -John
> > > >
> > > > On Tue, Aug 21, 2018 at 9:48 AM Jan Filipiak <
> [email protected]
> > >
> > > > wrote:
> > > >
> > > > > Still havent completly grabbed it.
> > > > > sorry will read more
> > > > >
> > > > > On 17.08.2018 21:23, Jan Filipiak wrote:
> > > > > > Cool stuff.
> > > > > >
> > > > > > I made some random remarks. Did not touch the core of the
> algorithm
> > > > yet.
> > > > > >
> > > > > > Will do Monday 100%
> > > > > >
> > > > > > I don't see Interactive Queries :) like that!
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 17.08.2018 20:28, Adam Bellemare wrote:
> > > > > >> I have submitted a PR with my code against trunk:
> > > > > >> https://github.com/apache/kafka/pull/5527
> > > > > >>
> > > > > >> Do I continue on this thread or do we begin a new one for
> > > discussion?
> > > > > >>
> > > > > >> On Thu, Aug 16, 2018 at 1:40 AM, Jan Filipiak <
> > > > [email protected]
> > > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >>> even before message headers, the option for me always existed
> to
> > > > > >>> just wrap
> > > > > >>> the messages into my own custom envelop.
> > > > > >>> So I of course thought this through. One sentence in your last
> > > email
> > > > > >>> triggered all the thought process I put in the back then
> > > > > >>> again to design it in the, what i think is the "kafka-way". It
> > > ended
> > > > up
> > > > > >>> ranting a little about what happened in the past.
> > > > > >>>
> > > > > >>> I see plenty of colleagues of mine falling into traps in the
> API,
> > > > > >>> that I
> > > > > >>> did warn about in the 1.0 DSL rewrite. I have the same
> > > > > >>> feeling again. So I hope it gives you some insights into my
> > though
> > > > > >>> process. I am aware that since i never ported 213 to higher
> > > > > >>> streams version, I don't really have a steak here and
> initially I
> > > > > >>> didn't
> > > > > >>> feel like actually sending it. But maybe you can pull
> > > > > >>> something good from it.
> > > > > >>>
> > > > > >>>   Best jan
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On 15.08.2018 04:44, Adam Bellemare wrote:
> > > > > >>>
> > > > > >>>> @Jan
> > > > > >>>> Thanks Jan. I take it you mean "key-widening" somehow includes
> > > > > >>>> information
> > > > > >>>> about which record is processed first? I understand about a
> > > > > >>>> CombinedKey
> > > > > >>>> with both the Foreign and Primary key, but I don't see how you
> > > track
> > > > > >>>> ordering metadata in there unless you actually included a
> > metadata
> > > > > >>>> field
> > > > > >>>> in
> > > > > >>>> the key type as well.
> > > > > >>>>
> > > > > >>>> @Guozhang
> > > > > >>>> As Jan mentioned earlier, is Record Headers mean to strictly
> be
> > > > > >>>> used in
> > > > > >>>> just the user-space? It seems that it is possible that a
> > collision
> > > > > >>>> on the
> > > > > >>>> (key,value) tuple I wish to add to it could occur. For
> instance,
> > > if
> > > > I
> > > > > >>>> wanted to add a ("foreignKeyOffset",10) to the Headers but the
> > > user
> > > > > >>>> already
> > > > > >>>> specified their own header with the same key name, then it
> > appears
> > > > > >>>> there
> > > > > >>>> would be a collision. (This is one of the issues I brought up
> in
> > > > > >>>> the KIP).
> > > > > >>>>
> > > > > >>>> --------------------------------
> > > > > >>>>
> > > > > >>>> I will be posting a prototype PR against trunk within the next
> > day
> > > > > >>>> or two.
> > > > > >>>> One thing I need to point out is that my design very strictly
> > > wraps
> > > > > >>>> the
> > > > > >>>> entire foreignKeyJoin process entirely within the DSL
> function.
> > > > > >>>> There is
> > > > > >>>> no
> > > > > >>>> exposure of CombinedKeys or widened keys, nothing to resolve
> > with
> > > > > >>>> regards
> > > > > >>>> to out-of-order processing and no need for the DSL user to
> even
> > > know
> > > > > >>>> what's
> > > > > >>>> going on inside of the function. The code simply returns the
> > > > > >>>> results of
> > > > > >>>> the
> > > > > >>>> join, keyed by the original key. Currently my API mirrors
> > > > > >>>> identically the
> > > > > >>>> format of the data returned by the regular join function, and
> I
> > > > > >>>> believe
> > > > > >>>> that this is very useful to many users of the DSL. It is my
> > > > > >>>> understanding
> > > > > >>>> that one of the main design goals of the DSL is to provide
> > higher
> > > > > >>>> level
> > > > > >>>> functionality without requiring the users to know exactly
> what's
> > > > > >>>> going on
> > > > > >>>> under the hood. With this in mind, I thought it best to solve
> > > > > >>>> ordering and
> > > > > >>>> partitioning problems within the function and eliminate the
> > > > > >>>> requirement
> > > > > >>>> for
> > > > > >>>> users to do additional work after the fact to resolve the
> > results
> > > > > >>>> of their
> > > > > >>>> join. Basically, I am assuming that most users of the DSL just
> > > > > >>>> "want it to
> > > > > >>>> work" and want it to be easy. I did this operating under the
> > > > > >>>> assumption
> > > > > >>>> that if a user truly wants to optimize their own workflow down
> > to
> > > > the
> > > > > >>>> finest details then they will break from strictly using the
> DSL
> > > and
> > > > > >>>> move
> > > > > >>>> down to the processors API.
> > > > > >>>>
> > > > > >>> I think. The abstraction is not powerful enough
> > > > > >>> to not have kafka specifics leak up The leak I currently think
> > this
> > > > > >>> has is
> > > > > >>> that you can not reliable prevent the delete coming out first,
> > > > > >>> before you emit the correct new record. As it is an abstraction
> > > > > >>> entirely
> > > > > >>> around kafka.
> > > > > >>> I can only recommend to not to. Honesty and simplicity should
> > > always
> > > > be
> > > > > >>> first prio
> > > > > >>> trying to hide this just makes it more complex, less
> > understandable
> > > > and
> > > > > >>> will lead to mistakes
> > > > > >>> in usage.
> > > > > >>>
> > > > > >>> Exactly why I am also in big disfavour of GraphNodes and later
> > > > > >>> optimization stages.
> > > > > >>> Can someone give me an example of an optimisation that really
> > can't
> > > > be
> > > > > >>> handled by the user
> > > > > >>> constructing his topology differently?
> > > > > >>> Having reusable Processor API components accessible by the DSL
> > and
> > > > > >>> composable as
> > > > > >>> one likes is exactly where DSL should max out and KSQL should
> do
> > > the
> > > > > >>> next
> > > > > >>> step.
> > > > > >>> I find it very unprofessional from a software engineering
> > approach
> > > > > >>> to run
> > > > > >>> software where
> > > > > >>> you can not at least senseful reason about the inner workings
> of
> > > the
> > > > > >>> libraries used.
> > > > > >>> Gives this people have to read and understand in anyway, why
> try
> > to
> > > > > >>> hide
> > > > > >>> it?
> > > > > >>>
> > > > > >>> It really miss the beauty of 0.10 version DSL.
> > > > > >>> Apparently not a thing I can influence but just warn about.
> > > > > >>>
> > > > > >>> @gouzhang
> > > > > >>> you can't imagine how many extra IQ-Statestores I constantly
> > prune
> > > > from
> > > > > >>> stream app's
> > > > > >>> because people just keep passing Materialized's into all the
> > > > > >>> operations.
> > > > > >>> :D :'-(
> > > > > >>> I regret that I couldn't convince you guys back then. Plus this
> > > whole
> > > > > >>> entire topology as a floating
> > > > > >>> interface chain, never seen it anywhere :-/ :'(
> > > > > >>>
> > > > > >>> I don't know. I guess this is just me regretting to only have
> > > > 24h/day.
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> I updated the KIP today with some points worth talking about,
> > > should
> > > > > >>> anyone
> > > > > >>>> be so inclined to check it out. Currently we are running this
> > code
> > > > in
> > > > > >>>> production to handle relational joins from our Kafka Connect
> > > > > >>>> topics, as
> > > > > >>>> per
> > > > > >>>> the original motivation of the KIP.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> I believe the foreignKeyJoin should be responsible for. In my
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Tue, Aug 14, 2018 at 5:22 PM, Guozhang Wang<
> > [email protected]
> > > >
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>> Hello Adam,
> > > > > >>>>> As for your question regarding GraphNodes, it is for
> extending
> > > > > >>>>> Streams
> > > > > >>>>> optimization framework. You can find more details on
> > > > > >>>>> https://issues.apache.org/jira/browse/KAFKA-6761.
> > > > > >>>>>
> > > > > >>>>> The main idea is that instead of directly building up the
> > > "physical
> > > > > >>>>> topology" (represented as Topology in the public package, and
> > > > > >>>>> internally
> > > > > >>>>> built as the ProcessorTopology class) while users are
> > specifying
> > > > the
> > > > > >>>>> transformation operators, we first keep it as a "logical
> > > topology"
> > > > > >>>>> (represented as GraphNode inside InternalStreamsBuilder). And
> > > then
> > > > > >>>>> only
> > > > > >>>>> execute the optimization and the construction of the
> "physical"
> > > > > >>>>> Topology
> > > > > >>>>> when StreamsBuilder.build() is called.
> > > > > >>>>>
> > > > > >>>>> Back to your question, I think it makes more sense to add a
> new
> > > > > >>>>> type of
> > > > > >>>>> StreamsGraphNode (maybe you can consider inheriting from the
> > > > > >>>>> BaseJoinProcessorNode). Note that although in the Topology we
> > > will
> > > > > >>>>> have
> > > > > >>>>> multiple connected ProcessorNodes to represent a
> (foreign-key)
> > > > > >>>>> join, we
> > > > > >>>>> still want to keep it as a single StreamsGraphNode, or just a
> > > > > >>>>> couple of
> > > > > >>>>> them in the logical representation so that in the future we
> can
> > > > > >>>>> construct
> > > > > >>>>> the physical topology differently (e.g. having another way
> than
> > > the
> > > > > >>>>> current
> > > > > >>>>> distributed hash-join).
> > > > > >>>>>
> > > > > >>>>> -------------------------------------------------------
> > > > > >>>>>
> > > > > >>>>> Back to your questions to KIP-213, I think Jan has summarized
> > it
> > > > > >>>>> pretty-well. Note that back then we do not have headers
> support
> > > so
> > > > we
> > > > > >>>>> have
> > > > > >>>>> to do such "key-widening" approach to ensure ordering.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> Guozhang
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On Mon, Aug 13, 2018 at 11:39 PM, Jan
> > > > > >>>>> Filipiak<[email protected]>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>> Hi Adam,
> > > > > >>>>>> I love how you are on to this already! I resolve this by
> > > > > >>>>>> "key-widening"
> > > > > >>>>>> I
> > > > > >>>>>> treat the result of FKA,and FKB differently.
> > > > > >>>>>> As you can see the output of my join has a Combined Key and
> > > > > >>>>>> therefore I
> > > > > >>>>>> can resolve the "race condition" in a group by
> > > > > >>>>>> if I so desire.
> > > > > >>>>>>
> > > > > >>>>>> I think this reflects more what happens under the hood and
> > makes
> > > > > >>>>>> it more
> > > > > >>>>>> clear to the user what is going on. The Idea
> > > > > >>>>>> of hiding this behind metadata and handle it in the DSL is
> > from
> > > > > >>>>>> my POV
> > > > > >>>>>> unideal.
> > > > > >>>>>>
> > > > > >>>>>> To write into your example:
> > > > > >>>>>>
> > > > > >>>>>> key + A, null)
> > > > > >>>>>> (key +B, <joined On FK =B>)
> > > > > >>>>>>
> > > > > >>>>>> is what my output would look like.
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> Hope that makes sense :D
> > > > > >>>>>>
> > > > > >>>>>> Best Jan
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> On 13.08.2018 18:16, Adam Bellemare wrote:
> > > > > >>>>>>
> > > > > >>>>>> Hi Jan
> > > > > >>>>>>> If you do not use headers or other metadata, how do you
> > ensure
> > > > that
> > > > > >>>>>>> changes
> > > > > >>>>>>> to the foreign-key value are not resolved out-of-order?
> > > > > >>>>>>> ie: If an event has FK = A, but you change it to FK = B,
> you
> > > > > >>>>>>> need to
> > > > > >>>>>>> propagate both a delete (FK=A -> null) and an addition
> > (FK=B).
> > > > > >>>>>>> In my
> > > > > >>>>>>> solution, without maintaining any metadata, it is possible
> > for
> > > > the
> > > > > >>>>>>> final
> > > > > >>>>>>> output to be in either order - the correctly updated joined
> > > > > >>>>>>> value, or
> > > > > >>>>>>>
> > > > > >>>>>> the
> > > > > >>>>>> null for the delete.
> > > > > >>>>>>> (key, null)
> > > > > >>>>>>> (key, <joined On FK =B>)
> > > > > >>>>>>>
> > > > > >>>>>>> or
> > > > > >>>>>>>
> > > > > >>>>>>> (key, <joined On FK =B>)
> > > > > >>>>>>> (key, null)
> > > > > >>>>>>>
> > > > > >>>>>>> I looked back through your code and through the discussion
> > > > > >>>>>>> threads, and
> > > > > >>>>>>> didn't see any information on how you resolved this. I
> have a
> > > > > >>>>>>> version
> > > > > >>>>>>> of
> > > > > >>>>>>> my
> > > > > >>>>>>> code working for 2.0, I am just adding more integration
> tests
> > > > > >>>>>>> and will
> > > > > >>>>>>> update the KIP accordingly. Any insight you could provide
> on
> > > > > >>>>>>> resolving
> > > > > >>>>>>> out-of-order semantics without metadata would be helpful.
> > > > > >>>>>>>
> > > > > >>>>>>> Thanks
> > > > > >>>>>>> Adam
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> On Mon, Aug 13, 2018 at 3:34 AM, Jan Filipiak <
> > > > > >>>>>>> [email protected]
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> Hi,
> > > > > >>>>>>>
> > > > > >>>>>>>> Happy to see that you want to make an effort here.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Regarding the ProcessSuppliers I couldn't find a way to
> not
> > > > > >>>>>>>> rewrite
> > > > > >>>>>>>> the
> > > > > >>>>>>>> joiners + the merger.
> > > > > >>>>>>>> The re-partitioners can be reused in theory. I don't know
> if
> > > > > >>>>>>>>
> > > > > >>>>>>> repartition
> > > > > >>>>>> is optimized in 2.0 now.
> > > > > >>>>>>>> I made this
> > > > > >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 241+
> > > > > >>>>>>>> KTable+repartition+with+compacted+Topics
> > > > > >>>>>>>> back then and we are running KIP-213 with KIP-241 in
> > > > combination.
> > > > > >>>>>>>>
> > > > > >>>>>>>> For us it is vital as it minimized the size we had in our
> > > > > >>>>>>>> repartition
> > > > > >>>>>>>> topics plus it removed the factor of 2 in events on every
> > > > message.
> > > > > >>>>>>>> I know about this new  "delete once consumer has read it".
> > I
> > > > > >>>>>>>> don't
> > > > > >>>>>>>>
> > > > > >>>>>>> think
> > > > > >>>>>> 241 is vital for all usecases, for ours it is. I wanted
> > > > > >>>>>>>> to use 213 to sneak in the foundations for 241 aswell.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I don't quite understand what a PropagationWrapper is,
> but I
> > > am
> > > > > >>>>>>>> certain
> > > > > >>>>>>>> that you do not need RecordHeaders
> > > > > >>>>>>>> for 213 and I would try to leave them out. They either
> > belong
> > > > > >>>>>>>> to the
> > > > > >>>>>>>>
> > > > > >>>>>>> DSL
> > > > > >>>>>> or to the user, having a mixed use is
> > > > > >>>>>>>> to be avoided. We run the join with 0.8 logformat and I
> > don't
> > > > > >>>>>>>> think
> > > > > >>>>>>>> one
> > > > > >>>>>>>> needs more.
> > > > > >>>>>>>>
> > > > > >>>>>>>> This KIP will be very valuable for the streams project! I
> > > > couldn't
> > > > > >>>>>>>>
> > > > > >>>>>>> never
> > > > > >>>>>> convince myself to invest into the 1.0+ DSL
> > > > > >>>>>>>> as I used almost all my energy to fight against it. Maybe
> > this
> > > > can
> > > > > >>>>>>>> also
> > > > > >>>>>>>> help me see the good sides a little bit more.
> > > > > >>>>>>>>
> > > > > >>>>>>>> If there is anything unclear with all the text that has
> been
> > > > > >>>>>>>> written,
> > > > > >>>>>>>> feel
> > > > > >>>>>>>> free to just directly cc me so I don't miss it on
> > > > > >>>>>>>> the mailing list.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Best Jan
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> On 08.08.2018 15:26, Adam Bellemare wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> More followup, and +dev as Guozhang replied to me directly
> > > > > >>>>>>>> previously.
> > > > > >>>>>>>>
> > > > > >>>>>>>>> I am currently porting the code over to trunk. One of the
> > > major
> > > > > >>>>>>>>>
> > > > > >>>>>>>> changes
> > > > > >>>>>> since 1.0 is the usage of GraphNodes. I have a question
> about
> > > > this:
> > > > > >>>>>>>>> For a foreignKey joiner, should it have its own dedicated
> > > node
> > > > > >>>>>>>>> type?
> > > > > >>>>>>>>>
> > > > > >>>>>>>> Or
> > > > > >>>>>> would it be advisable to construct it from existing
> GraphNode
> > > > > >>>>>>>>> components?
> > > > > >>>>>>>>> For instance, I believe I could construct it from several
> > > > > >>>>>>>>> OptimizableRepartitionNode, some SinkNode, some
> SourceNode,
> > > and
> > > > > >>>>>>>>>
> > > > > >>>>>>>> several
> > > > > >>>>>> StatefulProcessorNode. That being said, there is some
> > underlying
> > > > > >>>>>>>>> complexity
> > > > > >>>>>>>>> to each approach.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I will be switching the KIP-213 to use the RecordHeaders
> in
> > > > Kafka
> > > > > >>>>>>>>> Streams
> > > > > >>>>>>>>> instead of the PropagationWrapper, but conceptually it
> > should
> > > > > >>>>>>>>> be the
> > > > > >>>>>>>>> same.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Again, any feedback is welcomed...
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Mon, Jul 30, 2018 at 9:38 AM, Adam Bellemare <
> > > > > >>>>>>>>> [email protected]
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Hi Guozhang et al
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I was just reading the 2.0 release notes and noticed a
> > > section
> > > > on
> > > > > >>>>>>>>>> Record
> > > > > >>>>>>>>>> Headers.
> > > > > >>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > >>>>>>>>>> 244%3A+Add+Record+Header+support+to+Kafka+Streams+
> > > > Processor+API
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I am not yet sure if the contents of a RecordHeader is
> > > > > >>>>>>>>>> propagated
> > > > > >>>>>>>>>> all
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>> way through the Sinks and Sources, but if it is, and if
> it
> > > > > >>>>>>>>>> remains
> > > > > >>>>>>>>>> attached
> > > > > >>>>>>>>>> to the record (including null records) I may be able to
> > > ditch
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>> propagationWrapper for an implementation using
> > RecordHeader.
> > > > > >>>>>>>>>> I am
> > > > > >>>>>>>>>> not
> > > > > >>>>>>>>>> yet
> > > > > >>>>>>>>>> sure if this is doable, so if anyone understands
> > > RecordHeader
> > > > > >>>>>>>>>> impl
> > > > > >>>>>>>>>> better
> > > > > >>>>>>>>>> than I, I would be happy to hear from you.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> In the meantime, let me know of any questions. I believe
> > > this
> > > > > >>>>>>>>>> PR has
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>> a
> > > > > >>>>>> lot
> > > > > >>>>>>>>>> of potential to solve problems for other people, as I
> have
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>> encountered
> > > > > >>>>>> a
> > > > > >>>>>>>>>> number of other companies in the wild all home-brewing
> > their
> > > > own
> > > > > >>>>>>>>>> solutions
> > > > > >>>>>>>>>> to come up with a method of handling relational data in
> > > > streams.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Adam
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Fri, Jul 27, 2018 at 1:45 AM, Guozhang
> > > > > >>>>>>>>>> Wang<[email protected]>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Hello Adam,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Thanks for rebooting the discussion of this KIP ! Let me
> > > > > >>>>>>>>>> finish my
> > > > > >>>>>>>>>>> pass
> > > > > >>>>>>>>>>> on the wiki and get back to you soon. Sorry for the
> > > delays..
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Guozhang
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Tue, Jul 24, 2018 at 6:08 AM, Adam Bellemare <
> > > > > >>>>>>>>>>> [email protected]
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Let me kick this off with a few starting points that I
> > > > > >>>>>>>>>>>> would like
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>> to
> > > > > >>>>>> generate some discussion on.
> > > > > >>>>>>>>>>>> 1) It seems to me that I will need to repartition the
> > data
> > > > > >>>>>>>>>>>> twice -
> > > > > >>>>>>>>>>>> once
> > > > > >>>>>>>>>>>> on
> > > > > >>>>>>>>>>>> the foreign key, and once back to the primary key. Is
> > > there
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>> anything
> > > > > >>>>>> I
> > > > > >>>>>>>>>>>> am
> > > > > >>>>>>>>>>>> missing here?
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> 2) I believe I will also need to materialize 3 state
> > > > > >>>>>>>>>>>> stores: the
> > > > > >>>>>>>>>>>> prefixScan
> > > > > >>>>>>>>>>>> SS, the highwater mark SS (for out-of-order
> resolution)
> > > and
> > > > > >>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>> final
> > > > > >>>>>> state
> > > > > >>>>>>>>>>>> store, due to the workflow I have laid out. I have not
> > > > > >>>>>>>>>>>> thought of
> > > > > >>>>>>>>>>>> a
> > > > > >>>>>>>>>>>> better
> > > > > >>>>>>>>>>>> way yet, but would appreciate any input on this
> matter.
> > I
> > > > have
> > > > > >>>>>>>>>>>> gone
> > > > > >>>>>>>>>>>> back
> > > > > >>>>>>>>>>>> through the mailing list for the previous discussions
> on
> > > > > >>>>>>>>>>>> this KIP,
> > > > > >>>>>>>>>>>> and
> > > > > >>>>>>>>>>>> I
> > > > > >>>>>>>>>>>> did not see anything relating to resolving
> out-of-order
> > > > > >>>>>>>>>>>> compute. I
> > > > > >>>>>>>>>>>> cannot
> > > > > >>>>>>>>>>>> see a way around the current three-SS structure that I
> > > have.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> 3) Caching is disabled on the prefixScan SS, as I do
> not
> > > > > >>>>>>>>>>>> know how
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>> to
> > > > > >>>>>> resolve the iterator obtained from rocksDB with that of the
> > > cache.
> > > > > >>>>>>>>>>> In
> > > > > >>>>>> addition, I must ensure everything is flushed before
> scanning.
> > > > > >>>>>>>>>>> Since
> > > > > >>>>>> the
> > > > > >>>>>>>>>>>> materialized prefixScan SS is under "control" of the
> > > > > >>>>>>>>>>>> function, I
> > > > > >>>>>>>>>>>> do
> > > > > >>>>>>>>>>>> not
> > > > > >>>>>>>>>>>> anticipate this to be a problem. Performance
> throughput
> > > > > >>>>>>>>>>>> will need
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>> to
> > > > > >>>>>> be
> > > > > >>>>>>>>>>>> tested, but as Jan observed in his initial overview of
> > > this
> > > > > >>>>>>>>>>>> issue,
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>> it
> > > > > >>>>>> is
> > > > > >>>>>>>>>>>> generally a surge of output events which affect
> > > performance
> > > > > >>>>>>>>>>>> moreso
> > > > > >>>>>>>>>>>> than
> > > > > >>>>>>>>>>>> the
> > > > > >>>>>>>>>>>> flush or prefixScan itself.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Thoughts on any of these are greatly appreciated,
> since
> > > > these
> > > > > >>>>>>>>>>>> elements
> > > > > >>>>>>>>>>>> are
> > > > > >>>>>>>>>>>> really the cornerstone of the whole design. I can put
> up
> > > > > >>>>>>>>>>>> the code
> > > > > >>>>>>>>>>>> I
> > > > > >>>>>>>>>>>> have
> > > > > >>>>>>>>>>>> written against 1.0.2 if we so desire, but first I was
> > > > > >>>>>>>>>>>> hoping to
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>> just
> > > > > >>>>>> tackle some of the fundamental design proposals.
> > > > > >>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>> Adam
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Mon, Jul 23, 2018 at 10:05 AM, Adam Bellemare <
> > > > > >>>>>>>>>>>> [email protected]>
> > > > > >>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Here is the new discussion thread for KIP-213. I
> picked
> > > > > >>>>>>>>>>>> back up on
> > > > > >>>>>>>>>>>> the
> > > > > >>>>>>>>>>>> KIP
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> as this is something that we too at Flipp are now
> > running
> > > in
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> production.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Jan started this last year, and I know that Trivago is
> > > also
> > > > > >>>>>>>>>>>> using
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> something
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>> similar in production, at least in terms of APIs and
> > > > > >>>>>>>>>>>> functionality.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> https://cwiki.apache.org/
> confluence/display/KAFKA/KIP-
> > > > > >>>>>>>>>>>>> 213+Support+non-key+joining+in+KTable
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> I do have an implementation of the code for Kafka
> 1.0.2
> > > > (our
> > > > > >>>>>>>>>>>>> local
> > > > > >>>>>>>>>>>>> production version) but I won't post it yet as I
> would
> > > > > >>>>>>>>>>>>> like to
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>> focus
> > > > > >>>>>> on the
> > > > > >>>>>>>>>>>> workflow and design first. That being said, I also
> need
> > to
> > > > add
> > > > > >>>>>>>>>>>> some
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> clearer
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>> integration tests (I did a lot of testing using a
> > > non-Kafka
> > > > > >>>>>>>>>>>> Streams
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> framework) and clean up the code a bit more before
> > > putting
> > > > > >>>>>>>>>>>>> it in
> > > > > >>>>>>>>>>>>> a
> > > > > >>>>>>>>>>>>> PR
> > > > > >>>>>>>>>>>>> against trunk (I can do so later this week likely).
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Please take a look,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Thanks
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Adam Bellemare
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> --
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>> -- Guozhang
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> --
> > > > > >>>>> -- Guozhang
> > > > > >>>>>
> > > > > >>>>>
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang

Re: KIP-213 - Scalable/Usable Foreign-Key KTable joins - Rebooted.

Reply via email to