Forwarding Seth's answer to the list ---------- Forwarded message --------- From: Seth Wiesman <s...@ververica.com> Date: Tue, Mar 31, 2020 at 4:47 PM Subject: Re: Complex graph-based sessionization (potential use for stateful functions) To: Krzysztof Zarzycki <k.zarzy...@gmail.com> Cc: user <user@flink.apache.org>, <rmetz...@apache.org>
Hi Krzysztof, This is a great use case for Stateful Functions. I have actually been considering adding a graph algorithm example to the statefun repo for some time now. StateFun does use iteration under the hood and provides exactly-once guarantees. In-flight records will never be lost in the case of failure. >From a user code perspective, the api offers arbitrary message passing between different functions (stateful virtual actors). For a rough sketch of what this would look like; you could create a function called `Vertex` that represents a single vertice on the graph. Its state would the edges, all vertices reachable from that point, We now have a distributed, fault-tolerant, adjacency list. You can implement whatever graph algorithm you like on top of this structure. Walking the graph would just be starting from a point, and messaging the vertices stored in state. Just in case you are not aware, the community is currently in the process of releasing the first Apache release of StateFun and it should hopefully be out by the end of this week. Just to say the API is stable and you can start developing on top of it. On Mon, Mar 30, 2020 at 6:00 PM Krzysztof Zarzycki <k.zarzy...@gmail.com> wrote: > Hi! Interesting problem to solve ahead :) > I need to implement a streaming sessionization algorithm (split stream of > events into groups of correlated events). It's pretty non-standard as we > DON'T have a key like user id which separates the stream into substreams > which we just need to chunk based on time. > Instead and simplifying a lot, our events bear tuples, that I compare to > graph edges, e.g.: > event 1: A -> B > event 2: B -> C > event 3: D -> E > event 4: D -> F > event 5: G -> F > I need to group them into subgroups reachable by following these edges > from some specific nodes. E.g. here: > { A->B, B->C} > { D->E, D->F} > { G->F } > (note: I need to group the events, which are represented by edges here, > not the nodes). > As far as I understand, to solve this problem I need to leverage feedback > loops/iterations feature in Flink (Generally I believe I need to apply a > Bulk Synchronous Processing approach). > > Does anyone have seen this kind of sessionization implemented in the wild? > Would you suggest implementing such an algorithm using *stateful > functions*? (AFAIK, they use feedback loops underneath). Can you suggest > how would these be used here? > I know there are some problems with checkpointing when using iterations, > does it mean the implementation may experience data loss on stops? > > Side comment: I'm not sure which graph algorithm derivative needs to be > applied here, but the candidate is transitive closure. > > Thanks for joining the discussion! > Krzysztof > > -- Seth Wiesman | Solutions Architect +1 314 387 1463 <https://www.ververica.com/> Follow us @VervericaData -- Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference Stream Processing | Event Driven | Real Time