Fwd: Complex graph-based sessionization (potential use for stateful functions)

Robert Metzger Tue, 31 Mar 2020 08:13:40 -0700

Forwarding Seth's answer to the list

---------- Forwarded message ---------
From: Seth Wiesman <s...@ververica.com>
Date: Tue, Mar 31, 2020 at 4:47 PM
Subject: Re: Complex graph-based sessionization (potential use for stateful
functions)
To: Krzysztof Zarzycki <k.zarzy...@gmail.com>
Cc: user <user@flink.apache.org>, <rmetz...@apache.org>

Hi Krzysztof,

This is a great use case for Stateful Functions. I have actually been
considering adding a graph algorithm example to the statefun repo for some
time now.

StateFun does use iteration under the hood and provides exactly-once
guarantees. In-flight records will never be lost in the case of failure.
>From a user code perspective, the api offers arbitrary message passing
between different functions (stateful virtual actors).

For a rough sketch of what this would look like; you could create a
function called `Vertex` that represents a single vertice on the graph. Its
state would the edges, all vertices reachable from that point, We now have
a distributed, fault-tolerant, adjacency list. You can implement whatever
graph algorithm you like on top of this structure. Walking the graph would
just be starting from a point, and messaging the vertices stored in state.

Just in case you are not aware, the community is currently in the process
of releasing the first Apache release of StateFun and it should hopefully
be out by the end of this week. Just to say the API is stable and you can
start developing on top of it.

On Mon, Mar 30, 2020 at 6:00 PM Krzysztof Zarzycki <k.zarzy...@gmail.com>
wrote:

> Hi!  Interesting problem to solve ahead :)
> I need to implement a streaming sessionization algorithm (split stream of
> events into groups of correlated events). It's pretty non-standard as we
> DON'T have a key like user id which separates the stream into substreams
> which we just need to chunk based on time.
> Instead and simplifying a lot, our events bear tuples, that I compare to
> graph edges, e.g.:
> event 1: A -> B
> event 2: B -> C
> event 3: D -> E
> event 4: D -> F
> event 5: G -> F
> I need to group them into subgroups reachable by following these edges
> from some specific nodes. E.g. here:
> { A->B, B->C}
> { D->E, D->F}
> { G->F }
> (note: I need to group the events, which are represented by edges here,
> not the nodes).
> As far as I understand, to solve this problem I need to leverage feedback
> loops/iterations feature in Flink (Generally I believe I need to apply a
> Bulk Synchronous Processing approach).
>
> Does anyone have seen this kind of sessionization implemented in the wild?
> Would you suggest implementing such an algorithm using *stateful
> functions*? (AFAIK, they use feedback loops underneath). Can you suggest
> how would these be used here?
> I know there are some problems with checkpointing when using iterations,
> does it mean the implementation may experience data loss on stops?
>
> Side comment: I'm not sure which graph algorithm derivative needs to be
> applied here, but the candidate is transitive closure.
>
> Thanks for joining the discussion!
> Krzysztof
>
>

-- 

Seth Wiesman | Solutions Architect

+1 314 387 1463

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

Fwd: Complex graph-based sessionization (potential use for stateful functions)

Reply via email to