Hi! Interesting problem to solve ahead :) I need to implement a streaming sessionization algorithm (split stream of events into groups of correlated events). It's pretty non-standard as we DON'T have a key like user id which separates the stream into substreams which we just need to chunk based on time. Instead and simplifying a lot, our events bear tuples, that I compare to graph edges, e.g.: event 1: A -> B event 2: B -> C event 3: D -> E event 4: D -> F event 5: G -> F I need to group them into subgroups reachable by following these edges from some specific nodes. E.g. here: { A->B, B->C} { D->E, D->F} { G->F } (note: I need to group the events, which are represented by edges here, not the nodes). As far as I understand, to solve this problem I need to leverage feedback loops/iterations feature in Flink (Generally I believe I need to apply a Bulk Synchronous Processing approach).
Does anyone have seen this kind of sessionization implemented in the wild? Would you suggest implementing such an algorithm using *stateful functions*? (AFAIK, they use feedback loops underneath). Can you suggest how would these be used here? I know there are some problems with checkpointing when using iterations, does it mean the implementation may experience data loss on stops? Side comment: I'm not sure which graph algorithm derivative needs to be applied here, but the candidate is transitive closure. Thanks for joining the discussion! Krzysztof