Hello everyone,

I have a question about implementing a join of N datastreams (where N > 2) with no bound on how far apart matching records may arrive. According to my requirements, dropping late data is not acceptable: if I join stream A with stream B and a message with key X arrives in stream B one year after it arrived in stream A, it should still match and emit a result.

I'm currently working on a pipeline with 9 sources and 2 intermediate joins, so the DAG looks like this:

    stream 1 ----->
    stream 2 ----->
    stream 3 -----> join1 ----->
    stream 4 ----->
    stream 5 ----->
    stream 6 -----> join2 ----->
    stream 7 ------------------>
    stream 8 ------------------>
    stream 9 ------------------> final join

I'm approaching this by creating state variables for all inputs and using them for lookups (I'm fine with a constantly growing keyspace).

In terms of actually joining data streams I see 3 options:

- coGroup with a GlobalWindow and triggers
- connect
- union

coGroup and connect only support two inputs, so implementing the DAG I described above is very cumbersome. Also, coGroup needs windowing semantics (even if it's just a single GlobalWindow), and in my experience with other frameworks, this could add some overhead.

union only works on inputs of the same type, but this can be solved by introducing a wrapper class.

I'm able to get the union approach working, but I'm still not sure if it's the best way to implement this pipeline. Any suggestions?

Thank you!
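
To make the union option concrete, here is a minimal sketch of the idea in plain Java, without any framework dependency. It is not the actual pipeline code: the names `Tagged` and `Joiner` are hypothetical, the `HashMap` only stands in for per-key state that is never expired, and keeping just the latest record per source is an assumption. In Flink, the same logic would presumably live in a keyed process function on the unioned stream, with keyed state replacing the map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class UnionJoinSketch {

    /** Hypothetical wrapper so that differently typed inputs can share one unioned stream. */
    static final class Tagged {
        final int source;      // index of the originating stream, 0..n-1
        final String key;      // join key
        final Object payload;  // original record (assumed non-null)

        Tagged(int source, String key, Object payload) {
            this.source = source;
            this.key = key;
            this.payload = payload;
        }
    }

    /** Stand-in for keyed state: remembers the latest record per source for each key, forever. */
    static final class Joiner {
        private final int numSources;
        private final Map<String, Object[]> stateByKey = new HashMap<>();

        Joiner(int numSources) {
            this.numSources = numSources;
        }

        /**
         * Stores the element, then returns the joined row if every source
         * has contributed at least one record for this key, else empty.
         * Assumption: a later record from the same source overwrites the
         * earlier one and triggers a fresh emit.
         */
        Optional<List<Object>> process(Tagged in) {
            Object[] slots = stateByKey.computeIfAbsent(in.key, k -> new Object[numSources]);
            slots[in.source] = in.payload;

            List<Object> row = new ArrayList<>(numSources);
            for (Object slot : slots) {
                if (slot == null) {
                    return Optional.empty(); // still waiting on at least one source
                }
                row.add(slot);
            }
            return Optional.of(row);
        }
    }

    public static void main(String[] args) {
        Joiner joiner = new Joiner(3);
        // Arrival order and delay do not matter; state is kept until all sides show up.
        System.out.println(joiner.process(new Tagged(0, "X", "a"))); // Optional.empty
        System.out.println(joiner.process(new Tagged(2, "X", "c"))); // Optional.empty
        System.out.println(joiner.process(new Tagged(1, "X", "b"))); // Optional[[a, b, c]]
    }
}
```

With a wrapper like this, all 9 inputs could in principle be unioned into a single operator rather than chained through pairwise joins, at the cost of losing static typing on the payload.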