Hello everyone,

I have a question about implementing a join of N datastreams (where N > 2)
without any time guarantees. According to my requirements, late data is not
tolerable, so if I have a join between stream A and stream B and a message
with key X arrives in stream B one year after arriving in stream A, it
still should be able to match and emit the result.

I'm currently working on a pipeline with 9 sources and 2 intermediate
joins, so the DAG looks like this:

stream 1 ----->
stream 2 ----->
stream 3 -----> join1 ----->
stream 4 ----->
stream 5 ----->
stream 6 -----> join2 ----->
stream 7 ------------------>
stream 8 ------------------>
stream 9 ------------------> final join

I'm approaching this by creating state variables for all inputs and using
them for lookups (I'm fine with constantly growing keyspace). In terms of
actually joining data streams I see 3 options:
- coGroup with a GlobalWindow and triggers
- connect
- union

coGroup and connect only support two inputs, so implementing the DAG I
described above is very cumbersome. Also, coGroup needs windowing semantics
(even if it's just a single GlobalWindow), and in my experience with other
frameworks, this could add some overhead.

union only works on inputs of the same type, but this can be solved by
introducing a wrapper class.

I'm able to get the union approach working, but I'm still not sure if it's
the best way to implement this pipeline. Any suggestions?

Thank you!

Reply via email to