Hi Yaroslav,

I think your approach is correct. Union is perfect to implement multiway joins if you normalize the type of all streams before. It can simply be a composite type with the key and a member variable for each stream where only one of those variables is not null. A keyed process function can then perform the actual joining.

Regards,
Timo


On 25.02.21 23:29, Yaroslav Tkachenko wrote:
Hello everyone,

I have a question about implementing a join of N datastreams (where N > 2) without any time guarantees. According to my requirements, late data is not tolerable, so if I have a join between stream A and stream B and a message with key X arrives in stream B one year after arriving in stream A, it still should be able to match and emit the result.

I'm currently working on a pipeline with 9 sources and 2 intermediate joins, so the DAG looks like this:

stream 1 ----->
stream 2 ----->
stream 3 -----> join1 ----->
stream 4 ----->
stream 5 ----->
stream 6 -----> join2 ----->
stream 7 ------------------>
stream 8 ------------------>
stream 9 ------------------> final join

I'm approaching this by creating state variables for all inputs and using them for lookups (I'm fine with constantly growing keyspace). In terms of actually joining data streams I see 3 options:
- coGroup with a GlobalWindow and triggers
- connect
- union

coGroup and connect only support two inputs, so implementing the DAG I described above is very cumbersome. Also, coGroup needs windowing semantics (even if it's just a single GlobalWindow), and in my experience with other frameworks, this could add some overhead.

union only works on inputs of the same type, but this can be solved by introducing a wrapper class.

I'm able to get the union approach working, but I'm still not sure if it's the best way to implement this pipeline. Any suggestions?

Thank you!


Reply via email to