Hi Yaroslav,
I think your approach is correct. Union works well for implementing multiway
joins if you normalize the type of all streams first. The common type can
simply be a composite type with the key and a member variable for each stream,
where only one of those variables is non-null. A keyed process function
can then perform the actual joining.
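Roughly, the joining logic could look like the sketch below. This is plain Java with no Flink dependency, so it only models the idea: `Tagged` and `JoinBuffer` are made-up names, three streams with String payloads are an assumption, and the HashMap stands in for what would be keyed state inside the keyed process function.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical composite type: the key plus one field per input stream.
// Exactly one of a, b, c is non-null on any given record.
class Tagged {
    final String key;
    final String a; // payload from stream A, or null
    final String b; // payload from stream B, or null
    final String c; // payload from stream C, or null

    private Tagged(String key, String a, String b, String c) {
        this.key = key;
        this.a = a;
        this.b = b;
        this.c = c;
    }

    static Tagged fromA(String key, String v) { return new Tagged(key, v, null, null); }
    static Tagged fromB(String key, String v) { return new Tagged(key, null, v, null); }
    static Tagged fromC(String key, String v) { return new Tagged(key, null, null, v); }
}

// Stand-in for the keyed process function (assumption: a plain map replaces keyed state).
class JoinBuffer {
    // Per key: latest value seen from each stream (index 0 = A, 1 = B, 2 = C).
    // Entries are never removed, so a match is still possible arbitrarily late.
    private final Map<String, String[]> state = new HashMap<>();

    // Stores the incoming side, then returns the joined row if all three
    // sides are present for this key; returns null otherwise.
    String process(Tagged in) {
        String[] slots = state.computeIfAbsent(in.key, k -> new String[3]);
        if (in.a != null) slots[0] = in.a;
        if (in.b != null) slots[1] = in.b;
        if (in.c != null) slots[2] = in.c;
        if (slots[0] == null || slots[1] == null || slots[2] == null) {
            return null;
        }
        return in.key + ":" + slots[0] + "," + slots[1] + "," + slots[2];
    }
}
```

Because nothing is ever evicted, the state grows with the keyspace. In this sketch a later update for an already complete key emits a new joined row with the updated value.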
Regards,
Timo
On 25.02.21 23:29, Yaroslav Tkachenko wrote:
Hello everyone,
I have a question about implementing a join of N datastreams (where N >
2) without any time guarantees. According to my requirements, dropping late
data is not tolerable, so if I have a join between stream A and stream B and
a message with key X arrives in stream B one year after arriving in
stream A, it should still be able to match and emit the result.
I'm currently working on a pipeline with 9 sources and 2 intermediate
joins, so the DAG looks like this:
stream 1 ----->
stream 2 ----->
stream 3 -----> join1 ----->
stream 4 ----->
stream 5 ----->
stream 6 -----> join2 ----->
stream 7 ------------------>
stream 8 ------------------>
stream 9 ------------------> final join
I'm approaching this by creating state variables for all inputs and
using them for lookups (I'm fine with a constantly growing keyspace). In
terms of actually joining data streams, I see 3 options:
- coGroup with a GlobalWindow and triggers
- connect
- union
coGroup and connect only support two inputs, so implementing the DAG I
described above is very cumbersome. Also, coGroup needs windowing
semantics (even if it's just a single GlobalWindow), and in my
experience with other frameworks, this could add some overhead.
union only works on inputs of the same type, but this can be solved by
introducing a wrapper class.
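For illustration, the wrapper idea might look something like the sketch below. It is plain Java without Flink, and `Order`, `Payment`, `Event` and `UnionSketch` are hypothetical names, not classes from the actual pipeline. Each input is mapped to the common wrapper type, after which the inputs can be combined into one collection, which here stands in for a union of streams.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record types for two differently typed inputs.
class Order {
    final String id;
    final int amount;
    Order(String id, int amount) { this.id = id; this.amount = amount; }
}

class Payment {
    final String orderId;
    final String method;
    Payment(String orderId, String method) { this.orderId = orderId; this.method = method; }
}

// Hypothetical wrapper: a single type carrying the join key, a tag saying
// which input the record came from, and the original record as payload.
class Event {
    enum Source { ORDER, PAYMENT }

    final Source source;
    final String key;
    final Object payload;

    Event(Source source, String key, Object payload) {
        this.source = source;
        this.key = key;
        this.payload = payload;
    }
}

class UnionSketch {
    // Models "map each input to the wrapper, then union". A real stream union
    // interleaves records as they arrive; this sketch simply concatenates.
    static List<Event> union(List<Order> orders, List<Payment> payments) {
        List<Event> out = new ArrayList<>();
        for (Order o : orders) {
            out.add(new Event(Event.Source.ORDER, o.id, o));
        }
        for (Payment p : payments) {
            out.add(new Event(Event.Source.PAYMENT, p.orderId, p));
        }
        return out;
    }
}
```

The tag lets downstream code cast the payload back to its original type. The cost is losing compile-time type safety on `payload`, which a wrapper with one typed field per input would keep.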
I'm able to get the union approach working, but I'm still not sure if
it's the best way to implement this pipeline. Any suggestions?
Thank you!