Hi,

Imagine I have one class having 4 fields: ID, A, B, C.  There are three
data sources providing data in the form of (ID, A), (ID, B), (ID, C)
respectively. I want to join these three data sources to get final (ID, A,
B, C) without any window. For example, (ID, A) could come one month after
(ID, B). Such joining needs global states. There are two designs in my mind.

1. Stream connect with separated kafka topic
streamA_B = DataSourceA connect DataSourceB
streamA_B_C = streamA_B connect DataSourceC

Each data source is ingested via a dedicated kafka topic. This design seems
not scalable because I need N stream connect operations for N+1 data
sources. Each stream connect needs to maintain a global state. For example,
streamA_B needs a global state for maintaining (ID, A, B) and streamA_B_C
needs another for maintaining (ID, A, B, C).

2. Shared kafka topic
All data sources are ingested via a shared kafka topic (using union event
type or schema reference). Then one Flink job can handle all events from
these data sources by maintaining one global state. This design seems more
scalable than solution 1.

Which one is recommended? Is there a better way that is missed? Appreciate
very much for any hints!

Reply via email to