I am looking for some design advice for a new Flink application. I am relatively new to Flink - I have only one, fairly straightforward Flink application in production so far.
For this new application, I want to create a three-stage processing pipeline. Functionally, I am seeing this as ONE long datastream, but I have to evaluate the STAGE-1 data in a special manner and then pass that evaluation on to STAGE-2, where it will do its own special evaluation, using the STAGE-1 results to shape it. The same thing happens again in STAGE-3, using the STAGE-2 evaluation results. Finally, the end result is published to Kafka. The stages functionally look like this:

STAGE-1
KafkaSource |=> Keyby => TumblingWindows1 => ProcessWindowFn => SideOutput-1 |=> SessionWindow1 => ProcessWindowFn => (SideOutput-2[WindowRecords], KafkaSink[EvalResult])
            |=================> WindowAll => ProcessWindowFn => SideOutput-1 ^

STAGE-2
SideOutput-2 => Keyby => TumblingWindows2 => ProcessWindowFn => SideOutput-3 => SessionWindow2 => ProcessWindowFn => (SideOutput-4[WindowRecords], KafkaSink[EvalResult])

STAGE-3
SideOutput-4 => Keyby => TumblingWindows3 => ProcessWindowFn => SideOutput-5 => SessionWindow3 => ProcessWindowFn => KafkaSink

DESCRIPTION
In STAGE-1 there is a fixed number of known keys, so I will only see at most about 21 distinct keys and therefore up to 21 one-minute tumbling windows. I also need to aggregate all of the data in a global window to get an overall, non-keyed result. I then need to bring the 21 results from those tumbling windows AND the one global result into one place where I can compare each of the 21 window results to the global result. Based on this evaluation, only some of the 21 window results will survive the test. I want to take the data records from those, say, 3 surviving windows, make them the "source" for STAGE-2 processing, and also publish some intermediate evaluation results to a KafkaSink.

STAGE-2 will reprocess the same data records that the three surviving STAGE-1 windows processed, only keyed by different dimensions. I expect around 4000 fairly small records for each of the 21 STAGE-1 windows, so in this example I would be sending 4000 x 3 = 12000 records via SideOutput-2 to form the new "source" datastream for STAGE-2.

Where I am struggling is:

1. Figuring out how best to bring the output of the 21 STAGE-1 windows and the one WindowAll window into a single point (I propose SessionWindow1) so that I can compare each of the 21 window results with the non-keyed WindowAll result.
2. The best way to connect these multiple stages together.

The STAGE-1 approach illustrated above is my attempt at using side outputs to:

1. Form a new "source" data stream that contains the outputs of each of the 21 windows and the WindowAll data.
2. Consume that stream into a single session window.
3. Evaluate each of the 21 keyed window results against the overall WindowAll result.
4. Emit only the data from the 3 surviving tumbling windows from the ProcessWindowFn to SideOutput-2, and the evaluation results to Kafka.
5. Let SideOutput-2 form the new data stream "source" for STAGE-2, where a similar process repeats, passing data to STAGE-3 (again similar processing) to finally obtain the desired result, which is published to Kafka.

I would greatly appreciate the following:

1. Comments on whether this is a valid approach - am I on the right track here?
2. If this is problematic, an alternate approach that I could investigate.
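To make the intended STAGE-1 wiring concrete, here is a minimal sketch of what I have in mind (Flink DataStream API, Java). Please treat it as a structural sketch, not a finished implementation: Event, StageOneResult, EvalResult, the category key, and the "survival" test are made-up placeholders for my real types and logic, the KafkaSource/KafkaSink construction is elided, and a plain union of the two result streams stands in for SideOutput-1 as the merge point.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class StageOneSketch {

    // Hypothetical domain types, stand-ins for my real records.
    public static class Event { public String category = "k1"; public double value = 1.0; public long ts = 0L; }
    public static class StageOneResult {
        public String key;                       // null marks the global (WindowAll) result
        public double aggregate;
        public List<Event> records = new ArrayList<>();
    }
    public static class EvalResult { public String key; public boolean survived; public double delta; }

    // SideOutput-2: surviving raw records that become the STAGE-2 "source".
    public static final OutputTag<Event> SIDE_OUTPUT_2 = new OutputTag<Event>("stage2-records") {};

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // KafkaSource construction elided; a single dummy element stands in for the real topic.
        DataStream<Event> events = env.fromElements(new Event())
            .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                .withTimestampAssigner((e, ts) -> e.ts));

        // STAGE-1a: per-key one-minute tumbling windows (at most ~21 keys => ~21 results per minute).
        DataStream<StageOneResult> keyedResults = events
            .keyBy(e -> e.category)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .process(new ProcessWindowFunction<Event, StageOneResult, String, TimeWindow>() {
                @Override
                public void process(String key, Context ctx, Iterable<Event> in, Collector<StageOneResult> out) {
                    StageOneResult r = new StageOneResult();
                    r.key = key;
                    for (Event e : in) { r.aggregate += e.value; r.records.add(e); }
                    out.collect(r);
                }
            });

        // STAGE-1b: the non-keyed WindowAll result over the same minute.
        DataStream<StageOneResult> globalResult = events
            .windowAll(TumblingEventTimeWindows.of(Time.minutes(1)))
            .process(new ProcessAllWindowFunction<Event, StageOneResult, TimeWindow>() {
                @Override
                public void process(Context ctx, Iterable<Event> in, Collector<StageOneResult> out) {
                    StageOneResult r = new StageOneResult();   // key stays null => global marker
                    for (Event e : in) { r.aggregate += e.value; }
                    out.collect(r);
                }
            });

        // STAGE-1c: union plays the role of SideOutput-1 -- both result streams meet in one place.
        // All results for the same minute carry the same window-end timestamp, so a short session
        // gap groups the ~22 of them while keeping consecutive minutes apart.
        SingleOutputStreamOperator<EvalResult> evaluated = keyedResults
            .union(globalResult)
            .windowAll(EventTimeSessionWindows.withGap(Time.seconds(10)))
            .process(new ProcessAllWindowFunction<StageOneResult, EvalResult, TimeWindow>() {
                @Override
                public void process(Context ctx, Iterable<StageOneResult> in, Collector<EvalResult> out) {
                    StageOneResult global = null;
                    List<StageOneResult> perKey = new ArrayList<>();
                    for (StageOneResult r : in) { if (r.key == null) global = r; else perKey.add(r); }
                    for (StageOneResult r : perKey) {
                        EvalResult e = new EvalResult();
                        e.key = r.key;
                        e.delta = (global == null) ? 0.0 : r.aggregate - global.aggregate;
                        e.survived = e.delta > 0;              // placeholder for my real evaluation
                        out.collect(e);                        // EvalResult -> KafkaSink (sink elided)
                        if (e.survived) {
                            for (Event rec : r.records) { ctx.output(SIDE_OUTPUT_2, rec); }
                        }
                    }
                }
            });

        // SideOutput-2 becomes the STAGE-2 "source", re-keyed by a different dimension.
        DataStream<Event> stageTwoSource = evaluated.getSideOutput(SIDE_OUTPUT_2);
        stageTwoSource.print();                                // STAGE-2/3 wiring would repeat this pattern

        env.execute("stage-1 sketch");
    }
}
```

The part I am least sure about is STAGE-1c, i.e. question 1 above: whether collecting the 21 keyed results plus the one WindowAll result for the same minute in a single session window like this is the right way to line them up for comparison.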
I am trying to build a Flink application that follows best practices, so I am just looking for some confirmation that I am heading down a reasonable path with this design. Thank you in advance, Mark