I'm evaluating Flink for a reporting application that will keep various aggregates updated in a database. It will be consuming from Kafka queues that are replicated from remote data centers, so in case there is a long outage in replication, I need to decide what to do about windowing and late data.
If I use Flink's built-in windows and watermarks, any late data will be come in 1-element windows, which could overwhelm the database if a large batch of late data comes in and they are each mapped to individual database updates. As far as I can tell, I have two options: 1. Ignore late data, by marking it as late in an AssignerWithPunctuatedWatermarks function, and then discarding it in a flatMap operator. In this scenario, I would rely on a batch process to fill in the missing data later, in the lambda architecture style. 2. Implement my own watermark logic to allow full windows of late data. It seems like I could, for example, emit a "tick" message that is replicated to all partitions every n messages, and then a custom Trigger could decide when to purge each window based on the ticks and a timeout duration. The system would never emit a real Watermark. My questions are: - Am I mistaken about either of these, or are there any other options I'm not seeing for avoiding 1-element windows? - For option 2, are there any problems with not emitting actual watermarks, as long as the windows are eventually purged by a trigger? Thanks, Mike
