Hi Rion,

I think FLIP-150 [1] should be able to solve this scenario.
Since FLIP-150 is still under discussion, a temporary workaround that comes to mind might be:

1. Write a first job that reads the Kafka topic and updates the broadcast state of some operator. The job would keep the source alive after all the data have been emitted (e.g. sleep forever), and once all the data have been processed, stop the job with a savepoint.
2. Use that savepoint to start the original job. The operator that requires the broadcast state could use the same uid and the same state name as the corresponding operator in the first job, so that it acquires the state contents on startup.

I have also appended rough sketches of the broadcast-state wiring and of the pre-load step you describe, below the quoted mails.

Best,
Yun

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-150%3A+Introduce+Hybrid+Source

------------------ Original Mail ------------------
Sender: Rion Williams <rionmons...@gmail.com>
Send Date: Mon May 17 07:00:03 2021
Recipients: user <user@flink.apache.org>
Subject: Re: Handling "Global" Updating State

Hey folks,

After digging into this a bit, it does seem like Broadcast State would fit the bill for this scenario and keep the downstream operators up to date as messages arrive in my Kafka topic.

My question is: is there a pattern for pre-populating the state initially? In my case, I need to have loaded all of my "lookup" topic into state before processing any records in the other stream. My initial thought is to do something like this, if it's possible:

- Create a KafkaConsumer on startup to read the lookup topic in its entirety into some collection like a hashmap (prior to executing the Flink pipeline, to ensure synchronicity)
- Use this to initialize the state of my broadcast stream (if possible)
- At that point the stream would be broadcasting any new records coming in, so I "should" stay up to date from then on

Is this an oversimplification, or is there an obviously better / well-known approach to handling this?

Thanks,
Rion

On May 14, 2021, at 9:51 AM, Rion Williams <rionmons...@gmail.com> wrote:

Hi all,

I've encountered a challenge within a Flink job that I'm currently working on. The gist of it is that I have a job that listens to a series of events from a Kafka topic and eventually sinks those down into Postgres via the JDBCSink.

A requirement recently came up to filter these events based on some configurations that are currently stored within another Kafka topic. I'm wondering what the best approach might be to handle this type of problem.

My initial naive approach was:

- When Flink starts up, use a regular Kafka consumer and read all of the configuration data from that topic in its entirety.
- Store the messages from that topic in some type of thread-safe collection statically accessible by the operators downstream.
- Expose the thread-safe collection within the operators to actually perform the filtering.

This doesn't seem right, though. I was reading about BroadcastState, which seems like it might fit the bill (e.g. keep those mappings in Broadcast state so that all of the downstream operations would have access to them, which I'd imagine would handle keeping things up to date).

Does Flink have a good pattern / construct to handle this? Basically, I have a series of mappings that I want to keep relatively up to date in a Kafka topic, and I'm consuming from another Kafka topic that will need those mappings to filter against.

I'd be happy to share some of the approaches I currently have, or elaborate a bit more if that isn't super clear.

Thanks much,
Rion
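
For reference, here is a rough sketch of the broadcast-state wiring discussed above. The Event and ConfigUpdate types, the getTenantId()/getKey()/getValue() accessors, the "config-state" and "config-filter" names, and the filtering rule are all placeholders; eventStream and configStream are assumed to be the DataStreams read from your two topics.

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Shared descriptor; the state name (and types) must match across jobs if the
// savepoint trick above is used to carry the broadcast state over.
MapStateDescriptor<String, String> configDescriptor =
        new MapStateDescriptor<>("config-state", Types.STRING, Types.STRING);

// configStream: DataStream<ConfigUpdate> read from the lookup/config topic
BroadcastStream<ConfigUpdate> configBroadcast = configStream.broadcast(configDescriptor);

// eventStream: DataStream<Event> read from the main topic
DataStream<Event> filtered = eventStream
        .connect(configBroadcast)
        .process(new BroadcastProcessFunction<Event, ConfigUpdate, Event>() {

            @Override
            public void processElement(Event event, ReadOnlyContext ctx, Collector<Event> out) throws Exception {
                ReadOnlyBroadcastState<String, String> config = ctx.getBroadcastState(configDescriptor);
                // Only forward events that have a matching mapping (illustrative rule only).
                if (config.contains(event.getTenantId())) {
                    out.collect(event);
                }
            }

            @Override
            public void processBroadcastElement(ConfigUpdate update, Context ctx, Collector<Event> out) throws Exception {
                // Every parallel instance receives the update and keeps its own copy of the mapping.
                ctx.getBroadcastState(configDescriptor).put(update.getKey(), update.getValue());
            }
        })
        .uid("config-filter"); // same uid in the "first job" and the real job for the savepoint approach

If you go the savepoint route, the first job would register the same MapStateDescriptor (same state name) and give its operator the same uid, so the second job can pick the state up on restore.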
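
And a rough sketch of the pre-load step from the second mail, i.e. draining the lookup topic with a plain KafkaConsumer before the pipeline is built. The topic name, bootstrap servers, group id, and String serialization are placeholders; it reads each partition from the beginning up to the end offsets observed when the consumer starts.

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

static Map<String, String> preloadLookup(String bootstrapServers, String topic) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "lookup-preload");   // placeholder group id
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

    Map<String, String> lookup = new HashMap<>();
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        // Assign every partition of the lookup topic and rewind to the beginning.
        List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                .map(p -> new TopicPartition(p.topic(), p.partition()))
                .collect(Collectors.toList());
        consumer.assign(partitions);
        consumer.seekToBeginning(partitions);
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

        // Keep polling until every partition has reached the end offset seen at startup.
        while (endOffsets.entrySet().stream()
                .anyMatch(e -> consumer.position(e.getKey()) < e.getValue())) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                lookup.put(record.key(), record.value());   // last value per key wins
            }
        }
    }
    return lookup;
}

The resulting map can then be handed to whatever needs it (e.g. a constructor argument) before env.execute() is called, and the broadcast stream keeps things up to date from then on. Note that broadcast state itself can only be written from processBroadcastElement, so such a pre-loaded map would act as a fallback until the first broadcast updates arrive.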