Hi Aljoscha, Thank you, this seems like a match for this use case. Am I understanding correctly that since only MemoryStateBackend is available for broadcast state, the max amount possible is 5MB?
If I use Flink state mechanism for storing rules, I will still need to iterate through all rules inside of a flatMap, and there’s no higher-level join mechanism that I can employ, right? Is there any downside in trying to parallelize that iteration inside my user flatMap operation? Thanks Turar From: Aljoscha Krettek <aljos...@apache.org> Date: Tuesday, June 5, 2018 at 12:05 PM To: Amit Jain <aj201...@gmail.com> Cc: "Sandybayev, Turar (CAI - Atlanta)" <turar.sandyba...@coxautoinc.com>, "user@flink.apache.org" <user@flink.apache.org> Subject: Re: Implementing a “join” between a DataStream and a “set of rules” Hi, you might be interested in this newly-introduced feature: https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/state/broadcast_state.html Best, Aljoscha On 5. Jun 2018, at 14:46, Amit Jain <aj201...@gmail.com<mailto:aj201...@gmail.com>> wrote: Hi Sandybayev, In the current state, Flink does not provide a solution to the mentioned use case. However, there is open FLIP[1] [2] which has been created to address the same. I can see in your current approach, you are not able to update the rule set data. I think you can update rule set data by building DataStream around changelogs which are stored in message queue/distributed file system. OR You can store rule set data in the external system where you can query for incoming keys from Flink. [1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API [2]: https://issues.apache.org/jira/browse/FLINK-6131 On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta) <turar.sandyba...@coxautoinc.com<mailto:turar.sandyba...@coxautoinc.com>> wrote: Hi, What is the best practice recommendation for the following use case? We need to match a stream against a set of “rules”, which are essentially a Flink DataSet concept. Updates to this “rules set" are possible but not frequent. Each stream event must be checked against all the records in “rules set”, and each match produces one or more events into a sink. Number of records in a rule set are in the 6 digit range. Currently we're simply loading rules into a local List of rules and using flatMap over an incoming DataStream. Inside flatMap, we're just iterating over a list comparing each event to each rule. To speed up the iteration, we can also split the list into several batches, essentially creating a list of lists, and creating a separate thread to iterate over each sub-list (using Futures in either Java or Scala). Questions: 1. Is there a better way to do this kind of a join? 2. If not, is it safe to add additional parallelism by creating new threads inside each flatMap operation, on top of what Flink is already doing? Thanks in advance! Turar