Hi Amit,

In my current approach, the idea for updating rule set data was to have some kind of a "control" stream that triggers an update to a local data structure, or a "control" event within the main data stream that triggers the same.
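For what it's worth, the control-event variant can be sketched outside Flink with plain Java (all class names and the "ruleId:substring" rule encoding here are hypothetical, just to show the trigger-and-update shape):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Sketch: rules live in a local map; "control" events replace entries in it,
// all other events are matched against the current contents of the map.
public class ControlStreamSketch {
    // rule id -> predicate over the event payload
    private final Map<String, Predicate<String>> rules = new ConcurrentHashMap<>();

    // Returns the ids of rules the event matched; control events update
    // local state instead of producing matches.
    public List<String> onEvent(String type, String payload) {
        if ("RULE_UPDATE".equals(type)) {
            // "ruleId:substring" is a stand-in for a real rule codec
            String[] parts = payload.split(":", 2);
            rules.put(parts[0], s -> s.contains(parts[1]));
            return List.of();
        }
        return rules.entrySet().stream()
                .filter(e -> e.getValue().test(payload))
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

In Flink the same shape would come from a connected control stream feeding the state that the main stream reads, but the trigger-then-match structure is the same.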
Using an external system like a cache or database is also an option, but that would still require some kind of a trigger to reload the rule set, or a single rule, whenever it is updated. Others have suggested using Flink managed state, but I'm still not sure whether that is a generally recommended approach in this scenario, as it seems to be meant more for windowing-type processing?

Thanks,
Turar

On 6/5/18, 8:46 AM, "Amit Jain" <aj201...@gmail.com> wrote:

Hi Sandybayev,

In its current state, Flink does not provide a solution to the mentioned use case. However, there is an open FLIP [1][2] which has been created to address it.

I can see that in your current approach you are not able to update the rule set data. I think you can update the rule set data by building a DataStream around changelogs which are stored in a message queue or distributed file system. Alternatively, you can store the rule set data in an external system which you can query for incoming keys from Flink.

[1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
[2]: https://issues.apache.org/jira/browse/FLINK-6131

On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta) <turar.sandyba...@coxautoinc.com> wrote:
> Hi,
>
> What is the best practice recommendation for the following use case? We need
> to match a stream against a set of “rules”, which are essentially a Flink
> DataSet concept. Updates to this “rule set” are possible but not frequent.
> Each stream event must be checked against all the records in the “rule set”,
> and each match produces one or more events into a sink. The number of records
> in a rule set is in the six-digit range.
>
> Currently we're simply loading the rules into a local List and using flatMap
> over the incoming DataStream. Inside flatMap, we're just iterating over the
> list, comparing each event to each rule.
> To speed up the iteration, we can also split the list into several batches,
> essentially creating a list of lists, and create a separate thread to
> iterate over each sub-list (using Futures in either Java or Scala).
>
> Questions:
>
> 1. Is there a better way to do this kind of a join?
>
> 2. If not, is it safe to add additional parallelism by creating new threads
> inside each flatMap operation, on top of what Flink is already doing?
>
> Thanks in advance!
> Turar
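The "list of lists" idea from the question above can be sketched with plain JDK Futures, independent of Flink (the class name, the predicate-based rule representation, and the fixed-size pool are all assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Predicate;

// Sketch: partition the rule list into sub-lists and match one event
// against each sub-list concurrently on a shared fixed-size pool.
public class BatchedMatcher {
    private final List<List<Predicate<String>>> partitions = new ArrayList<>();
    private final ExecutorService pool;

    public BatchedMatcher(List<Predicate<String>> rules, int nPartitions) {
        this.pool = Executors.newFixedThreadPool(nPartitions);
        int size = (rules.size() + nPartitions - 1) / nPartitions; // ceil
        for (int i = 0; i < rules.size(); i += size) {
            partitions.add(rules.subList(i, Math.min(i + size, rules.size())));
        }
    }

    // Counts matching rules across all partitions in parallel.
    public long matchCount(String event) {
        List<CompletableFuture<Long>> futures = new ArrayList<>();
        for (List<Predicate<String>> part : partitions) {
            futures.add(CompletableFuture.supplyAsync(
                    () -> part.stream().filter(r -> r.test(event)).count(), pool));
        }
        return futures.stream().mapToLong(CompletableFuture::join).sum();
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

Note that inside a Flink flatMap this pool would compete with the task slots Flink already schedules, which is exactly the concern raised in question 2.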