Re: Implementing a “join” between a DataStream and a “set of rules”

Aljoscha Krettek Tue, 05 Jun 2018 09:05:48 -0700

Hi,

you might be interested in this newly-introduced feature: 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/state/broadcast_state.html
 
<https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/state/broadcast_state.html>


Best,
Aljoscha

> On 5. Jun 2018, at 14:46, Amit Jain <aj201...@gmail.com> wrote:
> 
> Hi Sandybayev,
> 
> In the current state, Flink does not provide a solution to the
> mentioned use case. However, there is open FLIP[1] [2] which has been
> created to address the same.
> 
> I can see in your current approach, you are not able to update the
> rule set data. I think you can update rule set data by building
> DataStream around changelogs which are stored in message
> queue/distributed file system.
> OR
> You can store rule set data in the external system where you can query
> for incoming keys from Flink.
> 
> [1]: 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
> [2]: https://issues.apache.org/jira/browse/FLINK-6131
> 
> On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta)
> <turar.sandyba...@coxautoinc.com> wrote:
>> Hi,
>> 
>> 
>> 
>> What is the best practice recommendation for the following use case? We need
>> to match a stream against a set of “rules”, which are essentially a Flink
>> DataSet concept. Updates to this “rules set" are possible but not frequent.
>> Each stream event must be checked against all the records in “rules set”,
>> and each match produces one or more events into a sink. Number of records in
>> a rule set are in the 6 digit range.
>> 
>> 
>> 
>> Currently we're simply loading rules into a local List of rules and using
>> flatMap over an incoming DataStream. Inside flatMap, we're just iterating
>> over a list comparing each event to each rule.
>> 
>> 
>> 
>> To speed up the iteration, we can also split the list into several batches,
>> essentially creating a list of lists, and creating a separate thread to
>> iterate over each sub-list (using Futures in either Java or Scala).
>> 
>> 
>> 
>> Questions:
>> 
>> 1.            Is there a better way to do this kind of a join?
>> 
>> 2.            If not, is it safe to add additional parallelism by creating
>> new threads inside each flatMap operation, on top of what Flink is already
>> doing?
>> 
>> 
>> 
>> Thanks in advance!
>> 
>> Turar
>> 
>>

Re: Implementing a “join” between a DataStream and a “set of rules”

Reply via email to