Hi Lasse,

One approach I’ve used in a similar situation is a “UnionedSource” wrapper 
that first emits the (bounded) data that will be loaded into memory, and then 
starts running the source that emits the continuous stream of data.

This outputs an Either<A, B>, which I then split, broadcasting the A records 
and keying/partitioning the B records.
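
Roughly, here’s a sketch of what that wrapper looks like (the loader/poller 
interfaces are illustrative, not part of Flink):

import java.util.List;

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.types.Either;

public class UnionedSource<A, B> implements SourceFunction<Either<A, B>> {

    // Illustrative interfaces for the two sides; not part of Flink.
    public interface BoundedLoader<T> extends java.io.Serializable {
        List<T> loadAll() throws Exception;  // e.g. a full table scan
    }

    public interface StreamPoller<T> extends java.io.Serializable {
        T poll() throws Exception;           // blocks until the next record
    }

    private final BoundedLoader<A> boundedLoader;
    private final StreamPoller<B> streamPoller;
    private volatile boolean running = true;

    public UnionedSource(BoundedLoader<A> loader, StreamPoller<B> poller) {
        this.boundedLoader = loader;
        this.streamPoller = poller;
    }

    @Override
    public void run(SourceContext<Either<A, B>> ctx) throws Exception {
        // Phase 1: emit all of the bounded lookup data as Left records.
        for (A record : boundedLoader.loadAll()) {
            ctx.collect(Either.Left(record));
        }
        // Phase 2: switch to the continuous stream, emitted as Right records.
        while (running) {
            ctx.collect(Either.Right(streamPoller.poll()));
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

Running loadAll() to completion before the polling loop is what guarantees 
all of the bounded data goes out ahead of the streaming records.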

You could do something similar, but periodically check whether there’s more 
<A> data, rather than assuming it’s bounded.

The main issue I ran into is that checkpointing doesn’t seem possible with 
this approach, or at least I couldn’t find a way to make it work properly.

— Ken


> On Sep 27, 2018, at 9:50 PM, Lasse Nedergaard <lassenederga...@gmail.com> 
> wrote:
> 
> Hi. 
> 
> We have created our own database source that polls the data at a configured 
> interval. We then use a co-process function that takes two inputs: one from 
> our database and one from our data input. It requires that you keyBy on the 
> attributes you use for the lookup in your map function. 
> Delaying your data input until the first database lookup has completed is 
> not simple, but one solution could be to implement a delay operation, or to 
> keep the data in your process function until data arrives from your database 
> stream. 
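> 
> A rough sketch of that co-process function (the String payloads and names 
> are just illustration; both streams are keyed on the lookup attributes 
> before being connected):
> 
> import org.apache.flink.api.common.state.ListState;
> import org.apache.flink.api.common.state.ListStateDescriptor;
> import org.apache.flink.api.common.state.ValueState;
> import org.apache.flink.api.common.state.ValueStateDescriptor;
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
> import org.apache.flink.util.Collector;
> 
> // Input 1: rows polled from the database; input 2: the data to enrich.
> public class LookupEnricher extends CoProcessFunction<String, String, String> {
> 
>     private transient ValueState<String> lookup;   // latest DB value for this key
>     private transient ListState<String> buffered;  // events waiting for the DB value
> 
>     @Override
>     public void open(Configuration parameters) {
>         lookup = getRuntimeContext().getState(
>                 new ValueStateDescriptor<>("lookup", String.class));
>         buffered = getRuntimeContext().getListState(
>                 new ListStateDescriptor<>("buffered", String.class));
>     }
> 
>     // Database stream: update the lookup value, then flush buffered events.
>     @Override
>     public void processElement1(String dbValue, Context ctx,
>             Collector<String> out) throws Exception {
>         lookup.update(dbValue);
>         Iterable<String> pending = buffered.get();
>         if (pending != null) {
>             for (String event : pending) {
>                 out.collect(event + "/" + dbValue);
>             }
>         }
>         buffered.clear();
>     }
> 
>     // Data stream: enrich immediately if we have a DB value, otherwise buffer.
>     @Override
>     public void processElement2(String event, Context ctx,
>             Collector<String> out) throws Exception {
>         String dbValue = lookup.value();
>         if (dbValue == null) {
>             buffered.add(event);
>         } else {
>             out.collect(event + "/" + dbValue);
>         }
>     }
> }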
> 
> Med venlig hilsen / Best regards
> Lasse Nedergaard
> 
> 
> On Sep 28, 2018, at 6:28 AM, Chirag Dewan <chirag.dewa...@yahoo.in> wrote:
> 
>> Hi,
>> 
>> I saw “static/dynamic lookups in flink streaming” being discussed in the 
>> Apache Flink user mailing list archive 
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/static-dynamic-lookups-in-flink-streaming-td10726.html>, 
>> and then I saw FLIP-17 
>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API>.
>> 
>> I know we haven’t made much progress on this topic, but I still wanted to 
>> put forward my problem statement around it. 
>> 
>> I am also looking for dynamic lookups in Flink operators. I want to 
>> pre-fetch various data sources (DB, filesystem, Cassandra, etc.) into 
>> memory. Along with that, I have to refresh the in-memory lookup table 
>> periodically, the period being a configurable parameter. 
>> 
>> This is what a map operator with lookup would look like (see the sketch 
>> after this list): 
>> 
>> -> Load in-memory lookup - Refresh timer start
>> -> Stream processing start
>> -> Call lookup
>> -> Use lookup result in Stream processing
>> -> Timer elapsed -> Reload lookup data source into in-memory table
>> -> Continue processing
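>> 
>> A minimal sketch of that flow, assuming one refresh thread per subtask 
>> (loadFromDataSource() is a hypothetical loader):
>> 
>> import java.util.Map;
>> import java.util.concurrent.ConcurrentHashMap;
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.ScheduledExecutorService;
>> import java.util.concurrent.TimeUnit;
>> 
>> import org.apache.flink.api.common.functions.RichMapFunction;
>> import org.apache.flink.configuration.Configuration;
>> 
>> public class LookupMapFunction extends RichMapFunction<String, String> {
>> 
>>     private final long refreshPeriodSec;  // the configurable refresh period
>>     private transient volatile Map<String, String> lookupTable;
>>     private transient ScheduledExecutorService refresher;
>> 
>>     public LookupMapFunction(long refreshPeriodSec) {
>>         this.refreshPeriodSec = refreshPeriodSec;
>>     }
>> 
>>     @Override
>>     public void open(Configuration parameters) {
>>         lookupTable = loadFromDataSource();  // initial load, before processing
>>         refresher = Executors.newSingleThreadScheduledExecutor();
>>         refresher.scheduleAtFixedRate(
>>                 () -> lookupTable = loadFromDataSource(),  // swap in a fresh table
>>                 refreshPeriodSec, refreshPeriodSec, TimeUnit.SECONDS);
>>     }
>> 
>>     @Override
>>     public String map(String value) {
>>         String hit = lookupTable.get(value);  // call lookup during processing
>>         return hit != null ? hit : value;
>>     }
>> 
>>     @Override
>>     public void close() {
>>         if (refresher != null) {
>>             refresher.shutdownNow();
>>         }
>>     }
>> 
>>     // Hypothetical loader; in practice this would query the DB, filesystem,
>>     // Cassandra, etc.
>>     private Map<String, String> loadFromDataSource() {
>>         return new ConcurrentHashMap<>();
>>     }
>> }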
>> 
>> 
>> My concerns around this are: 
>> 
>> 1) Possibly storing the same copy of the data in every task slot’s memory 
>> or state backend (RocksDB, in my case).
>> 2) Having a dedicated refresh thread for each subtask instance (so every 
>> Task Manager could end up with multiple refresh threads).
>> 
>> Am I thinking in the right direction, or am I missing something very 
>> obvious? It’s confusing.
>> 
>> Any leads are much appreciated. Thanks in advance.
>> 
>> Cheers, 
>> Chirag

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
