Re: In-Memory Lookup in Flink Operators

Chirag Dewan Sun, 30 Sep 2018 01:48:47 -0700

 Thanks Lasse, that is rightly put. That's the only solution I can think of too.
Only thing which I can't get my head around is using the coMap and coFlatMap 
functions with such a stream. Since they dont support side outputs, is there a 
way my lookup map/flatmap function simply consume a stream? 
Ken, thats an interesting solution actually. Is there any chance you need to 
update the memory-loaded data too? 
Thanks,
Chirag
   On Sunday, 30 September, 2018, 5:17:51 AM IST, Ken Krugler 
<kkrugler_li...@transpac.com> wrote: 
 
 Hi Lasse,
One approach I’ve used in a similar situation is to have a “UnionedSource” 
wrapper that first emits the (bounded) data that will be loaded in-memory, and 
then starts running the source that emits the continuous stream of data.
This outputs an Either<A, B>, which I then split, and broadcast the A, and 
key/partition the B.
You could do something similar, but occasionally keep checking if there’s more 
<A> data vs. assuming it’s bounded.
The main issue I ran into is that it doesn’t seem possible to do checkpointing, 
or at least I couldn’t think of a way to make this work properly.
— Ken



On Sep 27, 2018, at 9:50 PM, Lasse Nedergaard <lassenederga...@gmail.com> wrote:

Hi. 
We have created our own database source that pools the data with a configured 
interval. We then use a co processed function. It takes to input one from our 
database and one from our data input. I require that you keyby with the 
attributes you use lookup in your map function. To delay your data input until 
your database lookup is done first time is not simple but a simple solution 
could be to implement a delay operation or keep the data in your process 
function until data arrive from your database stream. 

Med venlig hilsen / Best regardsLasse Nedergaard

Den 28. sep. 2018 kl. 06.28 skrev Chirag Dewan <chirag.dewa...@yahoo.in>:


Hi,
I saw Apache Flink User Mailing List archive. - static/dynamic lookups in flink 
streaming being discussed, and then I saw this FLIP 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API.
 
I know we havent made much progress on this topic. I still wanted to put 
forward my problem statement around this. 
I am also looking for a dynamic lookup in Flink operators. I actually want to 
pre-fetch various Data Sources, like DB, Filesystem, Cassandra etc. into 
memory. Along with that, I have to ensure a refresh of in-memory lookup table 
periodically. The period being a configurable parameter. 
This is what a map operator would look like with lookup: 
-> Load in-memory lookup - Refresh timer start-> Stream processing start-> Call 
lookup-> Use lookup result in Stream processing
-> Timer elapsed -> Reload lookup data source into in-memory table-> Continue 
processing

 My concern around these are : 
1) Possibly storing the same copy of data in every Task slots memory or state 
backend(RocksDB in my case).2) Having a dedicated refresh thread for each 
subtask instance(possibly, every Task Manager having multiple refresh thread)
Am i thinking in the right direction? Or missing something very obvious? It 
confusing.
Any leads are much appreciated. Thanks in advance.
Cheers, Chirag

--------------------------Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra

Re: In-Memory Lookup in Flink Operators

Reply via email to