Hi Lasse,

One approach I've used in a similar situation is to have a "UnionedSource" wrapper that first emits the (bounded) data that will be loaded in-memory, and then starts running the source that emits the continuous stream of data.

This outputs an Either<A, B>, which I then split, broadcasting the A records and keying/partitioning the B records. You could do something similar, but keep checking occasionally whether there's more <A> data, instead of assuming it's bounded.
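Very roughly sketched, with placeholder Tuple2 types standing in for the real A/B records and the actual reads stubbed out (untested, just to show the shape):

import java.util.Collections;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.types.Either;

public class UnionedSourceSketch {

    /** A = bounded lookup records (key, value); B = continuous events (key, payload). */
    public static class UnionedSource
            implements SourceFunction<Either<Tuple2<String, String>, Tuple2<String, String>>> {

        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Either<Tuple2<String, String>, Tuple2<String, String>>> ctx)
                throws Exception {
            // 1) Emit the bounded data first, wrapped as Left.
            for (Tuple2<String, String> lookupRecord : fetchAllLookupRecords()) {
                ctx.collect(Either.Left(lookupRecord));
            }
            // 2) Then run the continuous source, wrapping each record as Right.
            while (running) {
                ctx.collect(Either.Right(pollNextEvent()));
            }
        }

        @Override
        public void cancel() {
            running = false;
        }

        // Stand-ins for however the two underlying systems are actually read.
        private Iterable<Tuple2<String, String>> fetchAllLookupRecords() {
            return Collections.singletonList(Tuple2.of("key-1", "lookup-value"));
        }

        private Tuple2<String, String> pollNextEvent() throws InterruptedException {
            Thread.sleep(100);
            return Tuple2.of("key-1", "event-payload");
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Either<Tuple2<String, String>, Tuple2<String, String>>> unioned =
                env.addSource(new UnionedSource());

        // Broadcast the A side ...
        DataStream<Tuple2<String, String>> lookupSide = unioned
                .filter(Either::isLeft)
                .map(Either::left)
                .returns(Types.TUPLE(Types.STRING, Types.STRING))
                .broadcast();

        // ... and key/partition the B side on the lookup key.
        DataStream<Tuple2<String, String>> eventSide = unioned
                .filter(Either::isRight)
                .map(Either::right)
                .returns(Types.TUPLE(Types.STRING, Types.STRING))
                .keyBy(0);

        // In a real job you'd connect the two sides and do the enrichment;
        // print() here just makes the sketch runnable end to end.
        lookupSide.print();
        eventSide.print();

        env.execute("unioned-source-sketch");
    }
}

The keyBy(0) on the B side assumes f0 is the attribute you later use for the lookup.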
The main issue I ran into is that it doesn't seem possible to do checkpointing, or at least I couldn't think of a way to make this work properly.

— Ken

> On Sep 27, 2018, at 9:50 PM, Lasse Nedergaard <lassenederga...@gmail.com> wrote:
>
> Hi.
>
> We have created our own database source that polls the data at a configured interval. We then use a co-process function. It takes two inputs: one from our database and one from our data input. It requires that you keyBy on the attributes you use for the lookup in your map function.
> Delaying your data input until the first database lookup is done is not simple, but a simple solution could be to implement a delay operation, or to keep the data in your process function until data arrives from your database stream.
>
> Best regards
> Lasse Nedergaard
>
>
> On Sep 28, 2018, at 06:28, Chirag Dewan <chirag.dewa...@yahoo.in> wrote:
>
>> Hi,
>>
>> I saw "static/dynamic lookups in flink streaming" <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/static-dynamic-lookups-in-flink-streaming-td10726.html> being discussed in the mailing list archive, and then I saw this FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
>>
>> I know we haven't made much progress on this topic. I still wanted to put forward my problem statement around it.
>>
>> I am also looking for a dynamic lookup in Flink operators. I actually want to pre-fetch various data sources (DB, filesystem, Cassandra, etc.) into memory. Along with that, I have to ensure the in-memory lookup table is refreshed periodically, the period being a configurable parameter.
>>
>> This is what a map operator with a lookup would look like:
>>
>> -> Load in-memory lookup - start refresh timer
>> -> Stream processing starts
>> -> Call lookup
>> -> Use lookup result in stream processing
>> -> Timer elapses -> reload lookup data source into the in-memory table
>> -> Continue processing
>>
>> My concerns around this are:
>>
>> 1) Possibly storing the same copy of the data in every task slot's memory or state backend (RocksDB in my case).
>> 2) Having a dedicated refresh thread for each subtask instance (so every Task Manager could end up with multiple refresh threads).
>>
>> Am I thinking in the right direction? Or am I missing something very obvious? It's confusing.
>>
>> Any leads are much appreciated. Thanks in advance.
>>
>> Cheers,
>> Chirag

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
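P.S. For the buffering approach Lasse describes (holding on to main-stream records in the co-process function until the lookup data for that key has arrived), a rough, untested sketch could look like the following; all class, type, and field names are placeholders:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

/**
 * IN1 = event (key, payload), IN2 = lookup record (key, lookupValue),
 * OUT = enriched event (payload, lookupValue).
 * Events arriving before the lookup value for their key are buffered in keyed state.
 */
public class BufferingEnrichFunction
        extends CoProcessFunction<Tuple2<String, String>, Tuple2<String, String>, Tuple2<String, String>> {

    private transient ValueState<String> lookupValue;
    private transient ListState<String> bufferedPayloads;

    @Override
    public void open(Configuration parameters) {
        lookupValue = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lookupValue", String.class));
        bufferedPayloads = getRuntimeContext().getListState(
                new ListStateDescriptor<>("bufferedPayloads", String.class));
    }

    @Override
    public void processElement1(Tuple2<String, String> event, Context ctx,
                                Collector<Tuple2<String, String>> out) throws Exception {
        String lookup = lookupValue.value();
        if (lookup != null) {
            out.collect(Tuple2.of(event.f1, lookup));
        } else {
            // No lookup data for this key yet: hold the event in state.
            bufferedPayloads.add(event.f1);
        }
    }

    @Override
    public void processElement2(Tuple2<String, String> lookupRecord, Context ctx,
                                Collector<Tuple2<String, String>> out) throws Exception {
        lookupValue.update(lookupRecord.f1);
        // Flush anything that arrived before the lookup data did.
        Iterable<String> pending = bufferedPayloads.get();
        if (pending != null) {
            for (String payload : pending) {
                out.collect(Tuple2.of(payload, lookupRecord.f1));
            }
        }
        bufferedPayloads.clear();
    }
}

You would wire it up with something like events.keyBy(0).connect(dbStream.keyBy(0)).process(new BufferingEnrichFunction()), keying both streams by the lookup attributes, as Lasse says.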