Hi, so here is what I have done...

1- I load my CSV using the CSV table source.
2- I set up a Kafka stream to read my incoming events.
3- I map my events to a POJO.
4- I join the 2 tables.
5- I push the joined result to Elasticsearch.

This works absolutely fine. So what's the difference between this and the proposed solutions above?
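For reference, here is roughly what those five steps look like on Flink 1.8's Table API. This is only a sketch: the CSV path and schema, the Kafka topic, and the naive phone-number parsing are illustrative assumptions, and the Elasticsearch sink is stubbed out with print().

    import java.util.Properties;

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.java.StreamTableEnvironment;
    import org.apache.flink.table.sources.CsvTableSource;
    import org.apache.flink.types.Row;

    public class GeoEnrichmentJob {

        // 3- the POJO the Kafka events are mapped to
        public static class PhoneEvent {
            public String phoneNumber;
            public String areaCode;
            public PhoneEvent() {}
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env); // 1.8 style

            // 1- load the CSV using a CSV table source
            CsvTableSource areaCodes = CsvTableSource.builder()
                    .path("/data/area_codes.csv")      // assumed path and schema
                    .field("areaCode", Types.STRING)
                    .field("city", Types.STRING)
                    .field("lat", Types.DOUBLE)
                    .field("lon", Types.DOUBLE)
                    .build();
            tEnv.registerTableSource("AreaCodes", areaCodes);

            // 2- Kafka stream of incoming events
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "phone-geo");
            DataStream<PhoneEvent> events = env
                    .addSource(new FlinkKafkaConsumer<>("phone-events", new SimpleStringSchema(), props))
                    .map(new MapFunction<String, PhoneEvent>() {
                        @Override
                        public PhoneEvent map(String number) {
                            PhoneEvent e = new PhoneEvent();
                            e.phoneNumber = number;
                            e.areaCode = number.substring(0, 3); // naive: assumes 10-digit numbers
                            return e;
                        }
                    });
            tEnv.registerDataStream("Events", events, "phoneNumber, areaCode");

            // 4- join the two tables on the area code
            Table joined = tEnv.sqlQuery(
                    "SELECT e.phoneNumber, a.city, a.lat, a.lon "
                    + "FROM Events e JOIN AreaCodes a ON e.areaCode = a.areaCode");

            // 5- convert back to a stream; the real job sinks this to Elasticsearch
            tEnv.toAppendStream(joined, Row.class).print();

            env.execute("phone-geo-enrichment");
        }
    }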
On Mon, 30 Sep 2019 at 13:35, John Smith <java.dev....@gmail.com> wrote:

> Ok thanks. It's basically telephone area codes, they barely ever change.
>
> On Mon, 30 Sep 2019 at 06:03, Gaël Renoux <gael.ren...@datadome.co> wrote:
>
>> Hi John,
>>
>> I've had a similar requirement, and I resorted to simply using a static cache (I'm coding in Scala, so that's a lazy value on a singleton object - in Java that would be a static value on some utility class, with a synchronized lazy-loading getter). The value is reloaded after some duration, which adds a small latency at regular intervals. Keep in mind that one instance of that value will be loaded on each task manager (provided that at least one task running on that task manager calls the getter).
>>
>> If you're OK with restarting the job when your data changes, it would be better to load it on start (no need to synchronize anything). Just load it inside your job initialization code (it will be executed within the job manager) and pass that data as a parameter to your operator's constructor. The data format must be serializable.
>>
>> Gaël
>>
>> On Sat, Sep 28, 2019 at 2:18 AM Sameer Wadkar <sam...@axiomine.com> wrote:
>>
>>> The main consideration in these types of scenarios is not the type of source function you use. The key point is how the event operator gets the slow-moving master data and caches it, and then recovers it if it fails and restarts.
>>>
>>> It does not matter that the CSV file does not change often. It is possible that the event operator may fail and restart, and the CSV data needs to be made available to it again.
>>>
>>> In that scenario the initial suggestion I made, to pass the CSV data in the constructor, is not adequate by itself. You need to store it in operator state, which allows the operator to recover it when it restarts on failure.
>>>
>>> As long as the above takes place you have resiliency, and you can use any suitable method or source. I have not used the Table source as much, but connected streams and operator state have worked out for me in similar scenarios.
>>>
>>> Sameer
>>>
>>> Sent from my iPhone
>>>
>>> On Sep 27, 2019, at 4:38 PM, John Smith <java.dev....@gmail.com> wrote:
>>>
>>> It's a fairly small static file that may update once in a blue moon lol. But I'm hoping to use existing functions. Why can't I just use the CSV table source?
>>>
>>> Why should I have to now either write my own CSV parser or look for a 3rd-party one, and then what, put it in a Java Map and look up that map? I'm finding Flink to be a bit of death by 1000 paper cuts lol
>>>
>>> If I put the CSV in a table, I can then use it to join across it with the event, no?
>>>
>>> On Fri, 27 Sep 2019 at 16:25, Sameer W <sam...@axiomine.com> wrote:
>>>
>>>> Connected Streams is one option, but it may be overkill in your scenario if your CSV does not refresh. If your CSV is small enough (number of records wise), you could parse it, load it into a (serializable) object, and pass it to the constructor of the operator where you will be streaming the data.
>>>>
>>>> If the CSV can be made available via a shared network folder (or S3 in the case of AWS), you could also read it in the open function (if you use the Rich versions of the operator).
>>>>
>>>> The real problem, I guess, is how frequently the CSV updates. If you want the updates to propagate in near real time (or on a schedule), option 1 (parse in the driver and send it via the constructor) does not work. Also, with the second option you are responsible for refreshing the file read from the shared folder.
>>>>
>>>> In that case, use Connected Streams, where the stream reading the file (while the other stream reads the events) periodically re-reads the file and sends it down the stream. The refresh interval is your tolerance for stale data in the CSV.
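A sketch of that connected-streams approach, for reference (Flink 1.8; the two-column CSV layout, the refresh interval, and the PhoneEvent POJO from the sketch at the top are all assumptions):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.util.Collector;

    public class CsvRefreshSketch {

        // Re-reads the CSV on a fixed interval and emits the whole snapshot downstream.
        public static class AreaCodeCsvSource implements SourceFunction<Map<String, String>> {
            private final String path;
            private final long refreshMillis;
            private volatile boolean running = true;

            public AreaCodeCsvSource(String path, long refreshMillis) {
                this.path = path;
                this.refreshMillis = refreshMillis;
            }

            @Override
            public void run(SourceContext<Map<String, String>> ctx) throws Exception {
                while (running) {
                    Map<String, String> snapshot = new HashMap<>();
                    for (String line : Files.readAllLines(Paths.get(path))) {
                        String[] cols = line.split(",");
                        snapshot.put(cols[0], cols[1]); // assumed layout: areaCode,city
                    }
                    ctx.collect(snapshot);
                    Thread.sleep(refreshMillis); // refresh interval = staleness tolerance
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }

        // Keeps the latest snapshot in a field and enriches events against it.
        // PhoneEvent is the POJO from the GeoEnrichmentJob sketch above.
        public static class EnrichFunction
                implements CoFlatMapFunction<GeoEnrichmentJob.PhoneEvent, Map<String, String>, String> {

            private Map<String, String> areaCodes = Collections.emptyMap();

            @Override
            public void flatMap1(GeoEnrichmentJob.PhoneEvent event, Collector<String> out) {
                String city = areaCodes.get(event.areaCode);
                if (city != null) {
                    out.collect(event.phoneNumber + " -> " + city);
                }
            }

            @Override
            public void flatMap2(Map<String, String> snapshot, Collector<String> out) {
                areaCodes = snapshot; // swap in the fresh CSV snapshot
            }
        }
    }

Wiring it up means broadcasting the reference stream so every parallel enricher sees each snapshot (hypothetical names again):

    events.connect(env.addSource(new CsvRefreshSketch.AreaCodeCsvSource("/data/area_codes.csv", 60_000)).broadcast())
          .flatMap(new CsvRefreshSketch.EnrichFunction());

Note that this keeps the snapshot in a plain transient field, so, as Sameer Wadkar points out above, it is lost on restart until the next re-read; operator state (or broadcast state, sketched further down) is what makes it survive failures.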
>>>> On Fri, Sep 27, 2019 at 3:49 PM John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>>> I don't think I need state for this...
>>>>>
>>>>> I need to load a CSV, I'm guessing as a table, and then filter my events, parse the number, transform the event into geolocation data, and sink that to the downstream data source.
>>>>>
>>>>> So I'm guessing I need a CSV source and my Kafka source, and somehow join those and transform the event...
>>>>>
>>>>> On Fri, 27 Sep 2019 at 14:43, Oytun Tez <oy...@motaword.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> You should look at the broadcast state pattern in the Flink docs.
>>>>>>
>>>>>> ---
>>>>>> Oytun Tez
>>>>>>
>>>>>> *M O T A W O R D*
>>>>>> The World's Fastest Human Translation Platform.
>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>
>>>>>> On Fri, Sep 27, 2019 at 2:42 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> Using 1.8
>>>>>>>
>>>>>>> I have a list of phone area codes, cities and their geo locations in a CSV file. And my events from Kafka contain phone numbers.
>>>>>>>
>>>>>>> I want to parse the phone number, get its area code, and then associate the phone number with a city and geo location, as well as count how many numbers are in that city/geo location.
>>
>> --
>> Gaël Renoux
>> Senior R&D Engineer, DataDome
>> M +33 6 76 89 16 52
>> E gael.ren...@datadome.co
>> W www.datadome.co
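For completeness, the broadcast state pattern Oytun mentions keeps the mapping in checkpointed state, so Flink restores it automatically on failure (available since 1.5, so usable on 1.8). Again only a sketch, assuming the PhoneEvent POJO from the first sketch and a hypothetical csvStream of (areaCode, city) Tuple2 records:

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    public class BroadcastStateSketch {

        // Descriptor for the broadcast state holding areaCode -> city.
        public static final MapStateDescriptor<String, String> AREA_CODES =
                new MapStateDescriptor<>(
                        "areaCodes",
                        BasicTypeInfo.STRING_TYPE_INFO,
                        BasicTypeInfo.STRING_TYPE_INFO);

        public static class EnrichFunction extends
                BroadcastProcessFunction<GeoEnrichmentJob.PhoneEvent, Tuple2<String, String>, String> {

            @Override
            public void processElement(GeoEnrichmentJob.PhoneEvent event, ReadOnlyContext ctx,
                    Collector<String> out) throws Exception {
                // Look up the event's area code in the (read-only) broadcast state.
                String city = ctx.getBroadcastState(AREA_CODES).get(event.areaCode);
                if (city != null) {
                    out.collect(event.phoneNumber + " -> " + city);
                }
            }

            @Override
            public void processBroadcastElement(Tuple2<String, String> entry, Context ctx,
                    Collector<String> out) throws Exception {
                // Each (areaCode, city) record updates the checkpointed broadcast state.
                ctx.getBroadcastState(AREA_CODES).put(entry.f0, entry.f1);
            }
        }
    }

    // Hypothetical wiring:
    // events.connect(csvStream.broadcast(BroadcastStateSketch.AREA_CODES))
    //       .process(new BroadcastStateSketch.EnrichFunction());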