I'm understanding this better with your explanation. In this use case, each element in DS1 has to look up against a 'bunch of keys' from DS2, and DS2 could shrink/expand in terms of the number of keys... will the key-value shard work in this case?
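For concreteness, here is a minimal sketch of the keyed-state wiring Fabian describes below: both streams are keyed the same way, DS2 feeds the keyed state, and DS1 elements look it up. It assumes the Flink 1.1-era API from the linked docs; the class names, the inlined sample elements, and the "cache" state name are illustrative placeholders, not from the thread.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class KeyedLookupJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // In the real job these would be Kafka sources; fromElements just
        // keeps the sketch self-contained and runnable.
        DataStream<Tuple2<String, String>> ds1 =
                env.fromElements(Tuple2.of("k1", "event-a"), Tuple2.of("k2", "event-b"));
        DataStream<Tuple2<String, String>> ds2 =
                env.fromElements(Tuple2.of("k1", "cached-value"));

        ds1.keyBy(0)                      // partition the main stream by key
           .connect(ds2.keyBy(0))         // same partitioning for the update stream
           .flatMap(new LookupFunction())
           .print();

        env.execute("keyed lookup sketch");
    }

    // Each parallel instance holds only its shard of the key-value state.
    public static class LookupFunction extends
            RichCoFlatMapFunction<Tuple2<String, String>, Tuple2<String, String>, String> {

        private transient ValueState<String> cached;

        @Override
        public void open(Configuration conf) {
            cached = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("cache", String.class, null));
        }

        @Override
        public void flatMap1(Tuple2<String, String> event, Collector<String> out)
                throws Exception {
            String v = cached.value();    // local shard lookup, no network hop
            if (v != null) {
                out.collect(event.f0 + ": " + event.f1 + " -> " + v);
            }
        }

        @Override
        public void flatMap2(Tuple2<String, String> update, Collector<String> out)
                throws Exception {
            cached.update(update.f1);     // DS2 adds/changes keys; the shard grows or shrinks with it
        }
    }
}

Because both streams share the same keyBy, each parallel instance only ever sees its own key partition, so DS2 adding or dropping keys simply grows or shrinks that instance's shard. The sketch covers the single-key lookup Fabian mentions; if one DS1 element must check several keys, it would have to be fanned out into one record per lookup key before the keyBy.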
On Wed, Sep 7, 2016 at 7:44 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> Operator state is always local in Flink. However, with key-value state,
> you can have something which behaves kind of like a distributed hashmap,
> because each operator holds a different shard/partition of the hashtable.
>
> If you have to do only a single key lookup for each element of DS1, you
> should think about partitioning both streams (keyBy) and writing the
> state into Flink's key-value state [1].
>
> This will have several benefits:
> 1) State does not need to be replicated.
> 2) Depending on the backend (RocksDB) [2], parts of the state can reside
> on disk. You are not bound to the memory of the JVM.
> 3) Flink takes care of the look-up. No need to have your own hashmap.
> 4) It will only be possible to rescale jobs that use key-value state
> (this feature is currently under development).
>
> If using the key-value state is possible, I'd go for that.
>
> Best, Fabian
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state_backends.html
>
> 2016-09-07 19:55 GMT+02:00 Chakravarthy varaga <chakravarth...@gmail.com>:
>
>> Certainly, that's what I thought as well...
>> The output of DataStream2 could be in the 1000s, and there are state
>> updates... Reading this topic from the other job, job1, is okay.
>> However, assuming that we maintain this state in a collection, and
>> update the state (by reading from the topic) in this collection, will
>> this be replicated across the cluster within job1?
>>
>> On Wed, Sep 7, 2016 at 6:33 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>
>>> Is writing DataStream2 to a Kafka topic and reading it from the other
>>> job an option?
>>>
>>> 2016-09-07 19:07 GMT+02:00 Chakravarthy varaga <chakravarth...@gmail.com>:
>>>
>>>> Hi Fabian,
>>>>
>>>> Thanks for your response. These DataStreams (Job1-DataStream1 &
>>>> Job2-DataStream2) are from different Flink applications running
>>>> within the same cluster.
>>>> DataStream2 (from Job2) applies transformations and updates a
>>>> 'cache' that Job1 needs to work on.
>>>> Our intention is to avoid an external key/value store, as we are
>>>> trying to keep the cache local to the cluster.
>>>> Is there a way?
>>>>
>>>> Best Regards
>>>> CVP
>>>>
>>>> On Wed, Sep 7, 2016 at 5:00 PM, Fabian Hueske <fhue...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Flink does not provide shared state.
>>>>> However, you can broadcast a stream to a CoFlatMapFunction, such
>>>>> that each operator has its own local copy of the state.
>>>>>
>>>>> If that does not work for you because the state is too large, and
>>>>> if it is possible to partition the state (and both streams), you
>>>>> can also use keyBy instead of broadcast.
>>>>>
>>>>> Finally, you can use an external system like a key-value store or
>>>>> an in-memory store like Apache Ignite to hold your distributed
>>>>> collection.
>>>>>
>>>>> Best, Fabian
>>>>>
>>>>> 2016-09-07 17:49 GMT+02:00 Chakravarthy varaga <chakravarth...@gmail.com>:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> Can someone help me here? Appreciate any response!
>>>>>>
>>>>>> Best Regards
>>>>>> Varaga
>>>>>>
>>>>>> On Mon, Sep 5, 2016 at 4:51 PM, Chakravarthy varaga <
>>>>>> chakravarth...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> I'm working on a Flink streaming application. The data is
>>>>>>> injected through Kafka connectors. The payload volume is roughly
>>>>>>> 100K/sec, and the event payload is a string. Let's call this
>>>>>>> DataStream1.
>>>>>>> This application also uses another DataStream, call it
>>>>>>> DataStream2 (consuming events off a Kafka topic). The elements of
>>>>>>> DataStream2 feed into a transformation that finally updates a
>>>>>>> HashMap (a java.util collection). The Flink application needs to
>>>>>>> share this HashMap across the Flink cluster so that the
>>>>>>> DataStream1 pipeline can check the state of the values in this
>>>>>>> collection. Is there a way to do this in Flink?
>>>>>>>
>>>>>>> I don't see any shared collection available within the cluster.
>>>>>>>
>>>>>>> Best Regards
>>>>>>> CVP
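For contrast, a minimal sketch of the broadcast alternative Fabian mentions in the chain above, where every parallel instance receives all DS2 updates and keeps a full local copy of the map. Again, the class name, variable names, and inlined sample data are assumptions for illustration only.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class BroadcastLookupJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka sources in the real job; inlined elements keep the sketch runnable.
        DataStream<String> ds1 = env.fromElements("k1", "k2");
        DataStream<Tuple2<String, String>> ds2 =
                env.fromElements(Tuple2.of("k1", "cached-value"));

        ds1.connect(ds2.broadcast())  // broadcast(): every instance sees all DS2 updates
           .flatMap(new CoFlatMapFunction<String, Tuple2<String, String>, String>() {

               // Plain JVM map: one full copy per parallel instance, so this
               // only works while the whole map fits in each task's heap.
               private final Map<String, String> cache = new HashMap<>();

               @Override
               public void flatMap1(String key, Collector<String> out) {
                   String v = cache.get(key);
                   if (v != null) {
                       out.collect(key + " -> " + v);
                   }
               }

               @Override
               public void flatMap2(Tuple2<String, String> update, Collector<String> out) {
                   cache.put(update.f0, update.f1);
               }
           })
           .print();

        env.execute("broadcast lookup sketch");
    }
}

Note the trade-offs: the plain HashMap here is neither partitioned nor checkpointed, and every instance holds the entire map in heap, which is why the keyed-state route is the better fit once the key set grows or needs to survive failures.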