Operator state is always local in Flink. However, with key-value state, you can have something that behaves similarly to a distributed hashmap, because each parallel operator instance holds a different shard/partition of the hashtable.
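To make this concrete, here is a rough, untested sketch of the keyed variant with a RichCoFlatMapFunction. The names (ds1, ds2, extractKey) and the event/update formats are placeholders you would adapt, and the exact state API signatures vary a bit between Flink versions:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    public static DataStream<String> lookupWithKeyedState(
            DataStream<String> ds1,                      // event stream, ~100K/sec
            DataStream<Tuple2<String, String>> ds2) {    // (key, value) cache updates

        return ds1
            .connect(ds2)
            // partition both streams by the lookup key; each parallel instance
            // then owns one shard of the key-value state
            .keyBy(e -> extractKey(e), u -> u.f0)
            .flatMap(new RichCoFlatMapFunction<String, Tuple2<String, String>, String>() {

                private transient ValueState<String> cache;

                @Override
                public void open(Configuration parameters) {
                    cache = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("cache", String.class, null));
                }

                @Override
                public void flatMap1(String event, Collector<String> out) throws Exception {
                    // DS1 side: single key lookup against the local state shard
                    String v = cache.value();
                    out.collect(v != null ? event + "|" + v : event);
                }

                @Override
                public void flatMap2(Tuple2<String, String> update, Collector<String> out) throws Exception {
                    // DS2 side: update the state for this key
                    cache.update(update.f1);
                }
            });
    }

    // hypothetical key extractor; adapt to your event format
    private static String extractKey(String event) {
        return event.split(",")[0];
    }

Because both streams go through the same keyBy, lookups and updates for the same key are guaranteed to land on the same parallel instance.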
If you only have to do a single key lookup for each element of DS1, you should think about partitioning both streams (keyBy) and writing the state into Flink's key-value state [1], as in the sketch above. This has several benefits:

1) The state does not need to be replicated.
2) Depending on the state backend (RocksDB) [2], parts of the state can reside on disk, so you are not bound to the memory of the JVM.
3) Flink takes care of the lookup; there is no need to maintain your own hashmap.
4) Rescaling jobs will only be possible for key-value state (this feature is currently under development).

If using the key-value state is possible, I'd go for that.

Best, Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state_backends.html

2016-09-07 19:55 GMT+02:00 Chakravarthy varaga <chakravarth...@gmail.com>:

> Certainly, that is what I thought as well...
> The output of DataStream2 could be in the 1000s, and there are state
> updates. Reading this topic from the other job, Job1, is fine.
> However, assuming that we maintain this state in a collection and update
> it (by reading from the topic), will this collection be replicated across
> the cluster within Job1?
>
> On Wed, Sep 7, 2016 at 6:33 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Is writing DataStream2 to a Kafka topic and reading it from the other
>> job an option?
>>
>> 2016-09-07 19:07 GMT+02:00 Chakravarthy varaga <chakravarth...@gmail.com>:
>>
>>> Hi Fabian,
>>>
>>> Thanks for your response. These DataStreams (Job1-DataStream1 &
>>> Job2-DataStream2) are from different Flink applications running within
>>> the same cluster.
>>> DataStream2 (from Job2) applies transformations and updates a 'cache'
>>> on which Job1 needs to work.
>>> Our intention is to avoid an external key/value store, as we are trying
>>> to keep the cache local to the cluster.
>>> Is there a way?
>>>
>>> Best Regards
>>> CVP
>>>
>>> On Wed, Sep 7, 2016 at 5:00 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Flink does not provide shared state.
>>>> However, you can broadcast a stream to a CoFlatMapFunction, such that
>>>> each operator instance has its own local copy of the state.
>>>>
>>>> If that does not work for you because the state is too large, and if
>>>> it is possible to partition the state (and both streams), you can also
>>>> use keyBy instead of broadcast.
>>>>
>>>> Finally, you can use an external system like a key-value store or an
>>>> in-memory store like Apache Ignite to hold your distributed collection.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2016-09-07 17:49 GMT+02:00 Chakravarthy varaga <chakravarth...@gmail.com>:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> Can someone help me here? Appreciate any response!
>>>>>
>>>>> Best Regards
>>>>> Varaga
>>>>>
>>>>> On Mon, Sep 5, 2016 at 4:51 PM, Chakravarthy varaga <chakravarth...@gmail.com> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> I'm working on a Flink streaming application. The data is injected
>>>>>> through Kafka connectors, the payload volume is roughly 100K/sec,
>>>>>> and the event payload is a string. Let's call this DataStream1.
>>>>>> The application also uses another DataStream, call it DataStream2,
>>>>>> which consumes events off a Kafka topic. The elements of DataStream2
>>>>>> go through a transformation that finally updates a HashMap (a
>>>>>> java.util collection).
>>>>>> This HashMap needs to be shared across the Flink cluster so that
>>>>>> the DataStream1 application can check the state of the values in
>>>>>> this collection. Is there a way to do this in Flink?
>>>>>>
>>>>>> I don't see any shared collection available within the cluster.
>>>>>>
>>>>>> Best Regards
>>>>>> CVP
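PS: For reference, the broadcast variant mentioned earlier in the thread would look roughly like the following (again an untested sketch with placeholder names). Every parallel instance receives all updates and keeps its own full copy of the map, so this only works if the map fits comfortably into each TaskManager's heap, and note that this plain HashMap is not checkpointed by Flink:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
    import org.apache.flink.util.Collector;

    public static DataStream<String> lookupWithBroadcastMap(
            DataStream<String> ds1,                      // event stream
            DataStream<Tuple2<String, String>> ds2) {    // (key, value) cache updates

        return ds1
            .connect(ds2.broadcast())                    // every instance sees all updates
            .flatMap(new CoFlatMapFunction<String, Tuple2<String, String>, String>() {

                // each parallel instance holds its own local copy of the "shared" map
                private final Map<String, String> cache = new HashMap<>();

                @Override
                public void flatMap1(String event, Collector<String> out) {
                    String v = cache.get(event.split(",")[0]);   // placeholder key extraction
                    out.collect(v != null ? event + "|" + v : event);
                }

                @Override
                public void flatMap2(Tuple2<String, String> update, Collector<String> out) {
                    cache.put(update.f0, update.f1);             // apply the broadcast update
                }
            });
    }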