On 2018/09/13 03:30:28, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Hi Bhaskar,
>
> > On 2018/09/12 20:42:22, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> >> Hi Bhaskar,
> >>
> >> I assume you don’t have 1000 streams, but rather one (keyed) stream with
> >> 1000 different key values, yes?
> >>
> >> If so, then this one stream is physically partitioned based on the
> >> parallelism of the operator following the keyBy(), not per unique key.
> >>
> >> The most common per-key “resource” is the memory required for each key's
> >> state, if you’ve got any operations that need to maintain state
> >> (accumulators, windows, etc).
> >>
> >> For 1000 unique keys, this should be negligible.
> >>
> >> — Ken
> >>
> >>
> >>> On Sep 12, 2018, at 9:55 AM, bhaskar.eba...@gmail.com
> >>> <mailto:bhaskar.eba...@gmail.com> wrote:
> >>>
> >>> Hi
> >>>
> >>> I have created a KeyedStream with state as explained below
> >>> For example i have created 1000 streams, out of which 50% of streams
> >>> data is going to come once in 8 hours. Will the resources of these under
> >>> utilized streams are idle for that duration? Or Flink internal task
> >>> manager is having some strategy to utilize them for other new streams
> >>> that are coming?
> >>> Regards
> >>> Bhaskar
> >>
> > Hi Ken
> > As per documentation it is showing:
> > https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/
> > On DataStream if we apply KeyBy then output is KeyedStream.
>
> Correct.
>
> > Once its stream means it should execute in parallel right?
>
> It will operate in parallel based on the parallelism of the downstream
> operations being applied to the KeyedStream.
>
> Number of unique keys has nothing to do with number of parallel
> (simultaneous) operators being used to process the KeyedStream.
>
> > There will be 1000 streams each is having Keyed State.
>
> “Stream” has a specific meaning in Flink, which I think you’re not using as
> intended here.
>
> > What you are saying is the main over head here is only memory. That means
> > Does these 1000 streams are going to be run across 1000 task slots in
> > parallel? These 1000 task slots is the main memory over head? Even 50% of
> > them idle its not harm?
>
> See above - you don’t have 1000 “task slots” and you don’t have 1000 stream.
>
> You have N operators running at the same time, where N is based on the
> parallelism that you set (either implicitly, or explicitly) for the
> operator(s) processing the KeyedStream.
>
> Note that If you have 1000 unique keys, and you’ve got (for example) a single
> ValueState per key, then you’d have 1000 states.
>
> But if you have say a sliding window, then the number of states per key can
> grow significantly, since each key can have multiple states (one per each
> “open window”).
>
> But also note that using RocksDB to handle state means that not all state has
> to be in memory at the same time, so you’ve got more room to scale.
>
>
> Regards,
>
> — Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>
>
Thanks Ken for the detailed clarification!
Regards
Bhaskar