Hi Govindarajan, I've put some answers in-line below..
On Sat, Sep 24, 2016 at 7:32 PM, Govindarajan Srinivasaraghavan < govindragh...@gmail.com> wrote: > Hi, > > I'm working on apache flink for data streaming and I have few questions. > Any help is greatly appreciated. Thanks. > > 1) Are there any restrictions on creating tumbling windows. For example, > if I want to create a tumbling window per user id for 2 secs and let’s say > if I have more than 10 million user id's would that be a problem. (I'm > using keyBy user id and then creating a timeWindow for 2 secs)? How are > these windows maintained internally in flink? > That should not be a problem in general. An important question may be how many unique keys will you see in two seconds. This is more important than your total key cardinality of 10 Million and probably a *much* smaller number unless your input message rate is really high. > > 2) I looked at rebalance for round robin partitioning. Let’s say I have a > cluster set up and if I have a parallelism of 1 for source and if I do a > rebalance, will my data be shuffled across machines to improve performance? > If so is there a specific port using which the data is transferred to other > nodes in the cluster? > Yes, rebalance() does a round-robin distribution of messages to other machines in the cluster. There is not a specific port used for each TaskManager to communicate on but rather an available port is assigned at runtime. This is the default. You can also set this to a specific port if you have reason and a lot depends on how you will deploy -- via YARN or as a standalone Flink cluster. > > 3) Are there any limitations on state maintenance? I'm planning to > maintain some user id related data which could grow very large. I read > about flink using rocks db to maintain the state. Just wanted to check if > there are any limitations on how much data can be maintained? > Yes, there are limits. The total data that can be maintained today is determined by the fact that Flink has to periodically snapshot this data and copy it to a persistent storage system such as HDFS whether you are using RocksDB or not. The aggregate bandwidth required to your storage system (like HDFS) is your total Flink state size multiplied by your Flink checkpoint interval. > 4) Also where is the state maintained if the amount of data is less? (I > guess in JVM memory) If I have several machines on my cluster can every > node get the current state version? > I'm not exactly sure what you're asking here. All data is check-pointed to a persistent store which must be accessible from each machine in the cluster. > 5) I need a way to send external configuration changes to flink. Lets say > there is a new parameter that has to added or an external change which has > to be updated inside flink's state, how can this be done? > The typical way to do this is to consume that configuration as a stream and hold the configuration internally in the state of a particular user function. > > Thanks > -- Jamie Grier data Artisans, Director of Applications Engineering @jamiegrier <https://twitter.com/jamiegrier> ja...@data-artisans.com