Hi, 1) Yes, that is correct. If you set the parallelism of an operator to 1 it is only executed on a single node. It depends on your application, if you need a global state or whether multiple local states are OK. 2) Flink programs follow the concept a data flow. There is no communication between parallel instances of a task, i.e., all four tasks of a MapOperator with parallelism 4 cannot talk to each other. You might want to take a look at Flink's iteration operators. With these you can feed data back into a previous operator [1]. 4) Yes, that should work.
[1] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/iterations.html 2016-06-09 15:01 GMT+02:00 Ravikumar Hawaldar <ravikumar.hawal...@gmail.com> : > Hi Fabian, Thank you for your answers, > > 1) If there is only single instance of that function, then it will defeat > the purpose of distributed correct me if I am wrong, so If I run > parallelism with 1 on cluster does that mean it will execute on only one > node? > > 2) I mean to say, when a map operator returns a variable, is there any > other function which takes that updated variable and returns that to all > instances of map? > > 3) Question Cleared. > > 4) My question was can I use same ExecutionEnvironment for all flink > programs in a module. > > 5) Question Cleared. > > > Regards > Ravikumar > > > > On 9 June 2016 at 17:58, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Ravikumar, >> >> I'll try to answer your questions: >> 1) If you set the parallelism of a map function to 1, there will be only >> a single instance of that function regardless whether it is execution >> locally or remotely in a cluster. >> 2) Flink does also support aggregations, (reduce, groupReduce, combine, >> ...). However, I do not see how this would help with a stateful map >> function. >> 3) In Flink DataSet programs you usually construct the complete program >> and call execute() after you have defined your sinks. There are two >> exceptions: print() and collect() which both add special sinks and >> immediately execute your program. print() prints the result to the stdout >> of the submitting client and collect() fetches a dataset as collection. >> 4) I am not sure I understood your question. When you obtain an >> ExecutionEnvironment with ExecutionEnvironment.getExecutionEnvrionment() >> the type of the returned environment depends on the context in which the >> program was executed. It can be a local environment if it is executed from >> within an IDE or a RemodeExecutionEnvironment if the program is executed >> via the CLI client and shipped to a remote cluster. >> 5) A map operator processes records one after the other, i.e., as a >> sequence. If you need a certain order, you can call DataSet.sortPartition() >> to locally sort the partition. >> >> Hope that helps, >> Fabian >> >> 2016-06-09 12:23 GMT+02:00 Ravikumar Hawaldar < >> ravikumar.hawal...@gmail.com>: >> >>> Hi Till, Thank you for your answer, I have couple of questions >>> >>> 1) Setting parallelism on a single map function in local is fine but on >>> distributed will it work as local execution? >>> >>> 2) Is there any other way apart from setting parallelism? Like spark >>> aggregate function? >>> >>> 3) Is it necessary that after transformations to call execute function? >>> Or Execution starts as soon as it encounters a action (Similar to Spark)? >>> >>> 4) Can I create a global execution environment (Either local or >>> distributed) for different Flink program in a module? >>> >>> 5) How to make the records come in sequence for a map or any other >>> operator? >>> >>> >>> Regards, >>> Ravikumar >>> >>> >>> On 8 June 2016 at 21:14, Till Rohrmann <trohrm...@apache.org> wrote: >>> >>>> Hi Ravikumar, >>>> >>>> Flink's operators are stateful. So you can simply create a variable in >>>> your mapper to keep the state around. But every mapper instance will have >>>> it's own state. This state is determined by the records which are sent to >>>> this mapper instance. If you need a global state, then you have to set the >>>> parallelism to 1. >>>> >>>> Cheers, >>>> Till >>>> >>>> On Wed, Jun 8, 2016 at 5:08 PM, Ravikumar Hawaldar < >>>> ravikumar.hawal...@gmail.com> wrote: >>>> >>>>> Hello, >>>>> >>>>> I have an DataSet<UserDefinedType> which is roughly a record in a >>>>> DataSet Or a file. >>>>> >>>>> Now I am using map transformation on this DataSet to compute a >>>>> variable (coefficients of linear regression parameters and data structure >>>>> used is a double[]). >>>>> >>>>> Now the issue is that, per record the variable will get updated and I >>>>> am struggling to maintain state of this variable for the next record. >>>>> >>>>> In simple, for first record the variable values will be 0.0, and after >>>>> first record the variable will get updated and I have to pass this updated >>>>> variable for the second record and so on for all records in DataSet. >>>>> >>>>> Any suggestions on how to maintain state of a variable? >>>>> >>>>> >>>>> Regards, >>>>> Ravikumar >>>>> >>>> >>>> >>> >> >