Re: DataSetUtils zipWithIndex question

2016-03-31 Thread Flavio Pompermaier
Ok, thanks for the clarification Till! On Thu, Mar 31, 2016 at 2:14 PM, Till Rohrmann wrote: > A partition is the portion of data each task receives. Thus, the degree of > parallelism of your program/task decides how many different partitions you > have. Depending on the upstream operators (and

Re: DataSetUtils zipWithIndex question

2016-03-31 Thread Till Rohrmann
A partition is the portion of data each task receives. Thus, the degree of parallelism of your program/task decides how many different partitions you have. Depending on the upstream operators (and which data is send to which task), the partitions will most likely differ in size. Cheers, Till On T

Re: DataSetUtils zipWithIndex question

2016-03-31 Thread Flavio Pompermaier
Hi Till and Tarandeep, I'm also interested in better understanding my knowledge about the concept of a partition.. >From what I know a partition is the portion of data assigned by the job manager to each task manager..right? Then, each partition is divided again at the task manager to maximize the

Re: DataSetUtils zipWithIndex question

2016-03-31 Thread Till Rohrmann
Hi Tarandeep, the number of elements in each partition should stay constant. In fact the elements in each partition should not change. Cheers, Till On Wed, Mar 30, 2016 at 8:14 AM, Tarandeep Singh wrote: > Hi, > > I am looking at implementation of zipWithIndex in DataSetUtils- > > https://gith

DataSetUtils zipWithIndex question

2016-03-29 Thread Tarandeep Singh
Hi, I am looking at implementation of zipWithIndex in DataSetUtils- https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java It works in two phases/steps 1) Count number of elements in each partition (using mapPartition) 2) In second m