You can take a look at the ALS implementation. There I did something similar.
On Wed, Aug 12, 2015 at 10:27 AM, Sachin Goel <sachingoel0...@gmail.com> wrote: > Since the random splits need to be done on any data set a user provides, I > think making a persistent source would be the best solution then. > > > -- Sachin Goel > Computer Science, IIT Delhi > m. +91-9871457685 > > On Wed, Aug 12, 2015 at 1:37 PM, Till Rohrmann <till.rohrm...@gmail.com> > wrote: > > > One branch does not occupy a single slot. A slot is usually shared by > > operators from multiple branches. Only subtasks of the same operator > cannot > > be placed into the same slot. Thus, it's not an argument against it. > > > > Most if not all input formats assign the input splits on a first comes > > first serve basis with precedence for local computations. That is the > > reason for the non-deterministic behaviour you're observing. If one of > the > > TMs is slower than in a previous run, then it might happen that he gets > > different splits assigned. I guess, though, that you can write your own > > input format which assigns the input splits deterministically (e.g. based > > on the subtask index). But then you will probably sacrifice some of the > > local computations. > > > > Cheers, > > Till > > > > On Wed, Aug 12, 2015 at 9:54 AM, Sachin Goel <sachingoel0...@gmail.com> > > wrote: > > > > > Hi Till > > > Thanks for the reply. > > > If you think about it however, having several diverging computational > > paths > > > from an intermediate point will probably require re-computation anyway, > > in > > > case the number of these paths is even higher than the slots available. > > > Could that be an argument against a possible implementation? > > > Making the output of the non-deterministic step persistent seems costly > > > however. Is there any way to ensure that the data source is partitioned > > > across the different slots in exactly the same way every-time? > > > For example, I am using a {{generateSequence}} call, and the internal > > > iterator, namely the NumberSequenceIterator seems deterministic in its > > > operation, at least as far as how the elements are grouped together. > But > > > surprisingly, I observed different splits now and then. > > > > > > Regards > > > Sachin > > > > > > -- Sachin Goel > > > Computer Science, IIT Delhi > > > m. +91-9871457685 > > > > > > On Wed, Aug 12, 2015 at 12:58 PM, Till Rohrmann <trohrm...@apache.org> > > > wrote: > > > > > > > At the moment, Flink does not support the calculation of intermediate > > > > results from which you can continue your computation. When you > execute > > > jobs > > > > which share parts of its job graph, then they are recomputed. When > your > > > job > > > > contains operators with non-deterministic output, then there is no > > > > guarantee that the shared job graph parts produce the same results. > > > > > > > > What you can do is to execute the two jobs in parallel so that they > > share > > > > the input of the non-deterministic operator. Alternatively, you can > > > persist > > > > the data set after your non-deterministic operator by writing it > > manually > > > > to disc and reading it from there. > > > > > > > > Cheers, > > > > Till > > > > > > > > On Wed, Aug 12, 2015 at 1:34 AM, Sachin Goel < > sachingoel0...@gmail.com > > > > > > > wrote: > > > > > > > > > I'm writing a utility to split a data set randomly into several > parts > > > and > > > > > return an Array of data sets. However, whenever I operate on any of > > > > > these *subsets, > > > > > *the program basically start from the original data set, and the > > split > > > is > > > > > performed again. > > > > > > > > > > To ensure that these subsets are mutually exclusive, we need to > > > generate > > > > > the exact same sequence of random numbers, but also to ensure that > > the > > > > > elements arrive in a filter job in exactly the same order. How do I > > > > achieve > > > > > this? > > > > > Here's the code I've written: > > > > > https://github.com/apache/flink/pull/921/files > > > > > > > > > > Regards > > > > > Sachin > > > > > > > > > > -- Sachin Goel > > > > > Computer Science, IIT Delhi > > > > > m. +91-9871457685 > > > > > > > > > > > > > > >