You can take a look at the ALS implementation. There I did something
similar.

On Wed, Aug 12, 2015 at 10:27 AM, Sachin Goel <sachingoel0...@gmail.com>
wrote:

> Since the random splits need to be done on any data set a user provides, I
> think making a persistent source would be the best solution then.
>
>
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>
> On Wed, Aug 12, 2015 at 1:37 PM, Till Rohrmann <till.rohrm...@gmail.com>
> wrote:
>
> > One branch does not occupy a single slot. A slot is usually shared by
> > operators from multiple branches. Only subtasks of the same operator
> cannot
> > be placed into the same slot. Thus, it's not an argument against it.
> >
> > Most if not all input formats assign the input splits on a first comes
> > first serve basis with precedence for local computations. That is the
> > reason for the non-deterministic behaviour you're observing. If one of
> the
> > TMs is slower than in a previous run, then it might happen that he gets
> > different splits assigned. I guess, though, that you can write your own
> > input format which assigns the input splits deterministically (e.g. based
> > on the subtask index). But then you will probably sacrifice some of the
> > local computations.
> >
> > Cheers,
> > Till
> >
> > On Wed, Aug 12, 2015 at 9:54 AM, Sachin Goel <sachingoel0...@gmail.com>
> > wrote:
> >
> > > Hi Till
> > > Thanks for the reply.
> > > If you think about it however, having several diverging computational
> > paths
> > > from an intermediate point will probably require re-computation anyway,
> > in
> > > case the number of these paths is even higher than the slots available.
> > > Could that be an argument against a possible implementation?
> > > Making the output of the non-deterministic step persistent seems costly
> > > however. Is there any way to ensure that the data source is partitioned
> > > across the different slots in exactly the same way every-time?
> > > For example, I am using a {{generateSequence}} call, and the internal
> > > iterator, namely the NumberSequenceIterator seems deterministic in its
> > > operation, at least as far as how the elements are grouped together.
> But
> > > surprisingly, I observed different splits now and then.
> > >
> > > Regards
> > > Sachin
> > >
> > > -- Sachin Goel
> > > Computer Science, IIT Delhi
> > > m. +91-9871457685
> > >
> > > On Wed, Aug 12, 2015 at 12:58 PM, Till Rohrmann <trohrm...@apache.org>
> > > wrote:
> > >
> > > > At the moment, Flink does not support the calculation of intermediate
> > > > results from which you can continue your computation. When you
> execute
> > > jobs
> > > > which share parts of its job graph, then they are recomputed. When
> your
> > > job
> > > > contains operators with non-deterministic output, then there is no
> > > > guarantee that the shared job graph parts produce the same results.
> > > >
> > > > What you can do is to execute the two jobs in parallel so that they
> > share
> > > > the input of the non-deterministic operator. Alternatively, you can
> > > persist
> > > > the data set after your non-deterministic operator by writing it
> > manually
> > > > to disc and reading it from there.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Wed, Aug 12, 2015 at 1:34 AM, Sachin Goel <
> sachingoel0...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > I'm writing a utility to split a data set randomly into several
> parts
> > > and
> > > > > return an Array of data sets. However, whenever I operate on any of
> > > > > these *subsets,
> > > > > *the program basically start from the original data set, and the
> > split
> > > is
> > > > > performed again.
> > > > >
> > > > > To ensure that these subsets are mutually exclusive, we need to
> > > generate
> > > > > the exact same sequence of random numbers, but also to ensure that
> > the
> > > > > elements arrive in a filter job in exactly the same order. How do I
> > > > achieve
> > > > > this?
> > > > > Here's the code I've written:
> > > > > https://github.com/apache/flink/pull/921/files
> > > > >
> > > > > Regards
> > > > > Sachin
> > > > >
> > > > > -- Sachin Goel
> > > > > Computer Science, IIT Delhi
> > > > > m. +91-9871457685
> > > > >
> > > >
> > >
> >
>

Reply via email to