Hello Dmitriy,

If I understood correctly, what you are basically talking about is modifying a
DataSet as you iterate over it.

AFAIK this is currently not possible in Flink, and indeed it's a real
bottleneck for ML algorithms. This is the reason our current
SGD implementation does a pass over the whole dataset at each iteration,
since we cannot take a sample from the dataset
and iterate only over that (so it's not really stochastic).
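
To make that concrete, here is a rough sketch of how such a loop has to be
expressed with the DataSet bulk iteration API today. The toy data, the squared
loss, and the learning rate are made up for illustration; this is not the
actual FlinkML SGD code:

    import org.apache.flink.api.scala._

    // Hypothetical weight vector type, just for this sketch.
    case class Weights(values: Array[Double])

    object FullPassGradientDescent {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Toy data: (features, label). In FlinkML this would come from a real source.
        val data: DataSet[(Array[Double], Double)] = env.fromElements(
          (Array(1.0, 2.0), 1.0),
          (Array(0.5, -1.0), 0.0))

        val initialWeights = env.fromElements(Weights(Array(0.0, 0.0)))

        // Bulk iteration: every superstep runs a full pass over `data`,
        // because we cannot materialize a sample and iterate over only that.
        val trained = initialWeights.iterate(100) { weights =>
          val gradient = data
            .crossWithTiny(weights) // pair every point with the current weights
            .map { case ((x, y), w) =>
              val err = x.zip(w.values).map { case (xi, wi) => xi * wi }.sum - y
              Weights(x.map(_ * err)) // per-point gradient for squared loss
            }
            .reduce((a, b) => Weights(a.values.zip(b.values).map { case (u, v) => u + v }))

          // Gradient step with a fixed, made-up learning rate.
          gradient.crossWithTiny(weights).map { case (g, w) =>
            Weights(w.values.zip(g.values).map { case (wi, gi) => wi - 0.1 * gi })
          }
        }

        trained.print()
      }
    }

A convergence check would use iterateWithTermination instead of a fixed
iteration count, but the point stands: every superstep re-reads the full
input, and there is no equivalent of keeping a sampled or updated subset
cached between passes the way Spark's in-memory caching allows.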

The relevant JIRA is here: https://issues.apache.org/jira/browse/FLINK-2396

I would love to start a discussion on how we can proceed to fix this.

Regards,
Theodore

On Tue, Mar 22, 2016 at 9:56 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Hi,
>
> probably more of a question for Till:
>
> Imagine a common ML algorithm flow that runs until convergence.
>
> A typical distributed flow would be something like this (e.g. GMM EM would be
> exactly like that):
>
> A: input
>
> do {
>
>    stat1 = A.map.reduce
>    A = A.update-map(stat1)
>    conv = A.map.reduce
> } until conv > convThreshold
>
> There could probably be one map-reduce step originating on A to compute both
> the convergence criteria statistics and the update statistics in one step,
> but that's not the point.
>
> The point is that update and map.reduce originate on the same dataset
> intermittently.
>
> In Spark we would normally commit A to an object tree cache so that the data
> is available to subsequent map passes without any I/O or serialization
> operations, thus ensuring a high rate of iterations.
>
> We observe the same pattern pretty much everywhere: clustering,
> probabilistic algorithms, even batch gradient descent or quasi-Newton
> algorithm fitting.
>
> How do we do something like that, for example, in FlinkML?
>
> Thoughts?
>
> thanks.
>
> -Dmitriy
>
