Hello Dmitriy,

If I understood correctly, what you are basically talking about is modifying a DataSet as you iterate over it.
AFAIK this is currently not possible in Flink, and indeed it's a real bottleneck for ML algorithms. This is the reason our current SGD implementation does a pass over the whole dataset at each iteration: we cannot take a sample from the dataset and iterate only over that, so it's not really stochastic.

The relevant JIRA is here: https://issues.apache.org/jira/browse/FLINK-2396

I would love to start a discussion on how we can proceed to fix this.

Regards,
Theodore

On Tue, Mar 22, 2016 at 9:56 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> Hi,
>
> Probably more of a question for Till:
>
> Imagine a common ML algorithm flow that runs until convergence.
>
> A typical distributed flow would be something like this (e.g. GMM EM would
> be exactly like that):
>
> A: input
>
> do {
>   stat1 = A.map.reduce
>   A = A.update-map(stat1)
>   conv = A.map.reduce
> } until conv > convThreshold
>
> There could probably be a single map-reduce step originating on A that
> computes both the convergence statistics and the update statistics in one
> step, but that's not the point.
>
> The point is that the update and the map.reduce originate on the same
> dataset intermittently.
>
> In Spark we would normally commit A to an object tree cache so that the
> data is available to subsequent map passes without any I/O or
> serialization, thus ensuring a high rate of iterations.
>
> We observe the same pattern pretty much everywhere: clustering,
> probabilistic algorithms, even batch gradient descent or quasi-Newton
> algorithm fitting.
>
> How do we do something like that, for example, in FlinkML?
>
> Thoughts?
>
> thanks.
>
> -Dmitriy
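
For concreteness, the closest existing tool in the DataSet API for a loop of
this shape is the bulk iteration operator (iterate / iterateWithTermination).
Below is a minimal Scala sketch, not taken from FlinkML, of how the pseudocode
above might be mapped onto it; Point, the update rule, and the per-pass
statistic are placeholder assumptions. The caveat discussed above still
applies: the full DataSet flows through every superstep, and there is no way
to cache a sample and iterate over only that.

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._

case class Point(features: Vector[Double])

def runLoop(input: DataSet[Point],
            maxIterations: Int,
            convThreshold: Double): DataSet[Point] = {

  input.iterateWithTermination(maxIterations) { current =>
    // stat1 = A.map.reduce -- a global statistic over the current state
    val stat1: DataSet[Double] =
      current.map(p => p.features.sum).reduce(_ + _)

    // A = A.update-map(stat1) -- update every element using the broadcast statistic
    val updated: DataSet[Point] = current
      .map(new RichMapFunction[Point, Point] {
        override def map(p: Point): Point = {
          val s = getRuntimeContext
            .getBroadcastVariable[Double]("stat1")
            .get(0)
          Point(p.features.map(_ / s)) // placeholder update rule
        }
      })
      .withBroadcastSet(stat1, "stat1")

    // conv = A.map.reduce -- the iteration continues as long as this DataSet
    // is non-empty, i.e. while conv <= convThreshold, mirroring the
    // "until conv > convThreshold" loop above
    val criterion: DataSet[Double] =
      updated.map(p => p.features.sum).reduce(_ + _).filter(_ <= convThreshold)

    (updated, criterion)
  }
}

The termination criterion DataSet is what drives the loop: Flink stops as soon
as it becomes empty (or maxIterations is reached), so the "until" condition is
expressed inside the dataflow rather than as driver-side control flow. What it
cannot express is keeping A materialized in memory between supersteps the way
a Spark-style cache would.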