Hi, probably more of a question for Till:
Imagine a common ML algorithm flow that runs until convergence. A typical distributed flow would look something like this (e.g. GMM EM would be exactly like that):

    A = input
    do {
        stat1 = A.map.reduce
        A = A.update-map(stat1)
        conv = A.map.reduce
    } until conv < convThreshold

There could probably be a single map-reduce step originating on A that computes both the convergence statistic and the update statistic in one pass, but that's not the point. The point is that the update and the map.reduce passes originate on the same dataset intermittently.

In Spark we would normally commit A to an object tree cache so that the data is available to subsequent map passes without any I/O or serialization operations, thus ensuring a high rate of iterations.

We observe the same pattern pretty much everywhere: clustering, probabilistic algorithms, even batch gradient descent or quasi-Newton fitting.

How do we do something like that, for example, in FlinkML? Thoughts?

thanks.
-Dmitriy
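For concreteness, here is a minimal single-machine sketch of that loop in plain Python, with an ordinary list standing in for the cached dataset A. The names (fit_mean, conv_threshold, lr) and the choice of batch gradient descent on a 1-D mean are purely illustrative, not from any particular framework:

```python
from functools import reduce

def fit_mean(points, lr=0.5, conv_threshold=1e-6, max_iters=100):
    """Batch gradient descent on squared loss for a 1-D mean.

    `points` plays the role of the dataset A that a framework would
    keep cached in memory between passes; each loop iteration makes
    a full pass over it, mirroring the map/reduce flow above.
    """
    theta = 0.0
    for _ in range(max_iters):
        # "stat1 = A.map.reduce": per-point gradient contributions, summed.
        grad = reduce(lambda a, b: a + b, ((theta - x) for x in points)) / len(points)
        # "A.update-map(stat1)": here only the model changes, not the data.
        new_theta = theta - lr * grad
        # "conv = A.map.reduce": convergence statistic (parameter change).
        conv = abs(new_theta - theta)
        theta = new_theta
        if conv < conv_threshold:
            break
    return theta

print(fit_mean([1.0, 2.0, 3.0]))
```

In a distributed setting each of those passes over `points` would be a full scan of A, which is exactly why keeping A pinned in memory between iterations matters so much for the iteration rate.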