Hi,

probably more of a question for Till:

Imagine a common ML algorithm flow that runs until convergence.

A typical distributed flow would be something like this (e.g. EM for GMMs
follows exactly this shape):

A: input

do {

   stat1 = A.map.reduce
   A = A.update-map(stat1)
   conv = A.map.reduce
} while (conv > convThreshold)
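To make the shape concrete, here is a minimal runnable sketch of that loop, using a plain local array in place of the distributed dataset A; the particular statistics (the mean as stat1, total point movement as the convergence measure) are illustrative stand-ins, not part of any real algorithm:

```java
import java.util.Arrays;

public class IterateUntilConvergence {
    // Repeatedly computes a statistic over A, updates A with it, and
    // checks a convergence statistic -- mirroring the pseudocode loop.
    static double[] runUntilConvergence(double[] a, double convThreshold) {
        double conv = Double.MAX_VALUE;
        while (conv > convThreshold) {
            // stat1 = A.map.reduce  (here: the mean, a stand-in statistic)
            final double stat1 = Arrays.stream(a).sum() / a.length;
            // A = A.update-map(stat1)  (here: move each point halfway to the mean)
            double[] updated = Arrays.stream(a).map(x -> (x + stat1) / 2.0).toArray();
            // conv = A.map.reduce  (here: total movement of the points)
            conv = 0.0;
            for (int i = 0; i < a.length; i++) conv += Math.abs(updated[i] - a[i]);
            a = updated;
        }
        return a;
    }

    public static void main(String[] args) {
        double[] result = runUntilConvergence(new double[]{4.0, 9.0, 16.0}, 1e-6);
        System.out.println(Arrays.toString(result));
    }
}
```

In the distributed setting each `map`/`reduce` above would be a full pass over A, which is exactly why keeping A cheap to re-scan matters.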

The convergence statistics and the update statistics could probably be
computed in a single map-reduce pass over A, but that's not the point.

The point is that the update and the map.reduce passes repeatedly originate
on the same dataset.

In Spark we would normally cache A as deserialized objects in memory, so that
the data is available to subsequent map passes without any I/O or
serialization, thus ensuring a high rate of iterations.

We observe the same pattern pretty much everywhere: clustering,
probabilistic algorithms, even batch gradient descent or quasi-Newton
fitting.
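Batch gradient descent is perhaps the cleanest instance of the pattern: each iteration is one "map.reduce" over the data (the gradient), followed by a parameter update. A minimal runnable sketch, again with a local array standing in for the distributed dataset and a toy 1-D objective (mean squared error, whose minimizer is just the mean):

```java
public class BatchGradientDescent {
    // Same iterate-until-convergence shape as above: one full pass over
    // the data per iteration, then a cheap parameter update.
    static double fitMean(double[] xs, double lr, double tol) {
        double theta = 0.0;
        double step = Double.MAX_VALUE;
        while (Math.abs(step) > tol) {
            // gradient of mean squared error: map each point, reduce by sum
            double grad = 0.0;
            for (double x : xs) grad += 2.0 * (theta - x) / xs.length;
            step = -lr * grad;
            theta += step;
        }
        return theta;
    }

    public static void main(String[] args) {
        // converges to the mean of the data
        System.out.println(fitMean(new double[]{1.0, 2.0, 3.0, 6.0}, 0.1, 1e-9));
    }
}
```

Every iteration re-scans the full dataset, so whether A sits in memory or has to be re-read and deserialized dominates the iteration rate.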

How do we do something like that, for example, in FlinkML?

Thoughts?

thanks.

-Dmitriy
