Greg, We ran into this Issue when implementing the Mahout bindings for Flink [1]. It ended up being the major bottleneck for Mahout on Flink, and makes iterative algorithms basically unreasonable. While it is understook that that Flink's Delta-iterations are intended for use when iterating over Flink DataSet, they are not always suitable to the task. Eg. FlinkML's ALS.scala [2][3] which must flush intermediate results to the file system.
In-memory caching is a must have for any type of declarative ML DSL riding on top of fink. Eg, SystemML, Mahout, etc, or an internal Flink DSL. The Mahout bindings are Linear Algebra based. In close collaboration with the Flink community [4] and based on a contribution from an intern hired by Data Artisans to complete this task and other members of the community, we recently finished up work on the Mahout Distributed Linear Algebra bindings. Unfortunately the lack of an in-memory cache made the outcome of this effort very sub-optimal. After the unfinished hand-off of the bindings from the Flink community to the Mahout community, we were forced to find a workaround for the caching. We used the template of flushing and executing and flushing results to the File system [5], and had to release the bindings as "Experimental". Anyone building a declarative ML interface using the Flink Dataset API as a backend will run into similar issues as has been reported on the stackoverflow thread u refer to and it would be great to have this feature. Its great to see this being talked about. Andy [1] http://mahout.apache.org/users/flinkbindings/flink-internals.html [2] https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala#L481 [3] https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/common/FlinkMLTools.scala#L84 [4]https://issues.apache.org/jira/browse/MAHOUT-1570 [5]https://github.com/apache/mahout/blob/master/flink/src/main/scala/org/apache/mahout/flinkbindings/drm/CheckpointedFlinkDrm.scala#L139 ________________________________________ From: Greg Hogan <c...@greghogan.com> Sent: Thursday, May 26, 2016 4:41:53 PM To: dev@flink.apache.org Subject: Iteration Intermediate Output Hi y'all, I think this is an oft-requested feature [0] and there are many graph algorithms for which intermediate output is the desired result. I'd like to take Stephan up on his offer [1] for pointers. I have yet to get in deep, but I see that iteration tasks are treated specially as IterationIntermediateTask for synchronization between supersteps. Also, when OperatorTranslation and GraphCreatingVisitor are walking the program DAG an iteration must be first reached through the tail. Greg [0] http://stackoverflow.com/questions/37224140/possibility-of-saving-partial-outputs-from-bulk-iteration-in-flink-dataset [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Intermediate-output-during-delta-iterations-td436.html