Hello beautiful Flink people, during the past few days, Andra and I have been discussing about how to extend Gelly's iteration methods.
Alexander's course (and his awesome students) has made it obvious that vertex-centric iterations are not the best fit for algorithms which don't follow the common "propagate-update" pattern. For example, Andra is working on an implementation of Minimum Spanning Tree, which requires branching inside an iteration and also requires a convergence check of an internal iteration. Others also reported similar issues [1, 2]. Trying to fit such algorithms to the vertex-centric model leads to long and ugly code, e.g. aggregators to keep track of algorithm phases, duplicating data, etc. One limitation of the vertex-centric and the upcoming GAS model is that they both only allow the vertex values to be updated in each iteration. However, for some algorithms we need to update the edge values and in others we need to update both. In even more complex situations (like Andra's MST) in some iterations we need to update the vertex values and in some iterations we need to update the edge values. Another problem is that we currently don't have a way to allow different computational phases inside an iteration. This is something that Giraph solves with master compute, a function that is executed once before each superstep and sets the computation function. All that said, I believe that we can solve most of these issues if we nicely expose Flink's iteration operators in Gelly. I can see the following cases: 1. Bulk & delta iterations where the solution set is the vertex dataset: this will be similar to vertex-centric and GAS, but will allow more flexible dataflows inside the iteration. 2. Bulk & delta iterations where the solution set is the edge dataset: for the cases where we need to update edge values. 3. Bulk & delta iterations where the solution set is the Graph: this will cover more complex cases, where the algorithm updates both vertices and edges or even adds/removes vertices/edges, i.e. updates the whole Graph. What do you think? I can see 1 & 2 being very easy to implement, but I suspect 3 won't be that easy (but so awesome to have ^^). Would it work the way a Graph is represented now, i.e. with 2 DataSets? Any comment, idea, pointer would be much appreciated! Thank you ^^ Cheers, -V. [1]: http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html [2]: http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769