Hi, Thanks for that but Looks like it is already available https://github.com/vasia/gelly-streaming in streaming but I wonder why this is not part of Flink? there are no releases either.
Thanks! On Tue, Feb 18, 2020 at 9:13 AM Yun Gao <yungao...@aliyun.com> wrote: > Hi Kant, > > As far as I know, I think the current example connected > components implementation based on DataSet API could not be extended to > streaming data or incremental batch directly. > > From the algorithm's perspective, if the graph only add edge > and never remove edge, I think the connected components should be able to > be updated incrementally when the graph changes: When some edges are added, > a new search should be started from the sources of the added edges to > propagate its component ID. This will trigger a new pass of update of the > following vertices, and the updates continues until no vertices' component > ID get updated. However, if there are also edge removes, I think the > incremental computation should not be easily achieved. > > To implement the above logic on Flink, I think currently > there should be two possible methods: > 1) Use DataSet API and DataSet iteration, maintains > the graph structure and the latest computation result in a storage, and > whenever there are enough changes to the graph, submits a new DataSet job > to recompute the result. The job should load the edges, the latest > component id and whether it is the source of the newly added edges for each > graph vertex, and then start the above incremental computation logic. > 2) Flink also provide DataStream iteration API[1] that > enables iterating on the unbounded data. In this case the graph > modification should be modeled as a datastream, and some operators inside > the iteration should maintain the graph structure and current component id. > whenever there are enough changes, it starts a new pass of computation. > > Best, > Yun > > [1] Flink DataStream iteration, > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/datastream_api.html#iterations > > ------------------------------------------------------------------ > From:kant kodali <kanth...@gmail.com> > Send Time:2020 Feb. 18 (Tue.) 15:11 > To:user <user@flink.apache.org> > Subject:Can Connected Components run on a streaming dataset using iterate > delta? > > Hi All, > > I am wondering if connected components > <https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/examples.html#connected-components> > can run on a streaming data? or say incremental batch? > > I see that with delta iteration not all vertices need to participate at > every iteration which is great but in my case the graph is evolving over > time other words new edges are getting added over time. If so, does the > algorithm needs to run on the entire graph or can it simply run on the new > batch of edges? > > Finally, What is the best way to compute connected components on Graphs > evolving over time? Should I use streaming or batch or any custom > incremental approach? Also, the documentation take maxIterations as an > input. How can I come up with a good number for max iterations? and once I > come up with a good number for max Iterations is the algorithm guaranteed > to converge? > > > Thanks, > Kant > > >