Hi Kant, 

              As far as I know, the current example connected components 
implementation based on the DataSet API cannot be extended to streaming data 
or incremental batch directly. 
              From the algorithm's perspective, if the graph only ever adds 
edges and never removes them, the connected components should be updatable 
incrementally when the graph changes: when some edges are added, a new search 
is started from the sources of the added edges to propagate their component 
IDs. This triggers a new pass of updates over the downstream vertices, and the 
updates continue until no vertex's component ID changes. However, if edges may 
also be removed, I think the incremental computation cannot be achieved as 
easily. 
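              To illustrate the add-only case, here is a minimal plain-Java 
sketch of that propagation (the class and method names are my own for 
illustration, not Flink API; component IDs are taken to be the minimum vertex 
ID in each component):

```java
import java.util.*;

// Sketch of add-only incremental connected components (hypothetical names,
// not a Flink API). A component's ID is the minimum vertex ID it contains.
public class IncrementalCC {
    private final Map<Integer, List<Integer>> adj = new HashMap<>();
    private final Map<Integer, Integer> componentId = new HashMap<>();

    // Add an undirected edge and re-propagate component IDs starting
    // from its endpoints; only affected vertices are revisited.
    public void addEdge(int u, int v) {
        adj.computeIfAbsent(u, k -> new ArrayList<>()).add(v);
        adj.computeIfAbsent(v, k -> new ArrayList<>()).add(u);
        componentId.putIfAbsent(u, u);
        componentId.putIfAbsent(v, v);

        // Start a new search from the endpoints of the added edge.
        Deque<Integer> queue = new ArrayDeque<>(List.of(u, v));
        while (!queue.isEmpty()) {
            int x = queue.poll();
            int cid = componentId.get(x);
            for (int y : adj.get(x)) {
                int other = componentId.get(y);
                if (other > cid) {
                    // Push the smaller ID forward and keep searching from y.
                    componentId.put(y, cid);
                    queue.add(y);
                } else if (other < cid) {
                    // x itself learns a smaller ID; revisit x's neighbors.
                    cid = other;
                    componentId.put(x, cid);
                    queue.add(x);
                }
            }
        }
    }

    public int component(int v) { return componentId.get(v); }
}
```

The search stops on its own once no component ID changes, which is the 
convergence condition described above.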
              To implement the above logic on Flink, I think there are 
currently two possible methods: 
                    1) Use the DataSet API and DataSet iteration: maintain the 
graph structure and the latest computation result in external storage, and 
whenever enough changes to the graph have accumulated, submit a new DataSet 
job to recompute the result. The job should load, for each vertex, its edges, 
its latest component ID, and whether it is the source of a newly added edge, 
and then run the incremental computation logic above. 
                    2) Flink also provides a DataStream iteration API [1] that 
enables iterating on unbounded data. In this case the graph modifications 
would be modeled as a data stream, and some operators inside the iteration 
would maintain the graph structure and the current component IDs. Whenever 
enough changes have accumulated, a new pass of computation is started.
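              As a rough illustration of option 2 (plain Java rather than the 
actual DataStream iteration API; all names here are hypothetical), modeling 
the modifications as a stream with an "enough changes" trigger could look 
like:

```java
import java.util.*;

// Hypothetical sketch: buffer streamed edge additions and start a new
// propagation pass once enough changes have accumulated. Not Flink API.
public class StreamingCCSketch {
    private final Map<Integer, Set<Integer>> adj = new HashMap<>();
    private final Map<Integer, Integer> componentId = new HashMap<>();
    private final List<int[]> pending = new ArrayList<>();
    private final int threshold;

    public StreamingCCSketch(int threshold) { this.threshold = threshold; }

    // Called for each element of the (unbounded) modification stream.
    public void onEdge(int u, int v) {
        pending.add(new int[]{u, v});
        if (pending.size() >= threshold) {
            flush();
        }
    }

    // Apply buffered edges, then propagate component IDs from their endpoints.
    public void flush() {
        Deque<Integer> frontier = new ArrayDeque<>();
        for (int[] e : pending) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
            componentId.putIfAbsent(e[0], e[0]);
            componentId.putIfAbsent(e[1], e[1]);
            frontier.add(e[0]);
            frontier.add(e[1]);
        }
        pending.clear();
        while (!frontier.isEmpty()) {
            int x = frontier.poll();
            int cid = componentId.get(x);
            for (int y : adj.getOrDefault(x, Set.of())) {
                int other = componentId.get(y);
                if (other > cid) {
                    componentId.put(y, cid);
                    frontier.add(y);
                } else if (other < cid) {
                    cid = other;
                    componentId.put(x, cid);
                    frontier.add(x);
                }
            }
        }
    }

    // Returns null for vertices not yet seen by a completed pass.
    public Integer component(int v) { return componentId.get(v); }
}
```

In a real job the maps would live in operator state inside the iteration, and 
the threshold (or a timer) would decide when a pass starts.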

    Best,
     Yun

        [1] Flink DataStream iteration, 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/datastream_api.html#iterations


------------------------------------------------------------------
From:kant kodali <kanth...@gmail.com>
Send Time:2020 Feb. 18 (Tue.) 15:11
To:user <user@flink.apache.org>
Subject:Can Connected Components run on a streaming dataset using iterate delta?

Hi All,

I am wondering if connected components can run on streaming data? Or, say, 
incremental batch?

I see that with delta iteration not all vertices need to participate in every 
iteration, which is great, but in my case the graph is evolving over time; in 
other words, new edges are getting added over time. If so, does the algorithm 
need to run on the entire graph, or can it simply run on the new batch of 
edges?

Finally, what is the best way to compute connected components on graphs 
evolving over time? Should I use streaming, batch, or some custom incremental 
approach? Also, the documentation takes maxIterations as an input. How can I 
come up with a good number for max iterations? And once I come up with a good 
number for maxIterations, is the algorithm guaranteed to converge?


Thanks,
Kant

