Vasia and I are looking for additional feedback on FLINK-3772. This ticket
[0] and PR [1] provides a set of graph algorithms which compute and store
the degree for vertices and edges.

Degree annotation is a basic component of many algorithms. For example,
PageRank requires the vertex out-degree to evenly divide the vertex score
among neighbors. The triangle enumeration algorithm in Gelly requires both
source and target degree for each edge as an optimization. The Jaccard and
Adamic-Adar similarities require the target degree for each edge.

As discussed in the ticket, Graph has methods for outDegrees, inDegrees,
and getDegrees. These are simple but difficult or impossible to use
efficiently.

Stepping back, I believe algorithms should be composable and reused where
possible, not only to improve Flink but also to support users. Implementing
algorithms as classes rather than Graph methods enables customization and
optimization such as used here.

One such optimization is CachingGraphAlgorithm which implicitly reuses
DataSets. There is overlap between algorithms. From this ticket, annotating
edges with source and target degree on an undirected graph can reuse vertex
degree. Local clustering coefficient requires a triangle listing and global
clustering coefficient requires a triangle count, there is no need to
generate the list three times.

Further optimizations include the use of mutable types, reusing sort order,
avoidance of coGroup, configurable parallelism, and not unnecessarily
materializing DataSets. I see all this as the expectation for inclusion in
Flink.

Greg

[0] https://issues.apache.org/jira/browse/FLINK-3772
[1] https://github.com/apache/flink/pull/1901/files

Reply via email to