Design document for FLINK-2254

Saumitra Shahapure Wed, 21 Oct 2015 10:46:30 -0700

In FLINK-2254 <https://issues.apache.org/jira/browse/FLINK-2254> , we want to 
extend Graph API of Gelly by adding support for bipartite graphs too. In the 
long term, the support for newer graph types can be added in the same way. Also 
specialised algorithms for special types of graphs should be implemented.


For bipartite graph, a new class BipartiteGraph can be implemented which 
extends Graph. Other graph algorithms which are written for Graph can be used 
for BipartiteGraph too, because of inheritance. 
Initialisation functions like  public static <K, VV, EV> Graph<K, VV, EV> 
fromCollection(Collection<Vertex<K, VV>> vertices,Collection<Edge<K, EV>> 
edges, ExecutionEnvironment context) need to be duplicated as public static <K, 
VV, EV> BipartiteGraph<K, VV, EV> fromCollection(Collection<Vertex<K, VV>> 
topVertices,Collection<Vertex<K, VV>> bottomVertices, Collection<Edge<K, EV>> 
edges, ExecutionEnvironment context)
Graph validation does not need to happen implicitly. user can call 
BipartiteGraph.validate() in case he wants to check sanity of data.
Vertex modes is primary requirement. BipartiteGraph should have functions, 
getTopVertices() and getBottomVertices(). There are three ways to implement it:
Simplest and least efficient way is to maintain vertices variable in Graph in 
addition to two more Datasets topVertices, bottomVertices. Benefit of this 
approach is that Graph class would not need any modification at all. 
this.vertices variable access inside Graph class would work correctly. 
Disadvantage is that multiple copies of vertex dataset are created and 
maintained. So this is inefficient in terms of memory.
Another way is to keep topVertices and bottomVertices variables in 
BipartiteGraph. vertices variable in Graph would stay empty. getVertices() 
function of Graph would be overloaded by BipartiteGraph and reimplemented as 
union of of topVertices and bottomVertices. Since, vertices is a private 
variable, external view of vertices stays unaffected. All functions of Graph 
class which use vertices local variable (e.g. getVerticesAsTuple2()) need to 
use getVertices() instead of this.vertices. Disadvantage of this method is 
Graph.vertices variable would have invalid value throughout for BipartiteGraph.
Another way is to ‘label’ vertices with an integer. So public class 
BipartiteGraph<K,VV,EV> extends Graph<K, Tuple2<VV,Integer>, EV> would be the 
signature. Initialisers would tag the vertices according to their mode. 
getVertices() should be overridden to strip-off the tag values so that users 
would get consistent view of vertex dataset for Graph and BipartiteGraph. 
getTopVertices/getBottomVertices would be filter functions on vertex tags.
I personally feel method 2 to be most elegant.
Functions like addVertices(vertexList) are not relevant without specifying 
whether they are top or bottom. A possibility could be to add them to 
topVertices by default. And have overloaded function addVertices(vertexList, 
mode) to add them to specific partition.
Specialised algorithms for Bipartite graphs can implement new 
BipartiteGraphAlgorithm interface. It would be similar to GraphAlgorithm.
In future, newer types of graphs can be similarly derived from Graph and type 
hierarchy of Graph can be created. Class hierarchy would be better here rather 
than interface hierarchy, because the functions which are applicable to all 
derived classes can be implemented in the base class.
As a part of future refactoring, Graph transformation functions should rather 
be a new class implementing GraphAlgorithm rather than function in Graph class. 
This may make implementing complex graph types easier.
PS. Do I need to add it in wiki too? I can copy the contents to wiki if you say 
so,
-- Saumitra Shahapure

Design document for FLINK-2254

Reply via email to