[ 
https://issues.apache.org/jira/browse/FLINK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Junghanns updated FLINK-2411:
------------------------------------
    Description: 
Graph summarization determines a structural grouping of similar vertices and 
edges to condense a graph and thus helps to uncover insights about patterns 
hidden in the graph. It can be used in OLAP-style operations on the graph and 
is similar to group by in SQL but on the graph structure instead of rows.
 
The graph summarization operator represents every vertex group by a single 
vertex in the summarized graph; edges between vertices in the summary graph 
represent a group of edges between the vertex group members of the original 
graph. Summarization is defined by specifying grouping keys for vertices and 
edges, respectively.

One publication that presents a Map/Reduce based approach is "Pagrol: Parallel 
graph olap over large-scale attributed graphs", however they pre-compute the 
graph-cube before it can be analyzed. With Flink, we can give the user an 
interactive way of summarizing the graph and do not need to compute the  cube 
beforehand.
A more complex approach focuses on summarization on graph patterns  "SynopSys: 
Large Graph Analytics in the SAP HANA Database Through Summarization".

However, I want to start with a simple algorithm that summarizes the graph on 
vertex and optionally edge values and additionally stores a count aggregate at 
summarized vertices/edges.

Consider the following two examples (e.g., social network with users from 
cities and friendships with timestamp):
 
h4. Input graph:
 
Vertices (id, value):
(0, Leipzig)
(1, Leipzig)
(2, Dresden)
(3, Dresden)
(4, Dresden)
(5, Berlin)

Edges (source, target, value):
(0, 1, 2014)
(1, 0, 2014)
(1, 2, 2013)
(2, 1, 2013)
(2, 3, 2014)
(3, 2, 2014)
(4, 0, 2013)
(4, 1, 2015)
(5, 2, 2015)
(5, 3, 2015)

h4. Output graph (summarized on vertex value):

Vertices (id, value, count)
(0, Leipzig, 2) // "2 users from Leipzig"
(2, Dresden, 3) // "3 users from Dresden"
(5, Berlin, 1) // "1 user from Berlin"

Edges (source, target, count) 
(0, 0, 2) // "2 edges between users in Leipzig"
(0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
(2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
(2, 2, 2) // "2 edges between users in Dresden"
(5, 2, 2) // "2 edges from users in Berlin to users in Dresden"

h4. Output graph (summarized on vertex and edge value):

Vertices (id, value, count)
(0, Leipzig, 2)
(2, Dresden, 3)
(5, Berlin, 1)

Edges (source, target, value, count) 
(0, 0, 2014, 2) // ...
(0, 2, 2013, 1) // ...
(2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with 
timestamp 2013"
(2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with 
timestamp 2015"
(2, 2, 2014, 2) // ...
(5, 2, 2015, 2) // ...

I've already implemented two versions of the summarization algorithm in our own 
project [Gradoop|https://github.com/dbs-leipzig/gradoop], 
which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed 
data model. You can see the current WIP here: 

1 [Abstract 
summarization|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
2 [Implementation using 
cross|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
3 [Implementation using 
joins|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
4 
[Tests|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/test/java/org/gradoop/model/impl/EPGraphSummarizeTest.java]
5 
[TestGraph|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/dev-support/social-network.pdf]

I would basically use the same implementation as in 3 in combination with 
KeySelectors to select the grouping keys on vertices and edges.

As you can see in the example, each vertex in the resulting graph has a vertex 
id that is contained in the original graph. This id is the smallest id among 
the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the 
group representative). The latter is necessary to correctly assign the 
summarized edges. Maybe there is a smarter way 
to do it of which I did not think yet.

I would like contribute this to Flink and of course, if you have any 
suggestions/improvements or do not want this at all (hopefully not), please let 
me know.

  was:
Graph summarization determines a structural grouping of similar vertices and 
edges to condense a graph and thus helps to uncover insights about patterns 
hidden in the graph. It can be used in OLAP-style operations on the graph and 
is similar to group by in SQL but on the graph structure instead of rows.
 
The graph summarization operator represents every vertex group by a single 
vertex in the summarized graph; edges between vertices in the summary graph 
represent a group of edges between the vertex group members of the original 
graph. Summarization is defined by specifying grouping keys for vertices and 
edges, respectively.

One publication that presents a Map/Reduce based approach is "Pagrol: Parallel 
graph olap over large-scale attributed graphs", however they pre-compute the 
graph-cube before it can be analyzed. With Flink, we can give the user an 
interactive way of summarizing the graph and do not need to compute the  cube 
beforehand.
A more complex approach focuses on summarization on graph patterns  "SynopSys: 
Large Graph Analytics in the SAP HANA Database Through Summarization".

However, I want to start with a simple algorithm that summarizes the graph on 
vertex and optionally edge values and additionally stores a count aggregate at 
summarized vertices/edges.

Consider the following two examples (e.g., social network with users from 
cities and friendships with timestamp):
 
h4. Input graph:
 
Vertices (id, value):
(0, Leipzig)
(1, Leipzig)
(2, Dresden)
(3, Dresden)
(4, Dresden)
(5, Berlin)

Edges (source, target, value):
(0, 1, 2014)
(1, 0, 2014)
(1, 2, 2013)
(2, 1, 2013)
(2, 3, 2014)
(3, 2, 2014)
(4, 0, 2013)
(4, 1, 2015)
(5, 2, 2015)
(5, 3, 2015)

h4. Output graph (summarized on vertex value):

Vertices (id, value, count)
(0, Leipzig, 2) // "2 users from Leipzig"
(2, Dresden, 3) // "3 users from Dresden"
(5, Berlin, 1) // "1 user from Berlin"

Edges (source, target, count) 
(0, 0, 2) // "2 edges between users in Leipzig"
(0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
(2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
(2, 2, 2) // "2 edges between users in Dresden"
(5, 2, 2) // "2 edges from users in Berlin to users in Dresden"

h4. Output graph (summarized on vertex and edge value):

Vertices (id, value, count)
(0, Leipzig, 2)
(2, Dresden, 3)
(5, Berlin, 1)

Edges (source, target, value, count) 
(0, 0, 2014, 2) // ...
(0, 2, 2013, 1) // ...
(2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with 
timestamp 2013"
(2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with 
timestamp 2015"
(2, 2, 2014, 2) // ...
(5, 2, 2015, 2) // ...

I've already implemented two versions of the summarization algorithm in our own 
project https://github.com/dbs-leipzig/gradoop, 
which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed 
data model. You can see the current WIP here: 

1 [Abstract 
summarization|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
2 [Implementation using 
cross|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
3 [Implementation using 
joins|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
4 
[Tests|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/test/java/org/gradoop/model]/impl/EPGraphSummarizeTest.java]
5 
[TestGraph|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/dev-support/social-network.pdf]

I would basically use the same implementation as in 3 in combination with 
KeySelectors to select the grouping keys on vertices and edges.

As you can see in the example, each vertex in the resulting graph has a vertex 
id that is contained in the original graph. This id is the smallest id among 
the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2 is the 
group representative). The latter is necessary to correctly assign the 
summarized edges. Maybe there is a smarter way 
to do it of which I did not think yet.

I would like contribute this to Flink and of course, if you have any 
suggestions/improvements or do not want this at all (hopefully not), please let 
me know.


> Add basic graph summarization algorithm
> ---------------------------------------
>
>                 Key: FLINK-2411
>                 URL: https://issues.apache.org/jira/browse/FLINK-2411
>             Project: Flink
>          Issue Type: New Feature
>          Components: Gelly
>    Affects Versions: master
>            Reporter: Martin Junghanns
>            Priority: Minor
>
> Graph summarization determines a structural grouping of similar vertices and 
> edges to condense a graph and thus helps to uncover insights about patterns 
> hidden in the graph. It can be used in OLAP-style operations on the graph and 
> is similar to group by in SQL but on the graph structure instead of rows.
>  
> The graph summarization operator represents every vertex group by a single 
> vertex in the summarized graph; edges between vertices in the summary graph 
> represent a group of edges between the vertex group members of the original 
> graph. Summarization is defined by specifying grouping keys for vertices and 
> edges, respectively.
> One publication that presents a Map/Reduce based approach is "Pagrol: 
> Parallel graph olap over large-scale attributed graphs", however they 
> pre-compute the graph-cube before it can be analyzed. With Flink, we can give 
> the user an interactive way of summarizing the graph and do not need to 
> compute the  cube beforehand.
> A more complex approach focuses on summarization on graph patterns  
> "SynopSys: Large Graph Analytics in the SAP HANA Database Through 
> Summarization".
> However, I want to start with a simple algorithm that summarizes the graph on 
> vertex and optionally edge values and additionally stores a count aggregate 
> at summarized vertices/edges.
> Consider the following two examples (e.g., social network with users from 
> cities and friendships with timestamp):
>  
> h4. Input graph:
>  
> Vertices (id, value):
> (0, Leipzig)
> (1, Leipzig)
> (2, Dresden)
> (3, Dresden)
> (4, Dresden)
> (5, Berlin)
> Edges (source, target, value):
> (0, 1, 2014)
> (1, 0, 2014)
> (1, 2, 2013)
> (2, 1, 2013)
> (2, 3, 2014)
> (3, 2, 2014)
> (4, 0, 2013)
> (4, 1, 2015)
> (5, 2, 2015)
> (5, 3, 2015)
> h4. Output graph (summarized on vertex value):
> Vertices (id, value, count)
> (0, Leipzig, 2) // "2 users from Leipzig"
> (2, Dresden, 3) // "3 users from Dresden"
> (5, Berlin, 1) // "1 user from Berlin"
> Edges (source, target, count) 
> (0, 0, 2) // "2 edges between users in Leipzig"
> (0, 2, 1) // "1 edge from users in Leipzig to users in Dresden"
> (2, 0, 3) // "3 edges from users in Dresden to users in Leipzig"
> (2, 2, 2) // "2 edges between users in Dresden"
> (5, 2, 2) // "2 edges from users in Berlin to users in Dresden"
> h4. Output graph (summarized on vertex and edge value):
> Vertices (id, value, count)
> (0, Leipzig, 2)
> (2, Dresden, 3)
> (5, Berlin, 1)
> Edges (source, target, value, count) 
> (0, 0, 2014, 2) // ...
> (0, 2, 2013, 1) // ...
> (2, 0, 2013, 2) // "2 edges from users in Dresden to users in Leipzig with 
> timestamp 2013"
> (2, 0, 2015, 1) // "1 edge from users in Dresden to users in Leipzig with 
> timestamp 2015"
> (2, 2, 2014, 2) // ...
> (5, 2, 2015, 2) // ...
> I've already implemented two versions of the summarization algorithm in our 
> own project [Gradoop|https://github.com/dbs-leipzig/gradoop], 
> which is a graph analytics stack on top of Hadoop + Gelly/Flink with a fixed 
> data model. You can see the current WIP here: 
> 1 [Abstract 
> summarization|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java]
> 2 [Implementation using 
> cross|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationCross.java]
> 3 [Implementation using 
> joins|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/SummarizationJoin.java]
> 4 
> [Tests|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/test/java/org/gradoop/model/impl/EPGraphSummarizeTest.java]
> 5 
> [TestGraph|https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/dev-support/social-network.pdf]
> I would basically use the same implementation as in 3 in combination with 
> KeySelectors to select the grouping keys on vertices and edges.
> As you can see in the example, each vertex in the resulting graph has a 
> vertex id that is contained in the original graph. This id is the smallest id 
> among the grouped vertices (e.g., vertices 2, 3 and 4 represent Dresden, so 2 
> is the group representative). The latter is necessary to correctly assign the 
> summarized edges. Maybe there is a smarter way 
> to do it of which I did not think yet.
> I would like contribute this to Flink and of course, if you have any 
> suggestions/improvements or do not want this at all (hopefully not), please 
> let me know.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to