[ https://issues.apache.org/jira/browse/FLINK-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291217#comment-15291217 ]
ASF GitHub Bot commented on FLINK-3780:
---------------------------------------

Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1980#discussion_r63893365

    --- Diff: docs/apis/batch/libs/gelly.md ---
    @@ -2051,6 +2052,26 @@ The algorithm takes a directed, vertex (and possibly edge) attributed graph as i
     vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
     vertex and edge in the output graph stores the common group value and the number of represented elements.

    +### Jaccard Index
    +
    +#### Overview
    +The Jaccard Index measures the similarity between vertex neighborhoods. Scores range from 0.0 (no common neighbors) to
    +1.0 (all neighbors are common).
    +
    +#### Details
    +Counting common neighbors for pairs of vertices is equivalent to counting the two-paths consisting of two edges
    +connecting the two vertices to the common neighbor. The number of distinct neighbors for pairs of vertices is computed
    +by storing the sum of degrees of the vertex pair and subtracting the count of common neighbors, which are double-counted
    +in the sum of degrees.
    +
    +The algorithm first annotates each edge with the endpoint degree. Grouping on the midpoint vertex, each pair of
    +neighbors is emitted with the endpoint degree sum. Grouping on two-paths, the common neighbors are counted.
    +
    +#### Usage
    +The algorithm takes a simple, undirected graph as input and outputs a `DataSet` of tuples containing two vertex IDs,
    +the number of common neighbors, and the number of distinct neighbors. The graph ID type must be `Comparable` and
    --- End diff --

    It does, from `Result.getJaccardIndexScore()`.
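
For readers following the "Details" text in the diff, the two-path counting it describes can be sketched outside of Flink. The snippet below is a minimal, single-machine illustration in plain Python, not the Gelly implementation: the function names `jaccard_counts` and `jaccard_score` are hypothetical, and the graph is assumed to be simple and undirected with sortable vertex IDs (loosely mirroring the `Comparable` requirement mentioned in the diff).

```python
from collections import defaultdict
from itertools import combinations


def jaccard_counts(edges):
    """Map each vertex pair (u, v), u < v, sharing a neighbor to (common, distinct).

    Hypothetical helper for illustration only; not part of the Gelly API.
    """
    # Build neighbor sets for a simple, undirected graph (self-loops ignored).
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    # Degree of every vertex, used for the endpoint degree sum.
    degree = {vertex: len(ns) for vertex, ns in neighbors.items()}

    common = defaultdict(int)
    degree_sum = {}
    # Group on the midpoint vertex: every pair of its neighbors forms a two-path.
    for midpoint, ns in neighbors.items():
        for u, v in combinations(sorted(ns), 2):
            common[(u, v)] += 1  # one two-path == one common neighbor
            degree_sum[(u, v)] = degree[u] + degree[v]

    # Common neighbors are double-counted in the degree sum, so subtract them once.
    return {pair: (c, degree_sum[pair] - c) for pair, c in common.items()}


def jaccard_score(common, distinct):
    """Jaccard index: shared neighbors divided by distinct neighbors."""
    return common / distinct


if __name__ == "__main__":
    sample_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
    for pair, (c, d) in sorted(jaccard_counts(sample_edges).items()):
        print(pair, c, d, jaccard_score(c, d))
```

Only pairs with at least one common neighbor appear in the result, which matches the issue's goal of computing all non-zero similarity scores.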
> Jaccard Similarity
> ------------------
>
>                 Key: FLINK-3780
>                 URL: https://issues.apache.org/jira/browse/FLINK-3780
>             Project: Flink
>          Issue Type: New Feature
>          Components: Gelly
>    Affects Versions: 1.1.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
>             Fix For: 1.1.0
>
>
> Implement a Jaccard Similarity algorithm computing all non-zero similarity
> scores. This algorithm is similar to {{TriangleListing}} but instead of
> joining two-paths against an edge list we count two-paths.
> {{flink-gelly-examples}} currently has {{JaccardSimilarityMeasure}} which
> relies on {{Graph.getTriplets()}} so only computes similarity scores for
> neighbors but not neighbors-of-neighbors.
> This algorithm is easily modified for other similarity scores such as
> Adamic-Adar similarity where the sum of endpoint degrees is replaced by the
> degree of the middle vertex.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)