[ https://issues.apache.org/jira/browse/FLINK-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642480#comment-14642480 ]
Markus Holzemer commented on FLINK-2361:
----------------------------------------

I experienced the same problem when running experiments with large graphs (>10 GB). Random vertices from the solution set seem to be missing when doing a coGroup with the solution set. This occurred on about 0.001% of the vertices for me. The interesting part is that different vertices are missing in every superstep, even ones that were present in the previous superstep.

I tried to locate the source of the problem, and it looked like it is connected to the prober of the CompactingHashTable. But the source code of this hash table was too complex for me to find the problem. Since I did not need exact results for my experiments, I stopped searching.

> flatMap + distinct gives erroneous results for big data sets
> ------------------------------------------------------------
>
>                 Key: FLINK-2361
>                 URL: https://issues.apache.org/jira/browse/FLINK-2361
>             Project: Flink
>          Issue Type: Bug
>          Components: Gelly
>    Affects Versions: 0.10
>            Reporter: Andra Lungu
>
> When running the simple Connected Components algorithm (currently in Gelly)
> on the twitter follower graph, with 1, 100 or 10000 iterations, I get the
> following error:
> Caused by: java.lang.Exception: Target vertex '657282846' does not exist!.
>       at org.apache.flink.graph.spargel.VertexCentricIteration$VertexUpdateUdfSimpleVV.coGroup(VertexCentricIteration.java:300)
>       at org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:220)
>       at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
>       at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
>       at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:107)
>       at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>       at java.lang.Thread.run(Thread.java:722)
> Now this is very bizarre, as the DataSet of vertices is produced from the
> DataSet of edges, which means there cannot be an edge with an invalid
> target id. The method calls flatMap to isolate the src and trg ids and
> distinct to ensure their uniqueness.
> The algorithm works fine for smaller data sets.
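For anyone trying to narrow this down, here is a minimal sketch of the invariant the report relies on, written with plain Java streams rather than the Flink DataSet API. The class, the Edge type and the method names are all hypothetical and only illustrate the flatMap + distinct step described in the issue; this is not the Gelly code. If the vertex set is derived from the edges this way, every target id must be present in it, so a non-empty result from the check would point at the runtime (for example the solution set lookup) rather than at the input data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class VertexExtractionSketch {

    /** Hypothetical directed edge; not Gelly's Edge type. */
    static final class Edge {
        final long src;
        final long trg;

        Edge(long src, long trg) {
            this.src = src;
            this.trg = trg;
        }
    }

    /** flatMap each edge to its two endpoint ids, then deduplicate them. */
    static Set<Long> verticesFromEdges(List<Edge> edges) {
        return edges.stream()
                .flatMap(e -> Stream.of(e.src, e.trg))
                .distinct()
                .collect(Collectors.toSet());
    }

    /** Target ids with no matching vertex; expected to be empty by construction. */
    static List<Long> missingTargets(List<Edge> edges, Set<Long> vertices) {
        return edges.stream()
                .map(e -> e.trg)
                .filter(t -> !vertices.contains(t))
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Edge> edges = Arrays.asList(
                new Edge(1L, 2L), new Edge(2L, 3L), new Edge(1L, 3L));
        Set<Long> vertices = verticesFromEdges(edges);
        System.out.println("vertices: " + vertices.size());
        System.out.println("missing targets: " + missingTargets(edges, vertices));
    }
}
```

Running the same kind of check on the real vertex and edge DataSets before the iteration starts might help separate a problem in flatMap + distinct from one in the solution set hash table.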