Hi everyone,

I am running into a really weird problem that, to the best of my knowledge, only one other person has reported (and that thread never reached a resolution). I build a GraphX Graph from an input EdgeRDD and VertexRDD via the Graph(VertexRDD, EdgeRDD) constructor. When I call Graph.triplets on that Graph, I get wildly varying results: the triplet source and destination vertex data are inconsistent between runs and rarely, if ever, match what I would expect from the input edge pairs used to generate the VertexRDD and EdgeRDD.
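To make the comparison between input edge pairs and triplets concrete, here is a minimal sketch in plain Python rather than Spark. It assumes both the input pairs and the (source attribute, destination attribute) pairs taken from the triplets have been collected as lists of name tuples; the function names and the sample data are made up for illustration and are not taken from the actual job.

```python
from collections import Counter


def name_counts(pairs):
    """Count how often each name appears at either end of an edge."""
    counts = Counter()
    for src, dst in pairs:
        counts[src] += 1
        counts[dst] += 1
    return counts


def compare(input_pairs, triplet_pairs):
    """Compare two edge lists as multisets of (source, destination) names."""
    inp = Counter(input_pairs)
    trip = Counter(triplet_pairs)
    return {
        # True only if every pairing occurs the same number of times in both.
        "same_pairs": inp == trip,
        # True if each name occurs equally often, regardless of its partner.
        "same_name_counts": name_counts(input_pairs) == name_counts(triplet_pairs),
        # Pairings present in the input but absent from the triplets.
        "missing_from_triplets": sorted((inp - trip).elements()),
        # Pairings present in the triplets but absent from the input.
        "unexpected_in_triplets": sorted((trip - inp).elements()),
    }


# Hypothetical data: the same names appear equally often on both sides,
# but the pairings differ.
input_pairs = [("a", "b"), ("c", "d")]
triplet_pairs = [("a", "d"), ("c", "b")]

report = compare(input_pairs, triplet_pairs)
for key, value in report.items():
    print(key, value)
```

Comparing multisets rather than sorted lists keeps duplicate edges visible, and the two difference lists show exactly which pairings changed between the input and the triplets.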
Here's what I know for sure:

1. Consistency of input edge data: I read the edges in from HBase and generate a "raw edge RDD" of tuples, each consisting of a source edge name and a destination edge name. I've written this RDD out to HDFS over several runs and confirmed that its generation is deterministic.
2. Consistency of edge and vertex counts: the overall numbers of edges and vertices in the EdgeRDD and VertexRDD, respectively, are consistent between jobs.
3. Inconsistency of triplet data: the output from Graph.triplets varies between jobs, with different edge pairings from one job to the next.
4. Disconnect between input edge data and triplets: within a single job, the input edge data often does not match the corresponding triplet data, although in some cases it does. Interestingly, while the actual edge pairings in the input edge RDD and in the triplets often don't match, the total number of edges for each edge name is the same in both.

Based on what I've seen, it seems as if the vertex ids are skewed somehow. Point (4) in particular suggests this: the total number of appearances of an edge name is consistent between the input edge RDD and the triplet RDD for the same job, but the pairings with the edges on the other end of the relationship can vary.

I will post my code later tonight or tomorrow morning, but wanted to see if this problem description matches what anyone else has seen.

Thanks,
John

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Non-Deterministic-Graph-Building-tp22638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.