Hi Everyone,

I am running into a really weird problem that only one other person has
reported to the best of my knowledge (and the thread never yielded a
resolution).  I build a GraphX Graph from an input EdgeRDD and VertexRDD via
the Graph(VertexRDD,EdgeRDD) constructor. When I execute Graph.triplets on
the Graph, I get wildly varying results: the triplet source and
destination vertex data are inconsistent between runs and rarely, if ever,
match what I would expect from the input edge pairs that are used to
generate the VertexRDD and EdgeRDD.

Here's what I know for sure:

1. Consistency of Input Edge Data--I read the edges in from HBase and
generate a "raw edge RDD" containing tuples consisting of a source edge name
and destination edge name. I've written this RDD out to HDFS over several
runs and confirmed that generation of the raw edge RDD is deterministic.

2. Consistency of Edge and Vertex Count--the overall numbers of edges and
vertices in the EdgeRDD and VertexRDD, respectively, are consistent between
jobs.

3. Inconsistency of Triplet Data--the output from Graph.triplets varies
between jobs: the edge pairings differ from one run to the next.

4. Disconnect Between Input Edge Data and Triplets--the input edge data
often does not match the corresponding triplet data for the same job,
although in some cases it does.  Interestingly, while the actual edge
pairings in the input edge RDD and the triplets often don't match, the
number of times each edge name appears is the same in the input edge RDD
and the triplets RDD.
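
To make the comparison in point (4) concrete, here is a rough Scala sketch
of the kind of check involved. It is illustrative only and is not the actual
job code. It assumes the vertex attribute is the name string and the edge
attribute is an Int; TripletCheck and its method names are hypothetical.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD

// Hypothetical helpers for comparing input edge pairs with Graph.triplets.
// On Spark versions before 1.3, reduceByKey also needs
// `import org.apache.spark.SparkContext._` for the pair-RDD implicits.
object TripletCheck {

  // Assumption: each vertex attribute is that vertex's name.
  def tripletPairs(graph: Graph[String, Int]): RDD[(String, String)] =
    graph.triplets.map(t => (t.srcAttr, t.dstAttr))

  // Distinct pairs that occur in `expected` but nowhere in `actual`.
  def missingPairs(expected: RDD[(String, String)],
                   actual: RDD[(String, String)]): RDD[(String, String)] =
    expected.subtract(actual).distinct()

  // How many times each name appears on either end of a pair.
  def nameCounts(pairs: RDD[(String, String)]): RDD[(String, Long)] =
    pairs
      .flatMap { case (src, dst) => Seq(src, dst) }
      .map(name => (name, 1L))
      .reduceByKey(_ + _)
}
```

With a raw edge RDD named rawEdges (also a hypothetical name), the symptom
described above would show up as missingPairs(rawEdges, tripletPairs(graph))
being non-empty while nameCounts gives the same result for both RDDs.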

Based upon what I've seen, it seems as if the vertex ids are skewed somehow,
especially given point (4) where I noted that the total number of
appearances of an edge name is consistent between input edge RDD data and
triplet RDD data for the same job but, again, the pairings with edges on the
other end of the relationship can vary.
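
One way to probe the vertex id suspicion is to check whether names and ids
map one-to-one, and whether every id used by an edge exists in the vertex
set. The sketch below is a hedged illustration under the assumption that the
vertex attribute is the name string; VertexIdCheck and its method names are
hypothetical and not taken from the actual job.

```scala
import org.apache.spark.graphx.{Edge, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical diagnostics for the name <-> vertex id mapping.
object VertexIdCheck {

  // Returns (total rows, distinct ids, distinct names). If the three
  // numbers are equal, every name has exactly one id and vice versa.
  def mappingStats(vertices: RDD[(VertexId, String)]): (Long, Long, Long) = {
    val total = vertices.count()
    val ids   = vertices.map(_._1).distinct().count()
    val names = vertices.map(_._2).distinct().count()
    (total, ids, names)
  }

  // Ids referenced by an edge endpoint that have no entry in `vertices`.
  def danglingIds(vertices: RDD[(VertexId, String)],
                  edges: RDD[Edge[Int]]): RDD[VertexId] =
    edges
      .flatMap(e => Seq(e.srcId, e.dstId))
      .distinct()
      .subtract(vertices.map(_._1))
}
```

A VertexRDD[String] and an EdgeRDD[Int] can be passed to these directly,
since they are RDDs of (VertexId, String) and Edge[Int] respectively.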

I will post my code later tonight/tomorrow AM, but wanted to see if this
problem description matches what anyone else has seen.

Thanks

--John


