Hello,

I'm trying to create a GraphX Graph by calling Graph(vertices:
RDD[(VertexId, VD)], edges: RDD[Edge[ED]]): Graph[VD, ED]. I'm passing in
two RDDs: one with vertices keyed by ID, and one with edges. I make sure to
coalesce both these RDDs down to the same number of partitions beforehand;
it seems to be an unwritten precondition that the two RDDs need to have
equal numbers of partitions.
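
For concreteness, here is a minimal sketch of the setup described above. It uses placeholder data rather than the real RDDs, and the object name, method name, and local master setting are all hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphSetupSketch {
  // Builds a tiny graph from placeholder data, coalescing both input RDDs
  // down to the same number of partitions before calling Graph().
  def build(sc: SparkContext, numPartitions: Int): Graph[String, Int] = {
    // Vertices keyed by ID (illustrative values only).
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")), 6)
        .coalesce(numPartitions)

    // Edges between those vertices (illustrative values only).
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)), 6)
        .coalesce(numPartitions)

    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for illustration.
    val conf = new SparkConf().setAppName("graph-setup-sketch").setMaster("local[3]")
    val sc = new SparkContext(conf)
    try {
      val graph = build(sc, 3)
      println(s"vertices: ${graph.vertices.count()}, edges: ${graph.edges.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Note that coalesce() without a shuffle only merges existing partitions, so it can reduce the partition count but never raise it.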

That was working fine until today, when I ran across a particular
combination of RDDs that, when passed in, causes Graph() to produce a graph
where the vertices RDD and edges RDD have different numbers of partitions.
I'm not sure what's special about these particular RDDs; they both have 3
partitions going in, but internally GraphImpl apparently does this to the
vertices RDD:

val partitioner = Partitioner.defaultPartitioner(vertices)
val vPartitioned = vertices.partitionBy(partitioner)

This seems to result in the vertices RDD being condensed down to 1
partition while the edges RDD still has 3, leading to an error.
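
One way to see where the counts diverge is to compare partition counts before and after construction. This is a rough diagnostic sketch, not anything from GraphX itself: the PartitionReport object and its methods are hypothetical helpers, and the only library calls assumed are RDD.partitions, RDD.partitioner, and Partitioner.defaultPartitioner:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object PartitionReport {
  // Returns, in order: input vertex partitions, input edge partitions,
  // the partition count defaultPartitioner would choose for the vertices,
  // and the vertex and edge partition counts of the constructed graph.
  def counts[VD, ED](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      graph: Graph[VD, ED]): (Int, Int, Int, Int, Int) = {
    val chosen = Partitioner.defaultPartitioner(vertices).numPartitions
    (vertices.partitions.length,
      edges.partitions.length,
      chosen,
      graph.vertices.partitions.length,
      graph.edges.partitions.length)
  }

  // Prints the same numbers, plus any partitioner already attached to the
  // input vertices RDD, since defaultPartitioner may reuse an existing one.
  def print[VD, ED](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      graph: Graph[VD, ED]): Unit = {
    val (vIn, eIn, chosen, vOut, eOut) = counts(vertices, edges, graph)
    println(s"input vertices: $vIn partitions, partitioner = ${vertices.partitioner}")
    println(s"input edges:    $eIn partitions")
    println(s"defaultPartitioner(vertices) would use $chosen partitions")
    println(s"graph.vertices: $vOut partitions, graph.edges: $eOut partitions")
  }
}
```

Calling PartitionReport.print(vertices, edges, graph) right after Graph(vertices, edges) should show whether the mismatch is already visible in what defaultPartitioner picks for the vertices.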

So, I have two questions:

1. What, exactly, needs to be true about the RDDs that you pass to Graph()
   to be sure of constructing a valid graph? (Do they need to have the same
   number of partitions? The same number of partitions and no empty
   partitions? Do you need to repartition them with their default
   partitioners beforehand?)

2. Why does GraphImpl repartition the vertices RDD?

I'm using Spark 1.0.0-incubating-SNAPSHOT, if it helps.

Thanks,
-Adam Novak
UCSC Bioinformatics Ph.D. Student
