Hello, I'm trying to create a GraphX Graph by calling Graph(vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]]): Graph[VD, ED]. I'm passing in two RDDs: one with vertices keyed by ID, and one with edges. I make sure to coalesce both these RDDs down to the same number of partitions beforehand; it seems to be an unwritten precondition that the two RDDs need to have equal numbers of partitions.
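A minimal sketch of the setup described above, assuming a local SparkContext and made-up toy data. The object name, the `counts` helper, the master URL and the vertex/edge values are all hypothetical stand-ins, not the actual RDDs in question. It only builds the graph and reports partition counts before and after; it makes no claim about what those counts will be in any given configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object PartitionCheck {
  // Returns (input vertex partitions, input edge partitions,
  //          graph vertex partitions, graph edge partitions).
  def counts(sc: SparkContext): (Int, Int, Int, Int) = {
    // Toy data standing in for the real RDDs (hypothetical values).
    // Both are created with 6 partitions and coalesced down to 3.
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")), 6).coalesce(3)
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)), 6).coalesce(3)

    val graph = Graph(vertices, edges)

    (vertices.partitions.length, edges.partitions.length,
     graph.vertices.partitions.length, graph.edges.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder master and app name; adjust for a real cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("PartitionCheck")
    val sc = new SparkContext(conf)
    try {
      val (vIn, eIn, vOut, eOut) = counts(sc)
      println(s"input:  vertices=$vIn edges=$eIn")
      println(s"graph:  vertices=$vOut edges=$eOut")
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the two printed lines shows whether Graph() changed the partitioning of either RDD for a given pair of inputs.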
That was working fine until today, when I ran across a particular combination of RDDs that, when passed in, causes Graph() to produce a graph where the vertices RDD and edges RDD have different numbers of partitions. I'm not sure what's special about these particular RDDs; they both have 3 partitions going in, but internally GraphImpl apparently does this to the vertices RDD:

    val partitioner = Partitioner.defaultPartitioner(vertices)
    val vPartitioned = vertices.partitionBy(partitioner)

This seems to result in the vertices RDD being condensed down to 1 partition while the edges RDD still has 3, leading to an error.

So, I have two questions:

1. What, exactly, needs to be true about the RDDs that you pass to Graph() to be sure of constructing a valid graph? (Do they need to have the same number of partitions? The same number of partitions and no empty partitions? Do you need to repartition them with their default partitioners beforehand?)

2. Why does GraphImpl repartition the vertices RDD?

I'm using Spark 1.0.0-incubating-SNAPSHOT, if it helps.

Thanks,
-Adam Novak
UCSC Bioinformatics Ph.D. Student
