Hey all, I've often found that my Spark programs run much more stably with a higher number of partitions, and a lot of the graphs I deal with have a few hundred large part files. I was wondering whether a parameter in GraphLoader, defaulting to false, that sets the shuffle parameter in coalesce is something that might be added to GraphX, or whether there is a good reason for not including it? I've been using this patch myself for a couple of weeks.
-Jeff

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala b/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
index f4c7936..b2f9e9c 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
@@ -58,13 +58,14 @@ object GraphLoader extends Logging {
       canonicalOrientation: Boolean = false,
       minEdgePartitions: Int = 1,
       edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
-      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
+      shuffle: Boolean = false)
     : Graph[Int, Int] =
   {
     val startTime = System.currentTimeMillis
 
     // Parse the edge data table directly into edge partitions
-    val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions)
+    val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions, shuffle)
     val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
       val builder = new EdgePartitionBuilder[Int, Int]
       iter.foreach { line =>
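
For anyone not familiar with what the flag changes, here is a minimal sketch of the underlying RDD behaviour, separate from the patch itself. The local master, app name, data, and partition counts are all assumptions chosen for illustration; only RDD.coalesce and its shuffle argument come from Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceShuffleDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local context, just for the demonstration.
    val conf = new SparkConf().setAppName("coalesce-shuffle-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: 4 input partitions stand in for a small set of part files.
      val rdd = sc.parallelize(1 to 1000, 4)

      // shuffle defaults to false: coalesce can only merge partitions,
      // so asking for more than the current count leaves it at 4.
      val noShuffle = rdd.coalesce(16)

      // shuffle = true redistributes the data, so the count can grow to 16.
      val withShuffle = rdd.coalesce(16, shuffle = true)

      println(s"without shuffle: ${noShuffle.partitions.length}")
      println(s"with shuffle:    ${withShuffle.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

With the patch applied, passing shuffle = true to the loader would forward the same flag to the coalesce call shown in the diff, at the cost of one extra shuffle of the input lines.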