Hello, I cannot process a graph with 230M edges. I cloned apache/spark, built it, and then tried it on a cluster.
I used a Spark Standalone cluster:
- 5 machines (each with 12 cores / 32 GB RAM)
- 'spark.executor.memory' == 25g
- 'spark.driver.memory' == 3g

The graph has 231,359,027 edges, and its file weighs 4,524,716,369 bytes (about 4.5 GB). The graph is represented in text format:

    <source vertex id> <destination vertex id>

My code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}
    import org.apache.spark.storage.StorageLevel

    object Canonical {

      def main(args: Array[String]) {
        val numberOfArguments = 3
        require(args.length == numberOfArguments,
          s"""Wrong argument number. Should be $numberOfArguments.
             |Usage: <path_to_graph> <partitioner_name> <minEdgePartitions>""".stripMargin)

        var graph: Graph[Int, Int] = null
        val nameOfGraph = args(0).substring(args(0).lastIndexOf("/") + 1)
        val partitionerName = args(1)
        val minEdgePartitions = args(2).toInt

        val sc = new SparkContext(new SparkConf()
          .setSparkHome(System.getenv("SPARK_HOME"))
          .setAppName(s" partitioning | $nameOfGraph | $partitionerName | $minEdgePartitions parts ")
          .setJars(SparkContext.jarOfClass(this.getClass).toList))

        // Load the edge list, spilling edges and vertices to disk when they do not fit in memory
        graph = GraphLoader.edgeListFile(sc, args(0), false,
          edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
          vertexStorageLevel = StorageLevel.MEMORY_AND_DISK,
          minEdgePartitions = minEdgePartitions)

        graph = graph.partitionBy(PartitionStrategy.fromString(partitionerName))

        // Collect the edge and vertex RDDs back to the driver and print their sizes
        println(graph.edges.collect.length)
        println(graph.vertices.collect.length)
      }
    }

After I run it I encounter a number of java.lang.OutOfMemoryError: Java heap space errors, and of course I do not get a result.

Do I have a problem in the code, or in the cluster configuration? It works fine for relatively small graphs, but for this graph it has never worked. (And I do not think that 230M edges is too much data.)

Thank you for any advice!

--
Best regards,
Hlib Mykhailenko
PhD student at INRIA Sophia-Antipolis Méditerranée
2004 Route des Lucioles BP93
06902 SOPHIA ANTIPOLIS cedex
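P.S. To be concrete about the input format, and about what I mean by "it works fine for relatively small graphs": below is a minimal self-contained sketch that writes a tiny edge list in the same "<source vertex id> <destination vertex id>" format and loads it with GraphLoader. The file contents, the object name, and the local[*] master are made up purely for illustration; the real run uses the 4.5 GB file on the standalone cluster described above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.GraphLoader

    object EdgeListFormatExample {
      def main(args: Array[String]): Unit = {
        // A tiny edge list in the same "<source id> <destination id>" text format
        val tmp = java.nio.file.Files.createTempFile("edges", ".txt")
        java.nio.file.Files.write(tmp, "1 2\n1 3\n2 3\n".getBytes)

        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("edge-list-format-example"))

        // Same loading call as in Canonical, with default storage levels and partitioning
        val graph = GraphLoader.edgeListFile(sc, tmp.toString)
        println(graph.edges.collect.length)    // 3 edges
        println(graph.vertices.collect.length) // 3 vertices
        sc.stop()
      }
    }

A run like this finishes without problems; the OutOfMemoryError only shows up with the large graph.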