Hello, 

I cannot process a graph with 230M edges.
I cloned apache/spark, built it, and then tried it on a cluster.

I used a Spark Standalone cluster with the following settings (sketched as Spark properties just below):
- 5 machines (each with 12 cores / 32 GB RAM)
- spark.executor.memory = 25g
- spark.driver.memory = 3g
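
For reference, a minimal sketch of how these two values can be expressed as Spark properties (in my runs they come from the cluster configuration rather than from code; the values just mirror the list above):

import org.apache.spark.SparkConf

// Sketch only: the same memory settings expressed as Spark properties.
val conf = new SparkConf()
  .set("spark.executor.memory", "25g") // heap given to each executor JVM
  .set("spark.driver.memory", "3g")    // driver heap; only effective if set before the driver JVM starts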


The graph has 231,359,027 edges and its file weighs 4,524,716,369 bytes.
The graph is stored in plain text, one edge per line:
<source vertex id> <destination vertex id>
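
For example, a few lines of the file look like this (the vertex ids here are just illustrative):

0 5
0 731
1 5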

My code: 


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

object Canonical {

  def main(args: Array[String]) {

    val numberOfArguments = 3
    require(args.length == numberOfArguments,
      s"""Wrong number of arguments: expected $numberOfArguments.
         |Usage: <path_to_graph> <partitioner_name> <minEdgePartitions>
         |""".stripMargin)

    val nameOfGraph = args(0).substring(args(0).lastIndexOf("/") + 1)
    val partitionerName = args(1)
    val minEdgePartitions = args(2).toInt

    val sc = new SparkContext(new SparkConf()
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setAppName(s" partitioning | $nameOfGraph | $partitionerName | $minEdgePartitions parts ")
      .setJars(SparkContext.jarOfClass(this.getClass).toList))

    // Load the edge list; edges and vertices are kept at MEMORY_AND_DISK
    // so partitions that do not fit in memory can spill to disk.
    var graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, args(0),
      canonicalOrientation = false,
      minEdgePartitions = minEdgePartitions,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

    // Repartition the edges with the requested strategy.
    graph = graph.partitionBy(PartitionStrategy.fromString(partitionerName))

    // Note: collect() materializes all edges/vertices in the driver JVM before counting.
    println(graph.edges.collect.length)
    println(graph.vertices.collect.length)
  }
}
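
In case it matters, all I need from the last two lines are the edge and vertex counts, i.e. something equivalent to this sketch (not what I actually ran):

// Count without bringing the data back to the driver.
println(graph.edges.count())
println(graph.vertices.count())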



After running it I encountered a number of java.lang.OutOfMemoryError: Java heap
space errors and, of course, did not get a result.

Is the problem in my code or in the cluster configuration?

It works fine for relatively small graphs, but for this graph it has never
worked. (And I do not think that 230M edges is too much data.)




Thank you for any advice!



-- 
Best regards,
Hlib Mykhailenko 
PhD student at INRIA Sophia-Antipolis Méditerranée
2004 Route des Lucioles BP93 
06902 SOPHIA ANTIPOLIS cedex 
