Hi all,

I am running a 30GB Wikipedia dataset on a 7-server cluster, using the WikipediaPageRank job under example/Bagel.

My Spark version is commit bae07e3 ("fix different versions of commons-lang dependency" / apache/spark#746 addendum).

The problem is that the job fails after several stages with an OutOfMemoryError. The reason might be that the default executor memory size is 512M. I tried to raise the executor memory via

  export SPARK_JAVA_OPTS="-Dspark-cores-max=8 -Dspark.executor.memory=8g"

but SPARK_JAVA_OPTS is not recommended in Spark 1.0+, and the log also prints an ERROR from SparkConf about it.

My questions:
- Does anyone know the difference between executor memory/cores and worker memory/cores?
- How do I set the executor memory in Spark 1.0+?
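If I read the Spark 1.0 configuration and spark-submit docs correctly, the replacement seems to be passing these settings on the spark-submit command line or in conf/spark-defaults.conf, something like the sketch below (the 4g / 8-core values are just my guesses for our 8G, 8-core machines, and <path-to-examples-jar> / <args> are placeholders). Is this the right way to do it?

  # submit with executor memory / total cores set explicitly, instead of SPARK_JAVA_OPTS
  ./bin/spark-submit \
    --class org.apache.spark.examples.bagel.WikipediaPageRank \
    --master spark://192.168.1.12:7077 \
    --executor-memory 4g \
    --total-executor-cores 8 \
    <path-to-examples-jar> <args>

  # or, equivalently, in conf/spark-defaults.conf
  spark.executor.memory   4g
  spark.cores.max         8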
My spark-env.sh:

  export SPARK_WORKER_MEMORY=2g
  export SPARK_MASTER_IP=192.168.1.12
  export SPARK_MASTER_PORT=7077
  export SPARK_WORKER_CORES=2
  export SPARK_WORKER_INSTANCES=2

Each server has 8G memory and an 8-core CPU. After several stages the job fails and outputs the following logs:

14/05/19 22:29:32 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
14/05/19 22:29:32 INFO SparkDeploySchedulerBackend: Executor 10 disconnected, so removing it
14/05/19 22:29:32 ERROR TaskSchedulerImpl: Lost executor 10 on host125: remote Akka client disassociated
...
14/05/19 22:29:33 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
14/05/19 22:29:33 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(10, host125,
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:735)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
...
14/05/19 22:29:33 INFO DAGScheduler: Failed to run foreach at Bagel.scala:251
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
14/05/19 22:29:33 INFO TaskSchedulerImpl: Cancelling stage 4
14/05/19 22:29:33 INFO TaskSchedulerImpl: Stage 4 was cancelled
14/05/19 22:29:33 WARN TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Failed on local exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/192.168.1.123:54254 remote=/192.168.1.12:9000]. 59922 millis timeout left.; Host Details : local host is: "host123/192.168.1.123"; destination host is: "sing12":9000;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
...

Regards,

Wang Hao(王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240
Email: wh.s...@gmail.com