Based on the Spark UI I am definitely running 25 executors. Why would you expect four? I submit the task with --num-executors 25 and get 6-7 executors running per host. Using more, smaller executors gives me better cluster utilization when running parallel Spark sessions (which is not the case in this reported issue - for now I am using the cluster exclusively).

Thanks,
Antony.
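[For reference, a minimal Scala sketch of what that submission boils down to, reconstructed from the values quoted in this thread. The app name is an assumption, and spark.executor.instances is the configuration equivalent of the --num-executors flag; the comments work through why 25 executors land 6-7 per host.]

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical reconstruction of the submission described above; only the
// numbers quoted in this thread are real, everything else is assumed.
// Each executor reserves 28G heap + 4G overhead = 32G per YARN container,
// so one 256GB host fits at most ~8 executors, and 25 executors spread
// across 4 hosts gives the observed 6-7 per host.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("als-trainimplicit")        // assumed app name
  .set("spark.executor.instances", "25")  // equivalent of --num-executors 25
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "28g")
  .set("spark.yarn.executor.memoryOverhead", "4096")
val sc = new SparkContext(conf)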
On Thursday, 19 February 2015, 11:02, Sean Owen <so...@cloudera.com> wrote:

This should result in 4 executors, not 25. They should be able to execute 4*4 = 16 tasks simultaneously. You have them grab 4*32 = 128GB of RAM, not 1TB. It still feels like this shouldn't be running out of memory, not by a long shot. But I am just pointing out potential differences between what you are expecting and what you are configuring.

On Thu, Feb 19, 2015 at 9:56 AM, Antony Mayi <antonym...@yahoo.com.invalid> wrote:
> Hi,
>
> I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running Spark
> 1.2.0 in yarn-client mode with the following layout:
>
> spark.executor.cores=4
> spark.executor.memory=28G
> spark.yarn.executor.memoryOverhead=4096
>
> I am submitting a bigger ALS trainImplicit task (rank=100, iters=15) on a
> dataset with ~3 billion ratings, using 25 executors. At some point an
> executor crashes with:
>
> 15/02/19 05:41:06 WARN util.AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:107)
>         at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)
>         at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:398)
> 15/02/19 05:41:06 ERROR executor.Executor: Exception in task 131.0 in stage 51.0 (TID 7259)
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.lang.reflect.Array.newInstance(Array.java:75)
>         at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1671)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
>         at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1707)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
>
> So the "GC overhead limit exceeded" is pretty clear and suggests running out
> of memory. Since I have 1TB of RAM available, this must rather be due to some
> suboptimal configuration.
>
> Can anyone please point me in some direction on how to tackle this?
>
> Thanks,
> Antony.
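[For anyone trying to reproduce this, a minimal Scala sketch of the job being described, against the MLlib 1.2 API. Only rank=100 and iters=15 come from the report; the input path, CSV parsing, and app name are placeholders.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Assumed context: a SparkContext configured with the executor settings
// quoted earlier in the thread.
val sc = new SparkContext(new SparkConf().setAppName("als-trainimplicit"))

// Hypothetical input: ~3 billion (user, product, rating) triples parsed
// from CSV; the path and format are placeholders, not from the thread.
val ratings = sc.textFile("hdfs:///path/to/ratings")
  .map(_.split(','))
  .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))

// The call described in the report: rank = 100, 15 iterations; the
// remaining parameters (lambda, alpha, blocks) are left at MLlib defaults.
val model = ALS.trainImplicit(ratings, 100, 15)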