Setting Executor memory

2015-09-14 Thread Thomas Gerber
Hello, I was looking for guidelines on what value to set executor memory to (via spark.executor.memory, for example). This seems to be important to avoid OOMs during tasks, especially in no-swap environments (like AWS EMR clusters). This setting is really about the executor JVM heap. Hence, in ord…
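The point the message makes — that spark.executor.memory sizes only the JVM heap, while the executor's real footprint also includes overhead — can be sketched as arithmetic. This is a hypothetical sizing helper with made-up reserve and overhead numbers (rules of thumb, not values from the thread):

```scala
// Hypothetical sizing heuristic. spark.executor.memory sets only the JVM
// heap; off-heap overhead (Spark reserves roughly an extra 10% by default)
// must also fit in node memory, or the OS may kill the process on a
// no-swap machine. All constants here are illustrative assumptions.
object ExecutorMemory {
  def suggestHeapMb(nodeMemMb: Int,
                    osReserveMb: Int,
                    executorsPerNode: Int,
                    overheadFraction: Double = 0.10): Int = {
    val perExecutor = (nodeMemMb - osReserveMb) / executorsPerNode
    // Shrink the heap so that heap + overhead fits in the executor's share.
    (perExecutor / (1.0 + overheadFraction)).toInt
  }
}
```

For example, on a ~61 GB node with one executor, `suggestHeapMb(62464, 2048, 1)` suggests about a 54 GB heap rather than the full node memory.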

Cores per executors

2015-09-09 Thread Thomas Gerber
Hello, I was wondering how Spark enforces using *only* X cores per executor. Is it simply running at most Y tasks in parallel on each executor, where X = Y * spark.task.cpus? (This is what I understood from browsing TaskSchedulerImpl.) Which would mean the processing power used for "ma…
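The scheduling arithmetic the message describes (as understood from TaskSchedulerImpl) can be written out directly. This is a minimal model, not Spark code; the function names are illustrative:

```scala
// Model of the question's hypothesis: an executor advertising C cores runs
// at most floor(C / spark.task.cpus) tasks concurrently. Spark's scheduler
// does bookkeeping on these "slots"; it does not pin tasks to physical cores.
def maxConcurrentTasks(executorCores: Int, taskCpus: Int): Int =
  executorCores / taskCpus

def clusterTaskSlots(numExecutors: Int, executorCores: Int, taskCpus: Int): Int =
  numExecutors * maxConcurrentTasks(executorCores, taskCpus)
```

So with spark.task.cpus = 2, an 8-core executor runs at most 4 tasks at once, and a 100-executor cluster with 32-core executors offers 3200 slots at the default spark.task.cpus = 1.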

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-29 Thread Thomas Gerber
of this RDD. Which means that when a job uses that RDD, the DAG stops at that RDD and does not look at its parents, as it doesn't have them anymore. It is very similar to saving your RDD and re-loading it as a "fresh" RDD. On Fri, Jun 26, 2015 at 9:14 AM, Thomas Gerber wrote: …

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
used by seeing skipped stages in the job UI. They are periodically cleaned up based on available space of the configured spark.local.dirs paths. > From: Thomas Gerber > Date: Monday, June 29, 2015 at 10:12 PM > To: user > Subject: Shuffle files lifecycle > Hello…

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Ah, for #3, maybe this is what *rdd.checkpoint* does! https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD Thomas On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber wrote: > Hello, > It is my understanding that shuffles are written on disk and that they…

Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Hello, It is my understanding that shuffles are written on disk and that they act as checkpoints. I wonder if this is true only within a job, or across jobs. Please note that I use the words job and stage carefully here. 1. can a shuffle created during JobN be used to skip many stages from JobN+1…
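The question in point 1 — whether shuffle output written by one job lets a later job skip stages — can be sketched as a toy model. This is purely illustrative and not Spark's scheduler: it just expresses the behavior discussed in the replies, where a stage is skipped whenever its shuffle output is still on disk, regardless of which job produced it:

```scala
// Toy model of shuffle reuse across jobs: map-stage output is keyed by a
// shuffle id and survives until cleaned up; a later job only re-runs the
// stages whose shuffle output is missing (those appear as "skipped stages"
// in the UI when present). Illustrative only, not Spark internals.
object ShuffleReuse {
  def stagesToRun(neededShuffles: Seq[Int], available: Set[Int]): Seq[Int] =
    neededShuffles.filterNot(available)
}
```

For instance, if JobN+1 depends on shuffles 1, 2 and 3 and JobN already produced 1 and 2, only the stage for shuffle 3 runs.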

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-26 Thread Thomas Gerber
Note that this problem is probably NOT caused directly by GraphX, but GraphX reveals it, because as you go further down the iterations, you get further and further away from a shuffle you can rely on. On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber wrote: > Hello, > We r…

Re: Error communicating with MapOutputTracker

2015-05-15 Thread Thomas Gerber
…ed? Not that I'll have any suggestions for you based on the answer, but it may help us reproduce it and try to fix whatever the root cause is. > thanks, > Imran > On Wed, Mar 4, 2015 at 12:30 PM, Thomas Gerber wrote: >> I meant spark…

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
…When the total amount of reserved memory (not necessarily resident memory) exceeds the memory of the system, it throws an OOM. I'm looking for material to back this up. Sorry for the initial vague response. > Matthew > On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerb…

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on the executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber wrote: > Hello, > I am seeing various crashes in spark…

java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
Hello, I am seeing various crashes in Spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10-fold, but it do…
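This OOM is about native thread creation, not the heap: each new thread needs a native stack and a per-user process slot (ulimit -u), so raising -Xmx does not help. A runnable sketch of the kind of check discussed in the thread, comparing the JVM's live thread count to a limit (the limit value is whatever `ulimit -u` reports; the function name is illustrative):

```scala
// Compare the JVM's live thread count against a process/thread limit.
// "unable to create new native thread" fires when a new thread cannot get
// a native stack or a process slot, independent of JVM heap settings.
import java.lang.management.ManagementFactory

def threadHeadroom(ulimitNproc: Int): Int = {
  val live = ManagementFactory.getThreadMXBean.getThreadCount
  ulimitNproc - live
}
```

If headroom is large (as the ~780/~400 counts reported later in this thread suggest), the limit being hit is more likely native memory for thread stacks than nproc itself.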

Re: Driver disassociated

2015-03-05 Thread Thomas Gerber
getInt("spark.akka.heartbeat.interval", 1000) > Cheers > On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber wrote: >> Also, >> I was experiencing another problem which might be related: >> "Error communicating with MapOutputTracker"…

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
Also, I was experiencing another problem which might be related: "Error communicating with MapOutputTracker" (see email in the ML today). I just thought I would mention it in case it is relevant. On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber wrote: > 1.2.1 > Also, I was…

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
…Thanks, Thomas On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu wrote: > What release are you using? > SPARK-3923 went into the 1.2.0 release. > Cheers > On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber wrote: >> Hello, >> sometimes, in the *middle*…

Driver disassociated

2015-03-04 Thread Thomas Gerber
Hello, sometimes, in the *middle* of a job, the job stops (its status is then seen as FINISHED in the master). There isn't anything wrong in the shell/submit output. When looking at the executor logs, I see logs like this: 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker acto…

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
I meant spark.default.parallelism, of course. On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber wrote: > Follow up: > We re-tried, this time after *decreasing* spark.parallelism. It was set > to 16000 before (5 times the number of cores in our cluster). It is now > down to 6400…
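The numbers in this thread are consistent with sizing spark.default.parallelism as a multiple of total cluster cores: 100 c3.8xlarge workers at 32 vCPUs each gives 3200 cores, so 5x is the 16000 they started with and 2x is the 6400 they decreased to. A trivial sketch of that arithmetic (the 32-vCPU figure for c3.8xlarge is an assumption from the instance type, not stated in the thread):

```scala
// spark.default.parallelism sized as a multiple of total cluster cores.
// 100 workers * 32 vCPUs = 3200 cores; 5x = 16000, 2x = 6400.
def defaultParallelism(workers: Int, coresPerWorker: Int, factor: Int): Int =
  workers * coresPerWorker * factor
```

A very high multiple means many more map/reduce task pairs for the MapOutputTracker to serve statuses for, which is why decreasing it was worth trying here.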

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
…the number of tasks it can track? On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber wrote: > Hello, > We are using Spark 1.2.1 on a very large cluster (100 c3.8xlarge workers). > We use spark-submit to start an application. > We got the following error, which leads to a fai…

Spark logs in standalone clusters

2015-03-04 Thread Thomas Gerber
Hello, I was wondering where all the log files were located on a standalone cluster: 1. the executor logs are in the work directory on each slave machine (stdout/stderr) - I've noticed that GC information is in stdout, and stage information in stderr - *Could we get more i…

Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
Hello, We are using Spark 1.2.1 on a very large cluster (100 c3.8xlarge workers). We use spark-submit to start an application. We got the following error, which leads to a failed stage: Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, most recent failure: Lost task 3095…

Re: Executors dropping all memory stored RDDs?

2015-02-24 Thread Thomas Gerber
…of disk. So, in case someone else notices a behavior like this, make sure you check your cluster monitor (like Ganglia). On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber wrote: > Hello, > I am storing RDDs with the MEMORY_ONLY_SER storage level during the run > of a big job…

Shuffle Spill

2015-02-20 Thread Thomas Gerber
Hello, in a stage with lots of tasks, I have a few tasks with a large amount of shuffle spill. I scouted the web to understand shuffle spill, and I did not find any simple explanation of the spill mechanism. What I put together is: 1. the shuffle spill can happen when the shuffle is written…
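The mechanism the message pieces together can be sketched as a toy model: during the shuffle write, records are buffered in memory, and when the buffer exceeds its memory allowance, the buffer is flushed ("spilled") to disk and cleared. This is an illustration of that behavior only; the sizes and threshold here are made up, not Spark's actual spill accounting:

```scala
// Toy model of shuffle spill: buffer record sizes in memory; when the
// buffer exceeds the memory limit, spill it to disk and start over.
// Returns how many spill files the task would produce. A task that gets a
// skewed, oversized partition spills many times while its peers spill none.
def countSpills(recordSizes: Seq[Long], memoryLimit: Long): Int = {
  var buffered = 0L
  var spills = 0
  for (size <- recordSizes) {
    buffered += size
    if (buffered > memoryLimit) { spills += 1; buffered = 0L }
  }
  spills
}
```

This also shows why only a few tasks in a stage may spill heavily: spill count depends on how much data one task buffers, not on the stage as a whole.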