Hello,
I was looking for guidelines on what value to set executor memory to
(via spark.executor.memory for example).
This seems to be important to avoid OOMs during tasks, especially in
no-swap environments (like AWS EMR clusters).
This setting is really about the executor JVM heap. Hence, in ord
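For concreteness, a minimal sketch of where that heap size gets set (the
8g value below is only a placeholder, not a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.executor.memory sizes the executor JVM heap; it must be set
    // before the SparkContext (and thus the executors) is created.
    val conf = new SparkConf()
      .setAppName("executor-heap-example")
      .set("spark.executor.memory", "8g")   // placeholder value
    val sc = new SparkContext(conf)

The same value can also be passed as --executor-memory to spark-submit.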
Hello,
I was wondering how Spark enforces using *only* X cores per
executor.
Is it simply running at most Y tasks in parallel on each executor, where
X = Y * spark.task.cpus? (This is what I understood from browsing
TaskSchedulerImpl.)
Which would mean the processing power used for "ma
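For what it's worth, a small sketch of the two settings involved (values
are purely illustrative, and whether spark.executor.cores applies depends
on your cluster manager and version):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cores-per-executor-example")
      .set("spark.executor.cores", "8")   // cores offered by each executor (X)
      .set("spark.task.cpus", "2")        // cores reserved per task
    val sc = new SparkContext(conf)

    // With these values the scheduler would run at most 8 / 2 = 4 tasks
    // concurrently per executor; this is scheduler bookkeeping, not an
    // OS-level restriction on CPU usage.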
of this RDD
Which means that when a job uses that RDD, the DAG stops at that RDD and
does not look at its parents, as it doesn't have them anymore. It is very
similar to saving your RDD and re-loading it as a "fresh" RDD.
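A rough sketch of the checkpoint API that gives this behavior (paths are
just examples):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-example"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // example directory

    val rdd = sc.textFile("hdfs:///data/input")      // example input
      .map(_.length)
    rdd.checkpoint()   // mark the RDD; it is materialized on the next action
    rdd.count()        // triggers the job and writes the checkpoint files

    // Later jobs that use `rdd` read the checkpointed data instead of
    // recomputing its parents; the lineage above it is dropped.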
On Fri, Jun 26, 2015 at 9:14 AM, Thomas Gerber
wrote:
used by seeing skipped stages in the job UI. They are
> periodically cleaned up based on available space of the configured
> spark.local.dirs paths.
>
> From: Thomas Gerber
> Date: Monday, June 29, 2015 at 10:12 PM
> To: user
> Subject: Shuffle files lifecycle
>
> Hello
Ah, for #3, maybe this is what *rdd.checkpoint* does!
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
Thomas
On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber
wrote:
> Hello,
>
> It is my understanding that shuffles are written to disk and that they
Hello,
It is my understanding that shuffles are written to disk and that they act
as checkpoints.
I wonder if this is true only within a job, or across jobs. Please note
that I use the words job and stage carefully here.
1. Can a shuffle created during JobN be used to skip many stages of
JobN+1?
Note that this problem is probably NOT caused directly by GraphX, but
GraphX reveals it because, as you go further down the iterations, you get
further and further away from a shuffle you can rely on.
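A small way to see the reuse in practice (just a sketch; the input path is
made up): run two actions over the same shuffled RDD within one
application, and the second job's map-side stage shows up as skipped in
the UI.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("skipped-stages-example"))

    val counts = sc.textFile("hdfs:///data/input")   // example input
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // produces shuffle files

    counts.count()     // Job N: runs both stages and writes the shuffle output
    counts.collect()   // Job N+1: the map-side stage is skipped because the
                       // shuffle files written by Job N are still on disk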
On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber
wrote:
> Hello,
>
> We r
ed? Not that I'll have any
> suggestions for you based on the answer, but it may help us reproduce it
> and try to fix whatever the root cause is.
>
> thanks,
> Imran
>
>
>
> On Wed, Mar 4, 2015 at 12:30 PM, Thomas Gerber
> wrote:
>
>> I meant spark
. When the total amount of reserved memory
> (not necessarily resident memory) exceeds the memory of the system it
> throws an OOM. I'm looking for material to back this up. Sorry for the
> initial vague response.
>
> Matthew
>
> On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerb
Additional notes:
I did not find anything wrong with the number of threads (ps -u USER -L |
wc -l): around 780 on the master and 400 on executors. I am running on 100
r3.2xlarge.
On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber
wrote:
> Hello,
>
> I am seeing various crashes in spark
Hello,
I am seeing various crashes in Spark on large jobs, which all share a
similar exception:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
I increased nproc (i.e. ulimit -u) 10-fold, but it do
getInt("spark.akka.heartbeat.interval", 1000)
>
> Cheers
>
> On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber
> wrote:
>
>> Also,
>>
>> I was experiencing another problem which might be related:
>> "Error communicating with MapOutputTracker"
Also,
I was experiencing another problem which might be related:
"Error communicating with MapOutputTracker" (see email in the ML today).
I just thought I would mention it in case it is relevant.
On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber
wrote:
> 1.2.1
>
> Also, I was
.
Thanks,
Thomas
On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu wrote:
> What release are you using ?
>
> SPARK-3923 went into 1.2.0 release.
>
> Cheers
>
> On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber
> wrote:
>
>> Hello,
>>
>> sometimes, in the *middle*
Hello,
Sometimes, in the *middle* of a job, the job stops (its status is then
shown as FINISHED in the master).
There isn't anything wrong in the shell/submit output.
When looking at the executor logs, I see logs like this:
15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker
acto
I meant spark.default.parallelism of course.
On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber
wrote:
> Follow up:
> We retried again, this time after *decreasing* spark.parallelism. It was
> set to 16000 before (5 times the number of cores in our cluster). It is now
> down to 6400
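In case it helps someone reading the archive later, the setting being
tuned here is set like this (the value is only a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parallelism-example")
      // Default number of partitions used by shuffles (reduceByKey, joins, ...)
      // when no explicit partition count is given.
      .set("spark.default.parallelism", "6400")   // placeholder value
    val sc = new SparkContext(conf)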
the number of tasks it can track?
On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber
wrote:
> Hello,
>
> We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge workers).
> We use spark-submit to start an application.
>
> We got the following error which leads to a fai
Hello,
I was wondering where all the log files were located on a standalone
cluster:
1. the executor logs are in the work directory on each slave machine
(stdout/stderr)
- I've noticed that GC information is in stdout, and stage information
in stderr
- *Could we get more i
Hello,
We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge workers).
We use spark-submit to start an application.
We got the following error which leads to a failed stage:
Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4
times, most recent failure: Lost task 3095.
of
disk.
So, in case someone else notices a behavior like this, make sure you check
your cluster monitor (like Ganglia).
On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber
wrote:
> Hello,
>
> I am storing RDDs with the MEMORY_ONLY_SER Storage Level, during the run
> of a big job.
>
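For context, a minimal sketch of that storage level in use (the input path
is made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("memory-only-ser-example"))
    val cached = sc.textFile("hdfs:///data/input")
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_ONLY_SER)   // stored serialized in memory, no
                                               // disk fallback: partitions that
                                               // don't fit are recomputed on demand
    cached.count()   // materializes the cache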
Hello,
In a stage with lots of tasks, I have a few tasks that have a large amount
of shuffle spill.
I scouted the web to understand shuffle spill, and I did not find any
simple explanation of the spill mechanism. What I put together is:
1. the shuffle spill can happen when the shuffle is written
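For anyone else digging into this, the spill behavior in the 1.x line is
controlled by a couple of settings; a rough sketch (defaults quoted from
memory, please double-check the docs for your exact version):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-spill-example")
      .set("spark.shuffle.spill", "true")           // allow spilling shuffle data to disk
      .set("spark.shuffle.memoryFraction", "0.2")   // fraction of the heap that in-memory
                                                    // shuffle buffers may use before spilling
    val sc = new SparkContext(conf)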