There is only one executor on each worker. I see one pyspark.daemon, but
when the streaming job starts a batch I see it spawn 4 other
pyspark.daemon processes. After the batch completes, the 4 pyspark.daemon
processes die and only one is left.
I think this behavior was introduced by th…
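For anyone who wants to reproduce this observation, here is a rough sketch (assuming a Linux worker node with the ps utility available; the helper name and the 5-second polling interval are arbitrary, not from this thread) that counts live pyspark.daemon processes over time:

    # Rough sketch: poll the process table and count pyspark.daemon processes.
    # Assumes a Linux worker with ps; names and intervals are illustrative only.
    import subprocess
    import time

    def count_pyspark_daemons():
        out = subprocess.check_output(["ps", "-eo", "args"]).decode()
        return sum(1 for line in out.splitlines() if "pyspark.daemon" in line)

    while True:
        print(time.strftime("%H:%M:%S"),
              count_pyspark_daemons(), "pyspark.daemon processes")
        time.sleep(5)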
Hi Ken,
It may also be related to Grid Engine job scheduling. If the node has 16
(virtual?) cores, Grid Engine allocates 16 slots; if you use 'max' scheduling,
it will send 16 processes sequentially to the same machine, and on top of that
each Spark job has its own executors. Limit the number of jobs sc…
> … (not from each machine). If not set, the default will be
> spark.deploy.defaultCores on Spark's standalone cluster manager, or
> infinite (all available cores) on Mesos.”

*David Newberger*
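For illustration only, a sketch of how such a cap could be applied from PySpark when the application is created; the values below (4 cores total, 1 per executor) and the app name are placeholders, not recommendations from this thread:

    # Sketch: cap the total cores this application requests across the cluster
    # (spark.cores.max is cluster-wide, not per machine). Values are examples.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("limit-cores-example")
            .set("spark.cores.max", "4")          # total cores for the app
            .set("spark.executor.cores", "1"))    # cores per executor
    sc = SparkContext(conf=conf)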
From: agateaaa [mailto:agate...@gmail.com]
Sent: Wednesday, June 15, 2016 4:39 PM
To: Gene Pang
Cc: Sven Krasser; Carlile, Ken; user
Subject: Re: Limit pyspark.daemon threads
Thx Gene! But my concern is with CPU usage, not memory. I want to see if
there is any way to control the number of pyspark.daemon processes that get
spawned. We have some restrictions on the number of CPUs we can use on a node,
and the number of pyspark.daemon processes that get created doesn't seem to
honor spark.executor.cores.
As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory,
and you can then share that RDD across different jobs. If you would like to
run Spark on Alluxio, this documentation can help:
http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html
Thanks,
Gene
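A minimal sketch of the pattern Gene describes, writing an RDD to Alluxio so a separate job can read it back; the alluxio:// master host, port, and path are placeholders, and it assumes the Alluxio client is configured for the cluster:

    # Sketch: share an RDD between jobs via Alluxio. The alluxio:// URI is a
    # placeholder; the Alluxio client must be available to the Spark cluster.
    from pyspark import SparkContext

    sc = SparkContext(appName="alluxio-share-example")
    sc.parallelize(range(1000)).saveAsTextFile(
        "alluxio://alluxio-master:19998/shared/my_rdd")

    # In a different Spark job/application:
    shared = sc.textFile("alluxio://alluxio-master:19998/shared/my_rdd")
    print(shared.count())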
Hi,
I am seeing this issue too with pyspark (using Spark 1.6.1). I have set
spark.executor.cores to 1, but whenever a streaming batch starts processing
data I see the python -m pyspark.daemon processes increase gradually to about
5 (increasing CPU% on the box about 4-5 times; each pyspark.daem…
Hey Ken,
1. You're correct, cached RDDs live on the JVM heap. (There's an off-heap
storage option using Alluxio, formerly Tachyon, with which I have no
experience, however.)
2. The worker memory setting is not a hard maximum, unfortunately. What
happens is that during aggregation the Python daemon…
This is extremely helpful!
I’ll have to talk to my users about how the Python memory limit should be adjusted and what their expectations are. I’m fairly certain we bumped it up in the dark past when jobs were failing because of insufficient memory for the Python processes.
So just…
My understanding is that the spark.executor.cores setting controls the
number of worker threads in the executor JVM. Each worker thread then
communicates with a pyspark daemon process (these are not threads) to
stream data into Python. There should be one daemon process per worker
thread (bu…
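To see how many distinct Python workers actually handle tasks on each host, and compare that against spark.executor.cores, a small probe along these lines can help; the app name and partition count below are arbitrary choices for illustration:

    # Sketch: record the hostname and PID of the Python worker that runs each
    # partition, to see how many workers pyspark.daemon forks per host.
    import os
    import socket
    from pyspark import SparkContext

    sc = SparkContext(appName="python-worker-probe")

    def who_am_i(_partition_iter):
        yield (socket.gethostname(), os.getpid())

    workers = (sc.parallelize(range(32), 32)
                 .mapPartitions(who_am_i)
                 .distinct()
                 .collect())
    print(sorted(workers))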
Thanks, Sven!
I know that I’ve messed up the memory allocation, but I’m trying not to think too much about that (because I’ve advertised it to my users as “90GB for Spark works!”, and that’s how it displays in the Spark UI, totally ignoring the Python processes).
So I’ll need to deal with…
Hey Ken,
I also frequently see more pyspark daemons than the configured concurrency,
often by a low multiple. (There was an issue pre-1.3.0 that caused this
to be quite a bit higher, so make sure you at least have a recent version;
see SPARK-5395.)
Each pyspark daemon tries to stay below the configured…
Further data on this.
I’m watching another job right now where there are 16 pyspark.daemon threads, all of which are trying to get a full core (remember, this is a 16-core machine). Unfortunately, the Java process actually running the Spark worker is trying to take
several cores of its own…
No further input on this? I discovered today that the pyspark.daemon thread count was actually 48, which makes a little more sense (at least it’s a multiple of 16), and it seems to be happening at the reduce and collect portions of the code.
—Ken
On Mar 17, 2016, at 10:51 AM, Carlile,
I took a look at docs/configuration.md. Though I didn't find an answer for
your first question, I think the following pertains to your second question:

  spark.python.worker.memory    (default: 512m)
      Amount of memory to use per python worker process during aggregation,
      in the same format as JVM memory strings…
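If that limit needs adjusting, it can be set like any other Spark property when the context is created; the 2g value and app name below are just examples, not recommendations from the thread:

    # Sketch: raise the per-Python-worker aggregation memory before it spills
    # to disk. "2g" is an illustrative value only.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.python.worker.memory", "2g")
    sc = SparkContext(conf=conf, appName="python-worker-memory-example")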
Thanks! I found that part just after I sent the email… whoops. I’m guessing that’s not an issue for my users, since it’s been set that way for a couple of years now.
The thread count is definitely an issue, though, since if enough nodes go down, they can’t schedule their spark clusters.