Further data on this. 
I’m watching another job right now where there are 16 pyspark.daemon threads, all of which are trying to get a full core (remember, this is a 16-core machine). Unfortunately, the Java process actually running the Spark worker is trying to take several cores of its own, driving the load up. I’m hoping someone has seen something like this. 

—Ken
On Mar 21, 2016, at 3:07 PM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:

No further input on this? I discovered today that the pyspark.daemon thread count was actually 48, which makes a little more sense (at least it’s a multiple of 16), and it seems to happen during the reduce and collect portions of the code. 

—Ken

On Mar 17, 2016, at 10:51 AM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:

Thanks! I found that part just after I sent the email… whoops. I’m guessing that’s not an issue for my users, since it’s been set that way for a couple of years now. 

The thread count is definitely an issue, though, since if enough nodes go down, my users can’t schedule their Spark clusters. 

—Ken
On Mar 17, 2016, at 10:50 AM, Ted Yu <yuzhih...@gmail.com> wrote:

I took a look at docs/configuration.md
Though I didn't find an answer to your first question, I think the following pertains to your second question:

spark.python.worker.memory (default: 512m)
    Amount of memory to use per Python worker process during aggregation, in the
    same format as JVM memory strings (e.g. 512m, 2g). If the memory used during
    aggregation goes above this amount, it will spill the data to disk.
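
For reference, a minimal pyspark sketch of how that setting could be passed when building the SparkContext. The app name, master URL, and the 2g value are illustrative assumptions, not recommendations, and the setting is a spill threshold rather than a hard cap:

from pyspark import SparkConf, SparkContext

# Illustrative only: raise the per-Python-worker aggregation spill threshold.
# This is a spill threshold, not a hard memory cap on the worker process.
conf = (SparkConf()
        .setAppName("python-worker-memory-example")   # hypothetical app name
        .setMaster("spark://scheduler-host:7077")     # hypothetical standalone master URL
        .set("spark.python.worker.memory", "2g"))     # default is 512m, per the docs entry above
sc = SparkContext(conf=conf)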

On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:
Hello,

We have an HPC cluster that we run Spark jobs on in standalone mode, using a number of scripts I’ve built up to dynamically schedule and start Spark clusters within the Grid Engine framework. Nodes in the cluster have 16 cores and 128 GB of RAM.

My users use pyspark heavily. We’ve been having a number of problems with nodes going offline with extraordinarily high load. I was able to look at one of those nodes today before it went truly sideways, and I discovered that the user was running 50 pyspark.daemon threads (remember, this is a 16-core box), and the load was somewhere around 25, with all CPUs maxed out at 100%.

So while the Spark worker is aware it’s only got 16 cores and behaves accordingly, pyspark seems happy to overrun everything like crazy. Is there a global parameter I can use to limit pyspark threads to a sane number, say 15 or 16? It would also be interesting to set a memory limit, which leads to another question.
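
For reference, a minimal sketch of the configuration knobs that bound how many tasks (and hence pyspark worker processes) an application runs concurrently, assuming standalone mode; the per-node ceiling also comes from SPARK_WORKER_CORES in spark-env.sh. The app name, master URL, and values are illustrative assumptions, not recommendations:

from pyspark import SparkConf, SparkContext

# Illustrative sketch: each concurrently running task is served by one Python
# worker process (reused across tasks when spark.python.worker.reuse is true,
# the default), so capping concurrent tasks should also cap the number of
# pyspark.daemon children on a node.
conf = (SparkConf()
        .setAppName("python-worker-cap-example")   # hypothetical app name
        .setMaster("spark://scheduler-host:7077")  # hypothetical standalone master URL
        .set("spark.cores.max", "16")              # cap on total cores this application may claim
        .set("spark.task.cpus", "1")               # cores reserved per task
        .set("spark.python.worker.reuse", "true")) # reuse Python workers instead of forking one per task
sc = SparkContext(conf=conf)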

How is memory managed when pyspark is used? I have the Spark worker memory set to 90 GB, and there is 8 GB of system overhead (GPFS caching), so if pyspark operates outside of the JVM memory pool, that leaves it at most 30 GB to play with, assuming there is no overhead outside the JVM’s 90 GB heap (ha ha).
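
A back-of-the-envelope sketch of that budget, using the figures above; note that spark.python.worker.memory is only an aggregation spill threshold, not a hard cap, so the Python side can still exceed it:

# Rough per-node memory budget, using the figures in the message above.
total_ram_gb       = 128   # physical RAM per node
jvm_worker_gb      = 90    # Spark worker / executor heap as configured
system_overhead_gb = 8     # GPFS caching and other system overhead

python_budget_gb = total_ram_gb - jvm_worker_gb - system_overhead_gb
print(python_budget_gb)    # 30 -> roughly what is left for pyspark worker processes

# Hypothetical check: 16 Python workers (one per core) with a 2g
# spark.python.worker.memory spill threshold would already sit at the edge of
# that budget before any spilling kicks in, and the threshold is not a hard cap.
print(16 * 2)              # 32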

Thanks,
Ken Carlile
Sr. Unix Engineer
HHMI/Janelia Research Campus
571-209-4363




