I took a look at docs/configuration.md. Though I didn't find an answer to your first question, I think the following pertains to your second question:
spark.python.worker.memory (default: 512m)
    Amount of memory to use per Python worker process during aggregation, in
    the same format as JVM memory strings (e.g. 512m, 2g). If the memory used
    during aggregation goes above this amount, it will spill the data to disk.

(A quick sketch of how that setting can be applied is below the quoted message.)

On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:

> Hello,
>
> We have an HPC cluster that we run Spark jobs on using standalone mode and
> a number of scripts I’ve built up to dynamically schedule and start Spark
> clusters within the Grid Engine framework. Nodes in the cluster have 16
> cores and 128GB of RAM.
>
> My users use pyspark heavily. We’ve been having a number of problems with
> nodes going offline with extraordinarily high load. I was able to look at
> one of those nodes today before it went truly sideways, and I discovered
> that the user was running 50 pyspark.daemon threads (remember, this is a
> 16-core box), and the load was somewhere around 25 or so, with all CPUs
> maxed out at 100%.
>
> So while the Spark worker is aware it’s only got 16 cores and behaves
> accordingly, pyspark seems to be happy to overrun everything like crazy. Is
> there a global parameter I can use to limit pyspark threads to a sane
> number, say 15 or 16? It would also be interesting to set a memory limit,
> which leads to another question.
>
> How is memory managed when pyspark is used? I have the Spark worker memory
> set to 90GB, and there is 8GB of system overhead (GPFS caching), so if
> pyspark operates outside of the JVM memory pool, that leaves it at most
> 30GB to play with, assuming there is no overhead outside the JVM’s 90GB
> heap (ha ha.)
>
> Thanks,
> Ken Carlile
> Sr. Unix Engineer
> HHMI/Janelia Research Campus
> 571-209-4363
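
For what it's worth, here is a minimal sketch of one way to apply that setting from a pyspark application. The 2g value and the app name are just illustrative placeholders, not a recommendation for your nodes:

    # Minimal sketch: raise the per-Python-worker aggregation memory cap
    # from its 512m default. The "2g" value and app name are placeholders.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("python-worker-memory-example")
            .set("spark.python.worker.memory", "2g"))  # JVM-style memory string

    sc = SparkContext(conf=conf)

Equivalently, the same key can be set cluster-wide with a single line in conf/spark-defaults.conf:

    spark.python.worker.memory  2g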