I took a look at docs/configuration.md. Though I didn't find an answer to
your first question, I think the following pertains to your second question:

<tr>
  <td><code>spark.python.worker.memory</code></td>
  <td>512m</td>
  <td>
    Amount of memory to use per python worker process during aggregation,
    in the same format as JVM memory strings (e.g. <code>512m</code>,
    <code>2g</code>). If the memory used during aggregation goes above this
    amount, it will spill the data into disks.
  </td>
</tr>
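
In case it helps, here's a minimal sketch of how that property can be set
from PySpark (the "2g" value and app name are just placeholders, not
recommendations for your cluster):

```python
# Sketch: setting spark.python.worker.memory when building the SparkContext.
# The "2g" value is only a placeholder, not a tuning recommendation.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("example")
        .set("spark.python.worker.memory", "2g"))  # per-Python-worker aggregation memory

sc = SparkContext(conf=conf)
```

The same key can also go in conf/spark-defaults.conf or be passed with
--conf on spark-submit.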

On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken <carli...@janelia.hhmi.org>
wrote:

> Hello,
>
> We have an HPC cluster that we run Spark jobs on using standalone mode and
> a number of scripts I’ve built up to dynamically schedule and start spark
> clusters within the Grid Engine framework. Nodes in the cluster have 16
> cores and 128GB of RAM.
>
> My users use pyspark heavily. We’ve been having a number of problems with
> nodes going offline with extraordinarily high load. I was able to look at
> one of those nodes today before it went truly sideways, and I discovered
> that the user was running 50 pyspark.daemon threads (remember, this is a 16
> core box), and the load was somewhere around 25 or so, with all CPUs maxed
> out at 100%.
>
> So while the spark worker is aware it’s only got 16 cores and behaves
> accordingly, pyspark seems to be happy to overrun everything like crazy. Is
> there a global parameter I can use to limit pyspark threads to a sane
> number, say 15 or 16? It would also be interesting to set a memory limit,
> which leads to another question.
>
> How is memory managed when pyspark is used? I have the spark worker memory
> set to 90GB, and there is 8GB of system overhead (GPFS caching), so if
> pyspark operates outside of the JVM memory pool, that leaves it at most
> 30GB to play with, assuming there is no overhead outside the JVM’s 90GB
> heap (ha ha.)
>
> Thanks,
> Ken Carlile
> Sr. Unix Engineer
> HHMI/Janelia Research Campus
> 571-209-4363
>
>
