Brad: did you ever manage to figure this out? We're experiencing similar
problems, and have also found that the memory limitations supplied to the
Java side of PySpark don't limit how much memory Python can consume (which
makes sense).
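In case it's useful: since the Python workers' footprint is invisible to
the JVM settings, we've been probing them directly from inside tasks. A
minimal sketch, assuming Linux (the helper name and usage are ours, not
part of Spark):

    import resource

    def peak_rss_mb(iterator):
        # Drain the partition, then report this Python worker's peak
        # resident set size; on Linux ru_maxrss is in kilobytes.
        count = sum(1 for _ in iterator)
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
        yield (count, rss)

    # Hypothetical usage from the driver:
    #   rdd.mapPartitions(peak_rss_mb).collect()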
Have you profiled the datasets you are trying to join? Is the
I set SPARK_MEM in the driver process by setting
"spark.executor.memory" to 10G. Each machine had 32G of RAM and a
dedicated 32G spill volume. I believe all of the units are in pages,
and the page size is the standard 4K. There are 15 slave nodes in the
cluster and the sizes of the datasets I'm
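Roughly, the driver-side configuration looked like this sketch (the
master URL and app name here are placeholders, not our real values):

    from pyspark import SparkConf, SparkContext

    # Placeholder master URL and app name; the memory property is
    # the one that matters here.
    conf = (SparkConf()
            .setMaster("spark://master:7077")
            .setAppName("join-test")
            .set("spark.executor.memory", "10g"))
    sc = SparkContext(conf=conf)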
A JVM can easily be limited in how much memory it uses with the -Xmx
parameter, but Python doesn't have built-in memory limits in such a
first-class way. Maybe the memory limits aren't making it to the Python
executors.
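One blunt workaround we've considered is imposing an OS-level cap inside
the workers themselves; a sketch, assuming Linux, an arbitrary 4 GB
figure, and a hard limit that permits it (this is not a Spark setting):

    import resource

    def cap_worker_memory(iterator, cap_bytes=4 * 1024 ** 3):
        # Cap this Python process's address space; allocations past
        # the cap raise MemoryError rather than growing unbounded.
        # Assumes the existing hard limit is at least cap_bytes.
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (cap_bytes, hard))
        for record in iterator:
            yield record

    # Hypothetical usage: rdd.mapPartitions(cap_worker_memory)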
What was your SPARK_MEM setting? The JVM below seems to be using 603201
(pages).
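(If those really are 4K pages, that works out to 603201 * 4096 bytes, or
roughly 2.3 GiB, comfortably under a 10G executor heap.)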
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller wrote:
> I am running the latest version of PySpark branch-0.9 and having some
> trouble with join.
>
> One RDD is about 100GB (25GB compressed and serialized in memory)
> with 130K records; the other RDD is about 10GB (2.5GB compressed and
> serialized in memory).
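For anyone trying to reproduce this, a scaled-down sketch of the same
shape of join (the data here is a stand-in, not Brad's actual datasets):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "join-sketch")

    # Stand-in key-value RDDs: one side much larger than the other,
    # with many values per key so the shuffle output is sizable.
    big = sc.parallelize([(i % 1000, "x" * 100) for i in range(100000)], 32)
    small = sc.parallelize([(i, i * i) for i in range(1000)])

    # join() shuffles both sides by key; memory pressure in the
    # Python workers tends to surface here.
    print(big.join(small).count())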