Brad: did you ever manage to figure this out? We're experiencing similar
problems, and have also found that the memory limitations supplied to the
Java side of PySpark don't limit how much memory Python can consume (which
makes sense).
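In case it's useful: since the Python workers' footprint is invisible to
the JVM settings, we've been probing them directly from inside tasks. A
minimal sketch, assuming Linux (the helper name and usage are ours, not
part of Spark):

    import resource

    def peak_rss_mb(iterator):
        # Drain the partition, then report this Python worker's peak
        # resident set size; on Linux ru_maxrss is in kilobytes.
        count = sum(1 for _ in iterator)
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
        yield (count, rss)

    # Hypothetical usage from the driver:
    #   rdd.mapPartitions(peak_rss_mb).collect()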
Have you profiled the datasets you are trying to join? Is the
I set SPARK_MEM in the driver process by setting
"spark.executor.memory" to 10G. Each machine had 32G of RAM and a
dedicated 32G spill volume. I believe all of the units are in pages,
and the page size is the standard 4K. There are 15 slave nodes in the
cluster and the sizes of the datasets I'm
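Roughly, the driver-side configuration looked like this sketch (the
master URL and app name here are placeholders, not our real values):

    from pyspark import SparkConf, SparkContext

    # Placeholder master URL and app name; the memory property is
    # the one that matters here.
    conf = (SparkConf()
            .setMaster("spark://master:7077")
            .setAppName("join-test")
            .set("spark.executor.memory", "10g"))
    sc = SparkContext(conf=conf)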
A JVM can easily be limited in how much memory it uses with the -Xmx
parameter, but Python doesn't have built-in memory limits in such a
first-class way. Maybe the memory limits aren't making it to the Python
executors.
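One blunt workaround we've considered is imposing an OS-level cap inside
the workers themselves; a sketch, assuming Linux, an arbitrary 4 GB
figure, and a hard limit that permits it (this is not a Spark setting):

    import resource

    def cap_worker_memory(iterator, cap_bytes=4 * 1024 ** 3):
        # Cap this Python process's address space; allocations past
        # the cap raise MemoryError rather than growing unbounded.
        # Assumes the existing hard limit is at least cap_bytes.
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (cap_bytes, hard))
        for record in iterator:
            yield record

    # Hypothetical usage: rdd.mapPartitions(cap_worker_memory)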
What was your SPARK_MEM setting? The JVM below seems to be using 603201
(pages).
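(If those really are 4K pages, that works out to 603201 * 4096 bytes, or
roughly 2.3 GiB, comfortably under a 10G executor heap.)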
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller wrote:
> I am running the latest version of PySpark branch-0.9 and having some
> trouble with join.
>
> One RDD is about 100GB (25GB compressed and serialized in memory)
> with 130K records; the other RDD is about 10GB (2.5GB compressed and
> serialized in memory).
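For anyone trying to reproduce this, a scaled-down sketch of the same
shape of join (the data here is a stand-in, not Brad's actual datasets):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "join-sketch")

    # Stand-in key-value RDDs: one side much larger than the other,
    # with many values per key so the shuffle output is sizable.
    big = sc.parallelize([(i % 1000, "x" * 100) for i in range(100000)], 32)
    small = sc.parallelize([(i, i * i) for i in range(1000)])

    # join() shuffles both sides by key; memory pressure in the
    # Python workers tends to surface here.
    print(big.join(small).count())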