Continuing to debug with Scala, I tried this locally with enough memory
(10g) and it is able to count the dataset. With more memory (for both
executor and driver) in a cluster, it still fails. The data is about
2 GB: 30k x 4k doubles.
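For reference, a minimal sketch of the local repro described above,
assuming the PySpark API of that era; the script name, app name, and use
of NumPy to generate the data are hypothetical, and driver memory has to
be set at launch:

# launch with: spark-submit --driver-memory 10g repro.py
import numpy as np
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="count-repro")

# ~30k x 4k doubles, on the order of 1 GB of raw float64 data,
# fully materialized in the driver process before parallelize
data = np.random.rand(30000, 4000).tolist()

rdd = sc.parallelize(data)
print(rdd.count())  # expected: 30000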
On Sat, Jul 12, 2014 at 6:31 PM, Aaron Davidson wrote:
I think this is probably dying on the driver itself, as you are likely
materializing the whole dataset inside your Python driver. How large is
spark_data_array compared to your driver memory?
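A quick back-of-the-envelope check in Python (row and column counts
taken from the message below; the variable names are just for
illustration):

rows, cols = 35000, 4000
raw_gb = rows * cols * 8 / 1e9   # 8 bytes per double
print(raw_gb)                    # ~1.12 GB of raw data

As a list of Python float objects, the in-driver footprint is several
times that (roughly 24 bytes per float in CPython, plus per-row list
overhead), and sc.parallelize() pickles another full copy in the driver
before shipping it to the executors.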
On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi wrote:
I put the same dataset into Scala (using spark-shell) and it acts weird.
I cannot do a count on it; the executors seem to hang. The WebUI shows
0/96 in the status bar and shows details about the worker nodes, but
there is no progress.
sc.parallelize does finish (though it takes too long for the data size)
in Scala.
spark_data_array here has about 35k rows with 4k columns. I have 4 nodes
in the cluster and gave 48g to the executors. I also tried Kryo
serialization.
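For concreteness, a sketch of those settings in PySpark (the values come
from this thread; everything else is assumed):

from pyspark import SparkConf, SparkContext

# Executor memory and Kryo as described above. Note that Kryo only
# affects serialization on the JVM side; PySpark row data is still
# pickled in Python, so it would not be expected to help here.
conf = (SparkConf()
        .set("spark.executor.memory", "48g")
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)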
Traceback (most recent call last):
File "/mohit/./m.py", line 58, in
spark_data = sc.parallelize(spark_data_array)
File "/mohit/spark/python