Continuing to debug with Scala, I tried this locally with enough memory
(10g) and it is able to count the dataset. With more memory (for both
executor and driver) in a cluster, it still fails. The data is about
2 GB: 30k x 4k doubles.
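For reference, a minimal sketch of the local repro described above,
assuming the PySpark API of that era; the script name, app name, and use
of NumPy to generate the data are hypothetical, and driver memory has to
be set at launch:

# launch with: spark-submit --driver-memory 10g repro.py
import numpy as np
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="count-repro")

# ~30k x 4k doubles, on the order of 1 GB of raw float64 data,
# fully materialized in the driver process before parallelize
data = np.random.rand(30000, 4000).tolist()

rdd = sc.parallelize(data)
print(rdd.count())  # expected: 30000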
On Sat, Jul 12, 2014 at 6:31 PM, Aaron Davidson wrote:
I think this is probably dying on the driver itself, as you are likely
materializing the whole dataset inside your Python driver. How large is
spark_data_array compared to your driver memory?
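A quick back-of-the-envelope check in Python (row and column counts
taken from the message below; the variable names are just for
illustration):

rows, cols = 35000, 4000
raw_gb = rows * cols * 8 / 1e9   # 8 bytes per double
print(raw_gb)                    # ~1.12 GB of raw data

As a list of Python float objects, the in-driver footprint is several
times that (roughly 24 bytes per float in CPython, plus per-row list
overhead), and sc.parallelize() pickles another full copy in the driver
before shipping it to the executors.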
On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi wrote:
I put the same dataset into Scala (using spark-shell) and it acts weird.
I cannot do a count on it; the executors seem to hang. The WebUI shows
0/96 in the status bar and shows details about the worker nodes, but
there is no progress.
sc.parallelize does finish (though it takes too long for the data size)
in Scala.
spark_data_array here has about 35k rows with 4k columns. I have 4 nodes
in the cluster and gave 48g to the executors. I also tried Kryo
serialization.
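For concreteness, a sketch of those settings in PySpark (the values come
from this thread; everything else is assumed):

from pyspark import SparkConf, SparkContext

# Executor memory and Kryo as described above. Note that Kryo only
# affects serialization on the JVM side; PySpark row data is still
# pickled in Python, so it would not be expected to help here.
conf = (SparkConf()
        .set("spark.executor.memory", "48g")
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)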
Traceback (most recent call last):
File "/mohit/./m.py", line 58, in
spark_data = sc.parallelize(spark_data_array)
File "/mohit/spark/python