Re: [PySpark] large # of partitions causes OOM

2014-09-02 Thread Matthew Farrellee
On 08/29/2014 06:05 PM, Nick Chammas wrote:

Here's a repro for PySpark:

    a = sc.parallelize(["Nick", "John", "Bob"])
    a = a.repartition(24000)
    a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)

When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is what I get:

[PySpark] large # of partitions causes OOM

2014-08-29 Thread Nick Chammas
    ...
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Is this a bug? What's going on here?

Nick

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-large-of-partitions-causes-OOM-tp13155.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
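
For anyone who wants to try this outside the original environment, below is a self-contained sketch of the repro. It assumes a local SparkContext ("local[*]" and the app name are illustrative choices, not from the original report, which ran on an EC2 cluster with Spark 1.1.0-rc2 and Python 2.7); the pipeline and the 24000-partition repartition are the same as above.

    # Self-contained sketch of the repro above (assumes a local Spark install).
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "repartition-oom-repro")

    a = sc.parallelize(["Nick", "John", "Bob"])
    # Repartitioning a tiny RDD into 24000 partitions is the step the
    # original post reports as triggering the OOM on the cluster.
    a = a.repartition(24000)
    result = a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)
    print(result)

    sc.stop()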