On Wed, May 7, 2014 at 4:44 PM, Aaron Davidson <ilike...@gmail.com> wrote:
> Spark can only run as many tasks as there are partitions, so if you don't
> have enough partitions, your cluster will be underutilized.

This is a very important point. kamatsuoka, how many partitions does your RDD have when you try to save it? You can check this with myrdd._jrdd.splits().size() in PySpark. If it's less than the number of cores in your cluster, try repartition()-ing the RDD as Aaron suggested.

Nick
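
As a rough sketch of what that looks like end to end (the paths, RDD name, and core count below are placeholders, and depending on your Spark version rdd.getNumPartitions() may also be available as a friendlier way to check):

    from pyspark import SparkContext

    sc = SparkContext(appName="RepartitionExample")

    # Hypothetical input path -- replace with your own data source
    myrdd = sc.textFile("hdfs:///path/to/input")

    # Check how many partitions the RDD currently has
    num_partitions = myrdd._jrdd.splits().size()
    num_cores = 16  # assumed total number of cores in the cluster

    # If there are fewer partitions than cores, some cores will sit idle,
    # so spread the data across at least one partition per core
    if num_partitions < num_cores:
        myrdd = myrdd.repartition(num_cores)

    # Hypothetical output path
    myrdd.saveAsTextFile("hdfs:///path/to/output")

Note that repartition() triggers a full shuffle, so it's worth doing once up front rather than repeatedly in a pipeline.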