On Wed, May 7, 2014 at 4:44 PM, Aaron Davidson <ilike...@gmail.com> wrote:
> Spark can only run as many tasks as there are partitions, so if you don't
> have enough partitions, your cluster will be underutilized.

This is a very important point. kamatsuoka, how many partitions does your RDD have when you try to save it? You can check this with myrdd._jrdd.splits().size() in PySpark. If it's less than the number of cores in your cluster, try repartition()-ing the RDD as Aaron suggested.

Nick
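
As a rough sketch of what that looks like end to end (the paths, RDD name, and core count below are placeholders, and depending on your Spark version rdd.getNumPartitions() may also be available as a friendlier way to check):

    from pyspark import SparkContext

    sc = SparkContext(appName="RepartitionExample")

    # Hypothetical input path -- replace with your own data source
    myrdd = sc.textFile("hdfs:///path/to/input")

    # Check how many partitions the RDD currently has
    num_partitions = myrdd._jrdd.splits().size()
    num_cores = 16  # assumed total number of cores in the cluster

    # If there are fewer partitions than cores, some cores will sit idle,
    # so spread the data across at least one partition per core
    if num_partitions < num_cores:
        myrdd = myrdd.repartition(num_cores)

    # Hypothetical output path
    myrdd.saveAsTextFile("hdfs:///path/to/output")

Note that repartition() triggers a full shuffle, so it's worth doing once up front rather than repeatedly in a pipeline.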