One way to ensure Spark writes more files is to call RDD#repartition() with a higher partition count, which makes each partition smaller. One Spark partition always corresponds to one file in the underlying store, and it's usually a good idea to keep each partition somewhere between 64 MB and 256 MB. Having too few partitions leads to other problems, such as too little concurrency: Spark can only run as many tasks as there are partitions, so if you don't have enough partitions, your cluster will be underutilized.
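For example, something like the minimal sketch below (the bucket paths and the partition count of 200 are just placeholders for illustration, pick a count based on your data size and target file size):

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RepartitionExample"))

        // Hypothetical input: a large text dataset on S3.
        val records = sc.textFile("s3://my-bucket/input/")

        // Raise the partition count so each output file lands in the
        // ~64-256 MB range; 200 is an illustrative number, not a rule.
        val repartitioned = records.repartition(200)

        // Each partition becomes one file under the output prefix.
        repartitioned.saveAsTextFile("s3://my-bucket/output/")

        sc.stop()
      }
    }

Note that repartition() triggers a shuffle; if you only ever need to reduce the number of partitions, coalesce() avoids that cost.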
On Tue, May 6, 2014 at 7:07 PM, kamatsuoka <ken...@gmail.com> wrote:
> Yes, I'm using s3:// for both. I was using s3n:// but I got frustrated by
> how slow it is at writing files. In particular, the phase where it moves
> the temporary files to their permanent location takes as long as writing
> the file itself. I can't believe anyone uses this.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463p5470.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>