Perhaps each of the file parts consists of many HDFS blocks, so the Hadoop input format splits them into many partitions? Did you try RDD.coalesce to reduce the number of partitions? It can be done with or without a data shuffle.
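A minimal sketch of what that could look like, assuming the RDD is reloaded as in the original question (note that the second argument to objectFile is only a *minimum* partition count, so the input format may still produce far more partitions than requested):

```scala
// Reload the saved RDD; minPartitions = 8 is a lower bound, not a cap.
val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)

// Merge existing partitions down to 8 without a shuffle:
val compact = rdd2.coalesce(8)

// Or force an even redistribution at the cost of a full shuffle
// (equivalent to repartition(8)):
val balanced = rdd2.coalesce(8, shuffle = true)
```

The no-shuffle form is cheaper but can leave partitions unevenly sized; the shuffle form rebalances the data at the cost of moving it across the network.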
*Noam Barcay*
Developer // *Kenshoo*
*Office* +972 3 746-6500 *427 // *Mobile* +972 54 475-3142
*www.Kenshoo.com* <http://kenshoo.com/>

On Wed, Jan 21, 2015 at 5:31 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:

> Why does sc.objectFile(...) return an RDD with thousands of partitions?
>
> I saved an RDD to the file system using
>
> rdd.saveAsObjectFile("file:///tmp/mydir")
>
> Note that the RDD contains 7 million objects. I checked the directory /tmp/mydir/; it contains 8 partitions:
>
> part-00000 part-00002 part-00004 part-00006 _SUCCESS
> part-00001 part-00003 part-00005 part-00007
>
> I then load the RDD back using
>
> val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
>
> I expected rdd2 to have 8 partitions, but from the master UI I see that rdd2 has over 1000 partitions. This is very inefficient. How can I limit it to 8 partitions, matching what is stored on the file system?
>
> Regards,
>
> *Ningjun Wang*
> Consulting Software Engineer
> LexisNexis
> 121 Chanlon Road
> New Providence, NJ 07974-1541