Re: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Imran Rashid
I think you should also just be able to provide an input format that never splits the input data. This has come up before on the list, but I couldn't find it. I think this should work, but I can't try it out at the moment. Can you please try it and let us know if it works? class TextFormatNoSplit
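Imran's TextFormatNoSplit code is truncated in this archive. What follows is only a minimal sketch of the idea he describes, under the usual assumption of subclassing the old-API Hadoop TextInputFormat and overriding isSplitable; the helper loadUnsplit and the path handling are hypothetical, not his actual code:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // An input format that never splits a file, so each input file maps to
    // exactly one partition regardless of how many HDFS blocks it spans.
    class TextFormatNoSplit extends TextInputFormat {
      override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
    }

    // Hypothetical usage via the old Hadoop API entry point:
    def loadUnsplit(sc: SparkContext, path: String): RDD[String] = {
      sc.hadoopFile(path, classOf[TextFormatNoSplit], classOf[LongWritable], classOf[Text])
        .map { case (_, text) => text.toString }
    }

With an input format like this, each of the 8 files would be read as a single partition, which could then be repartitioned afterwards if more parallelism is needed.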

Re: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Sean Owen
Yes, that second argument is what I was referring to, but yes it's a *minimum*, oops, right. OK, you will want to coalesce then, indeed. On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV) wrote: > If you know that this number is too high you can request a number of partitions when you
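The second argument being discussed is the minPartitions parameter of SparkContext.objectFile, which only sets a lower bound on the number of partitions. A minimal sketch of loading and then coalescing down, with the path, element type, and target count of 8 all assumed for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch only; path, element type, and the target of 8 partitions are assumptions.
    def loadAsEightPartitions(sc: SparkContext): RDD[Array[Double]] = {
      // The second argument is a *minimum*: Hadoop may still produce far more splits.
      val raw = sc.objectFile[Array[Double]]("hdfs:///data/vectors", 8)
      // To actually reduce the count, coalesce after loading.
      raw.coalesce(8)
    }

With the default shuffle = false, coalesce merges existing partitions without moving data across the network.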

RE: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Wang, Ningjun (LNG-NPV)
> You have 8 files, not 8 partitions. It does not follow that they should be read as 8 partitions since they are presumably large and so you would be stuck using at most 8 tasks in parallel to process them. The number of partitions

Re: sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Sean Owen
You have 8 files, not 8 partitions. It does not follow that they should be read as 8 partitions, since they are presumably large and so you would be stuck using at most 8 tasks in parallel to process them. The number of partitions is determined by Hadoop input splits, which generally make a partition per block
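A back-of-the-envelope illustration of that rule, using entirely made-up file and block sizes, to show how 8 large files can turn into thousands of partitions:

    // Hypothetical sizes purely for illustration; nothing here comes from the thread.
    val files          = 8
    val bytesPerFile   = 32L * 1024 * 1024 * 1024   // assume ~32 GB per file
    val hdfsBlockBytes = 128L * 1024 * 1024          // a common HDFS block size
    val approxSplits   = files * (bytesPerFile / hdfsBlockBytes)
    // approxSplits == 2048: roughly one partition per block, not one per file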

Re: sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Noam Barcay
Maybe each of the file parts has many blocks? Did you try RDD.coalesce to reduce the number of partitions? It can be done with or without a data shuffle. *Noam Barcay* Developer // *Kenshoo*
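A short sketch of the two variants Noam mentions, with the RDD name, element type, and helper shrinkPartitions invented for illustration:

    import org.apache.spark.rdd.RDD

    // Sketch only: docs and shrinkPartitions are placeholder names.
    def shrinkPartitions(docs: RDD[String], n: Int): (RDD[String], RDD[String]) = {
      // Narrow coalesce: merges existing partitions locally, no shuffle,
      // but the resulting partitions can be uneven in size.
      val narrow = docs.coalesce(n)
      // With shuffle = true the data is redistributed evenly across n
      // partitions, at the cost of a full shuffle (same as repartition(n)).
      val shuffled = docs.coalesce(n, shuffle = true)
      (narrow, shuffled)
    }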